DataFrame.values not a 2D-array when constructed from timezone-aware datetimes #13407

Closed
aburgm opened this issue Jun 9, 2016 · 4 comments · Fixed by #20422
Labels
Bug, Dtype Conversions, Reshaping, Timezones

Comments

aburgm commented Jun 9, 2016

When a DataFrame column is constructed from timezone-aware datetime objects, its values attribute returns a pandas.DatetimeIndex instead of a 2D numpy array. This is problematic because the datetime index does not support all operations that a numpy array does.

Code Sample, a copy-pastable example if possible

import datetime
import dateutil.tz
import pandas

df = pandas.DataFrame()
df['Time'] = [datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())]
df.dropna(axis=0)  # raises ValueError: 'axis' entry is out of bounds

Also, df.values returns DatetimeIndex(['2015-01-01'], dtype='datetime64[ns, UTC]', freq=None) instead of a 2D numpy array.

Expected Output

The df.dropna call should be a no-op.

Compare this to the case when constructed using df['Time'] = [datetime.datetime(2015,1,1)]. In that case, df.dropna works as expected, and df.values is array([['2014-12-31T16:00:00.000000000-0800']], dtype='datetime64[ns]').
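
For illustration, a minimal side-by-side sketch of the two constructions (the variable names are ad hoc; the behaviour is as observed on pandas 0.18.x, and the exact reprs depend on the local timezone and installed versions):

import datetime
import dateutil.tz
import pandas

# tz-naive column: .values is a 2-D numpy array, dropna works
naive = pandas.DataFrame()
naive['Time'] = [datetime.datetime(2015, 1, 1)]
print(type(naive.values))   # numpy.ndarray, shape (1, 1)
naive.dropna(axis=0)        # no-op, as expected

# tz-aware column: on pandas 0.18.x .values comes back 1-D
aware = pandas.DataFrame()
aware['Time'] = [datetime.datetime(2015, 1, 1, tzinfo=dateutil.tz.tzutc())]
print(type(aware.values))   # DatetimeIndex on 0.18.x, not a 2-D ndarray
aware.dropna(axis=0)        # raises ValueError: 'axis' entry is out of bounds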

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.0.2
setuptools: 20.1.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
jreback commented Jun 9, 2016

This must be short-circuiting on the 1-d case in lcd_types. If you have an additional column it works:

In [10]: df['foo'] = 'bar'

In [11]: df.values
Out[11]: array([[Timestamp('2015-01-01 00:00:00+0000', tz='UTC'), 'bar']], dtype=object)

Note that using .values is pretty inefficient, and numpy does not support these extended dtypes anyway.

jreback commented Jun 9, 2016

> This is problematic because the datetime index does not support all operations that a numpy array does.

This statement is nonsensical, as numpy barely understands timezones. And the Index API is a superset of 1-d numpy operations.

jreback added the Bug, Reshaping, Dtype Conversions, and Timezones labels Jun 9, 2016
jreback added this to the Next Major Release milestone Jun 9, 2016
aburgm commented Jun 9, 2016

Agreed; I did not phrase that properly. What I meant is that some operations on a pandas DataFrame, such as dropna along axis 0, fail, apparently because the returned values are a 1-D structure where a 2-D structure was expected.
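
To make the dimensionality point concrete, a small check (a sketch based on the 0.18.x behaviour reported above, using the df from the original example):

vals = df.values
# On 0.18.x, a frame whose only column is tz-aware hands back a 1-D
# DatetimeIndex here, so shape-dependent paths such as dropna(axis=0)
# see ndim == 1 where they expect ndim == 2.
print(vals.ndim)    # 1 on 0.18.x for the tz-aware frame; 2 otherwise
print(vals.shape)   # (1,) instead of (1, 1)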

npdoty commented Feb 11, 2018

Spent a long time debugging this today. It is very surprising to have dropna throw an AxisError exception in numpy when the column happens to be datetime64[ns, UTC] but not when it's datetime64[ns].

As a workaround, I believe users can use df[df.column_name.notnull()] (as described on StackOverflow) instead of dropna(subset=['column_name']).
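
A sketch of that workaround, assuming 'Time' is the affected tz-aware column from the example above:

# Keep rows where the tz-aware column is not NaT, sidestepping dropna's
# shape-sensitive code path on 0.18.x.
clean = df[df['Time'].notnull()]

# Equivalent intent, but trips the bug on affected versions:
# clean = df.dropna(subset=['Time'])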

jreback modified the milestones: Next Major Release → 0.23.0 on Mar 25, 2018