Inconsistent .values for tz-aware columns #14052


Closed
sdementen opened this issue Aug 20, 2016 · 34 comments
Labels: API Design, Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Timezones (Timezone data dtype)

Comments

@sdementen
Contributor

When getting the values of a DataFrame with multiple columns sharing the single dtype datetime64[ns, UTC], it returns a numpy array with dtype=object instead of dtype=datetime64[ns, UTC].
When returning a single column with dtype=datetime64[ns, UTC], or multiple columns of dtype=datetime64[ns] (i.e. without tz), it works.
Looking at the code, it appears that SingleBlockManager.is_mixed_type returns False in the first case (multiple datetime64[ns, UTC] columns) and True in the other two cases (a single datetime64[ns, UTC] column or multiple datetime64[ns] columns).

Code Sample, a copy-pastable example if possible

import pandas

idx = pandas.date_range("2010", "2011", tz="UTC")
df = pandas.DataFrame(index=idx)
df["A"] = idx
df["B"] = idx
print(df[["B"]].values.dtype)
print(df[["A", "B"]].values.dtype)

Actual Output

datetime64[ns, UTC]
object

Expected Output

datetime64[ns, UTC]
datetime64[ns, UTC]

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Aug 24, 2016

When you do .values on a DataFrame, you get an object which can hold all of the data. This should always be a numpy array, which is of object dtype when it cannot hold the data type directly (as is the case for multiple tz-aware columns, EVEN IF THEY ARE THE SAME tz). However for a single column, we return the underlying data, which is implemented with a DatetimeIndex. I suppose this is a bit inconsistent, but it is not likely to change in the current implementation of pandas (in pandas 2.0 things will be more consistent).

In [88]: df[['A','B']].dtypes
Out[88]: 
A    datetime64[ns, UTC]
B    datetime64[ns, UTC]
dtype: object

Here you are getting the actual type

In [90]: df[['A']].values
Out[90]: 
DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08', '2010-01-09', '2010-01-10',
               ...
               '2010-12-23', '2010-12-24', '2010-12-25', '2010-12-26', '2010-12-27', '2010-12-28', '2010-12-29', '2010-12-30', '2010-12-31', '2011-01-01'],
              dtype='datetime64[ns, UTC]', length=366, freq='D')

# this is a combination of the SAME tz-aware series as object dtype
In [91]: df[['A', 'B']].values
Out[91]: 
array([[Timestamp('2010-01-01 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-01 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-02 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-02 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-03 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-03 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-04 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-04 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-05 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-05 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-06 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-06 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-07 00:00:00+0000', tz='UTC'),
        Timestamp('2010-01-07 00:00:00+0000', tz='UTC')],
       [Timestamp('2010-01-08 00:00:00+0000', tz='UTC'),
....

@jreback jreback added this to the 2.0 milestone Aug 24, 2016
@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Timezones Timezone data dtype labels Aug 24, 2016
@jreback
Contributor

jreback commented Aug 24, 2016

@wesm we probably need some test cases to nail things like this down.

@sdementen would you like to submit a PR with some tests for this? (Just to nail down the current behavior; we will mark this as a change though.)

@jreback jreback added the Bug label Aug 24, 2016
@jreback jreback changed the title SingleBlockManager.is_mixed_type does not handle properly datetime64[ns,UTC] Inconsistent .values for tz-aware columns Aug 24, 2016
@sdementen
Contributor Author

If I understand well (correct me if I'm wrong):

  • numpy does not accept a tz in its dtypes
  • dates are converted to UTC taking into account the pandas tz (so no loss of information except for the tz itself)
  • in the example above, .values could have returned a numpy.array with dtype datetime64[ns] but would then lose the tz (but then we would have df.values == df.astype('M8[ns]').values) ==> this can be confusing for users

But then, how would you achieve consistency in pandas 2.0 if numpy does not support tz?

@jreback, I can write tests in a PR. Some questions:

  • where? pandas/tests/frame/test_values.py?
  • should the test succeed in the current setting? or should it fail if .values does not return a numpy.array? or if df["A"].values returns a DatetimeIndex instead of a numpy.array?

@jreback
Contributor

jreback commented Aug 25, 2016

dates are converted to UTC taking into account the pandas tz (so no loss of information except for the tz)

not relevant here; we are not converting to UTC, rather using object dtype to hold the data

in the example above, .values could have returned a numpy.array with dtype datetime64[ns] but would then lose the tz (but then we would have df.values == df.astype('M8[ns]').values) ==> this can be confusing for users

no, it should instead return an object array.

there are tests for datetime w/tz in tests/frame; just add them in the same file.
They should just validate the current behavior exactly. If you want to try to make this change, go for it (though that's a bit more complicated). If we don't make the change, put a TODO there.
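[Editor's note] A minimal sketch of what such a pin-down test could look like (hypothetical test name; not the actual PR code). It asserts the behavior described above: interleaving multiple tz-aware columns of the same tz yields an object array of Timestamps.

```python
# Hypothetical test sketch: multiple tz-aware columns interleave to an
# object ndarray whose elements are tz-aware Timestamps.
import pandas as pd

def test_values_multiple_tz_aware_columns():
    idx = pd.date_range("2010-01-01", periods=3, tz="UTC")
    df = pd.DataFrame({"A": idx, "B": idx})
    result = df[["A", "B"]].values
    assert result.dtype == object
    assert isinstance(result[0, 0], pd.Timestamp)
    assert result[0, 0].tz is not None

test_values_multiple_tz_aware_columns()
```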

@jreback
Contributor

jreback commented Mar 16, 2017

@sdementen can you follow up here?

sdementen added a commit to sdementen/pandas that referenced this issue Mar 16, 2017
With respect to the discussion in pandas-dev#14052, the case ["B","B"] (multiple columns of datetime64[ns, tz] with a common tz) could be improved to return an array with the same dtype as the dataframe
@sdementen
Contributor Author

@jreback is that a good test for the case? I hope I finally got the issue right ...

@jreback
Contributor

jreback commented Mar 16, 2017

yep your example from the top replicates

@jreback jreback modified the milestones: Next Major Release, 2.0 Mar 16, 2017
@sdementen
Contributor Author

So I start a PR?

@jreback
Contributor

jreback commented Mar 16, 2017

@sdementen

yes, would be appreciated.

see docs here

create a test, then see if you can track down where to put the fix.

@sdementen
Contributor Author

I was just thinking about a PR that adds the test to pandas (with a TODO item), since you wrote it would be difficult to change before pandas 2.0. OK for you?

@jreback
Contributor

jreback commented Mar 16, 2017

what I wrote about is that it's difficult to change df['A'].values vs df[['A', 'B']].values, meaning that numpy cannot hold tz-aware datetimes in a datetime64[ns] array unless they are UTC, vs. when we return an object array of Timestamps (which are tz-aware). So this part we can't do anything about ATM.

But I would take a bug fix for making df[['A']].values return an object array of Timestamps (rather than a DatetimeIndex).

@sdementen
Contributor Author

OK, now I understand it better. But I guess you mean df[['B']].values to be a numpy.ndarray with dtype == datetime64[ns], as when I print(col, type(df[col].values)) I get

(['A'], <type 'numpy.ndarray'>)
(['A', 'A'], <type 'numpy.ndarray'>)
(['A', 'B'], <type 'numpy.ndarray'>)
(['B'], <class 'pandas.tseries.index.DatetimeIndex'>)
(['B', 'B'], <type 'numpy.ndarray'>)

Side question: can a DatetimeTZBlock be multidimensional like a DatetimeBlock? Or will it always have only one dimension (making it very difficult to fix the df[['B','B']] case to return an array of datetime64[ns, US/Eastern])?

@sdementen
Contributor Author

for the case df[['B']].values, I can try to fix it

@jreback
Contributor

jreback commented Mar 16, 2017

the problem is basically here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3427

this needs a bit of logic rework. _interleave_dtype does the correct job of picking out the right dtype, but we skip this logic if we only have a single dtype, which is wrong. The reason for this (originally) is that if we only have a single dtype we can simply return it as a numpy array (avoiding a copy); this works for most dtypes, but not for categorical or datetime w/tz, which need to be packaged differently.

Side question: can a DatetimeTZBlock be multidimensional like a DatetimeBlock? Or will it always have only one dimension (making it very difficult to fix the df[['B','B']] case to return an array of datetime64[ns, US/Eastern])?

no, these are single-column only (sparse, categorical, datetime w/tz).

@sdementen
Contributor Author

Would the following fit the bill as a replacement for https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3427 ? It passes my adapted test

        if ((self._is_single_block or not self.is_mixed_type) and
                not isinstance(mgr.blocks[0], (DatetimeTZBlock, CategoricalBlock))):
            return mgr.blocks[0].get_values()
        else:
            return mgr._interleave()

sdementen added a commit to sdementen/pandas that referenced this issue Mar 16, 2017
do not bypass _interleave() if single block but of type DatetimeTZBlock or CategoricalBlock
@jreback
Contributor

jreback commented Mar 16, 2017

I would like to eliminate the whole if business and instead have this done in _interleave (with a single block not copying)

@sdementen
Contributor Author

OK, I'll continue tomorrow/later.
Keep me informed if someone else works on it in the meantime, to avoid duplicate work.

@sdementen
Contributor Author

But in fact, is it worth doing it in a cleaner way if pandas 2.0 is on the horizon?

@jreback
Contributor

jreback commented Mar 16, 2017

pandas 2.0 is on the horizon

we are at least 1 yr away

so bug fixes will continue to be expected for pandas 1.x for the foreseeable future (and even after pandas 2 is production ready).

@sdementen
Contributor Author

I have made some adaptations. Are they going in the right direction ?

@jreback
Contributor

jreback commented Mar 16, 2017

your best bet is to create a PR.

of course before you push you should have a test that fails before the change and passes afterwards. then you can iterate on the code.

@sdementen
Contributor Author

I am making progress but I am seeing failures in other tests.

I have temporarily removed the shortcut for df.values that returns an ndarray without a copy when the dtype is unique ==> so df.values always returns a new ndarray (never a 'view').

But now I see a failure in https://github.com/pandas-dev/pandas/blob/master/pandas/tests/test_generic.py#L1469, due to pct_change assuming df.values is a view (https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5786) and using np.putmask to do in-place changes in the df via df.values (https://docs.scipy.org/doc/numpy/reference/generated/numpy.putmask.html).

Is this an abusive use of df.values? Or is some code rightfully assuming that df.values sometimes returns a view and sometimes a copy?

@sdementen
Contributor Author

I guess https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5786 should be replaced by
rs.mask(mask, np.nan, inplace=True)
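[Editor's note] A small illustration of that suggestion with made-up data (not the actual pct_change code): DataFrame.mask replaces entries where the condition holds, operating on the frame itself rather than relying on .values being a writable view the way np.putmask does.

```python
import numpy as np
import pandas as pd

# mask() replaces values where the condition is True; it never depends on
# whether .values returned a view or a copy of the underlying data.
rs = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
mask = rs > 2
out = rs.mask(mask, np.nan)
print(out["a"].tolist())  # [1.0, 2.0, nan]
```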

sdementen added a commit to sdementen/pandas that referenced this issue Mar 18, 2017
@jreback
Contributor

jreback commented Mar 18, 2017

@sdementen yes this certainly could show some incorrect usages

@sdementen
Contributor Author

I feel like I'm going down a rabbit hole ...

Currently, we have the following

import pandas

d = pandas.date_range("20120101", periods=4, tz="CET")
df = pandas.DataFrame({"A": d.values, "B": d}, index=d)
print(repr(df[["B"]].values))
print(repr(df["B"].values))
print(repr(df.index.values))

that returns

DatetimeIndex(['2012-01-01 00:00:00+01:00', '2012-01-02 00:00:00+01:00',
               '2012-01-03 00:00:00+01:00', '2012-01-04 00:00:00+01:00'],
              dtype='datetime64[ns, CET]', freq='D')
array(['2011-12-31T23:00:00.000000000', '2012-01-01T23:00:00.000000000',
       '2012-01-02T23:00:00.000000000', '2012-01-03T23:00:00.000000000'], dtype='datetime64[ns]')
array(['2011-12-31T23:00:00.000000000', '2012-01-01T23:00:00.000000000',
       '2012-01-02T23:00:00.000000000', '2012-01-03T23:00:00.000000000'], dtype='datetime64[ns]')

The DatetimeIndex is the problematic one. But should this return an array of objects (as I understood from our thread), or be consistent with the other two (the Series and the index) and return datetime64[ns] with the tz scrubbed?

@jreback
Contributor

jreback commented Mar 18, 2017

you are confusing this with Series.values; that is fine and expected (it does get converted to a numpy array and hence UTC).

We are only changing the DataFrame conversion, e.g. df[['B']] or df[['A', 'B']]. These will BOTH return a 2-d array (the 1st will have (n, 1) and the 2nd of course (n, 2) dims)
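[Editor's note] On a recent pandas (where this issue is resolved) the 2-d claim can be checked directly; a quick illustration, not part of the original thread:

```python
import pandas as pd

# Both selections go through the DataFrame conversion path and come back 2-d,
# with shape (n_rows, n_cols).
idx = pd.date_range("2012-01-01", periods=4, tz="CET")
df = pd.DataFrame({"A": idx, "B": idx})
print(df[["B"]].values.shape)       # (4, 1)
print(df[["A", "B"]].values.shape)  # (4, 2)
```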

@sdementen
Contributor Author

sorry for not being that clear in my previous post

I was indeed illustrating what a Series, a DatetimeIndex and a DataFrame give as a result of their values property. In terms of API design, I would expect these 3 to be "consistent/coherent" in some way.

As both Series and DatetimeIndex return an ndarray of dtype datetime64[ns] in UTC but without the tz information (hence we lose the fact that the datetimes were tz-aware vs naive in the pandas world; but OK, given the constraints of numpy, it looks like a reasonable assumption), I would expect df[['B']] or df[['A','B']] to also return an ndarray of dtype datetime64[ns] in UTC. This would be equivalent to applying values to each column (as a Series) in the DataFrame and then concatenating the results. It would also mean that, if the DataFrame contains only datetime64[ns] and/or datetime64[ns, any tz] columns, it would be converted to an ndarray of dtype datetime64[ns].

Hence, I am now a bit far from the previous goal of df[['B']] returning an ndarray of dtype object with tz-aware Timestamps inside.

To make everything even clearer, the following script shows the 3 cases (["A"], ["B"], ["A", "B"]) for which values returns 3 different kinds of objects, as well as 2 scenarios for returning consistent values.

import numpy
from pandas import date_range, DataFrame

def to_values(df, as_dt64=False):
    return numpy.array(list(df.itertuples(index=False)),
                       dtype="<M8[ns]" if as_dt64 else object)

df = DataFrame({"A": date_range("20120101", periods=2),
                "B": date_range("20120101", periods=2, tz="CET")})

for col in [["A"], ["B"], ["A", "B"]]:
    print("Case : ", col)
    print("1. Current .values")
    print(repr(df[col].values))
    print("2. Exporting Timestamp 'object'")
    arr = to_values(df[col])
    print(repr(arr))
    print("3. Exporting datetime64[ns]")
    arr = to_values(df[col], as_dt64=True)
    print(repr(arr))
    print()

with output

Case :  ['A']
1. Current .values
array([['2012-01-01T00:00:00.000000000'],
       ['2012-01-02T00:00:00.000000000']], dtype='datetime64[ns]')
2. Exporting Timestamp 'object'
array([[Timestamp('2012-01-01 00:00:00')],
       [Timestamp('2012-01-02 00:00:00')]], dtype=object)
3. Exporting datetime64[ns]
array([['2012-01-01T00:00:00.000000000'],
       ['2012-01-02T00:00:00.000000000']], dtype='datetime64[ns]')

Case :  ['B']
1. Current .values
DatetimeIndex(['2012-01-01 00:00:00+01:00', '2012-01-02 00:00:00+01:00'], dtype='datetime64[ns, CET]', freq='D')
2. Exporting Timestamp 'object'
array([[Timestamp('2012-01-01 00:00:00+0100', tz='CET')],
       [Timestamp('2012-01-02 00:00:00+0100', tz='CET')]], dtype=object)
3. Exporting datetime64[ns]
array([['2011-12-31T23:00:00.000000000'],
       ['2012-01-01T23:00:00.000000000']], dtype='datetime64[ns]')

Case :  ['A', 'B']
1. Current .values
array([[Timestamp('2012-01-01 00:00:00'),
        Timestamp('2012-01-01 00:00:00+0100', tz='CET')],
       [Timestamp('2012-01-02 00:00:00'),
        Timestamp('2012-01-02 00:00:00+0100', tz='CET')]], dtype=object)
2. Exporting Timestamp 'object'
array([[Timestamp('2012-01-01 00:00:00'),
        Timestamp('2012-01-01 00:00:00+0100', tz='CET')],
       [Timestamp('2012-01-02 00:00:00'),
        Timestamp('2012-01-02 00:00:00+0100', tz='CET')]], dtype=object)
3. Exporting datetime64[ns]
array([['2012-01-01T00:00:00.000000000', '2011-12-31T23:00:00.000000000'],
       ['2012-01-02T00:00:00.000000000', '2012-01-01T23:00:00.000000000']], dtype='datetime64[ns]')

As this can lead to backward-incompatible changes, I am not sure which option is best, but probably the to_values(df, as_dt64=True) interpretation is the most consistent with Series and DatetimeIndex.
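[Editor's note] For reference, the tz-scrubbing interpretation (option 3 above) can also be expressed per column with the public API; a sketch, not the thread's proposed implementation:

```python
import pandas as pd

# Normalize a tz-aware column to UTC, then drop the tz: the result is a
# plain datetime64[ns] Series, matching the as_dt64=True interpretation.
ser = pd.Series(pd.date_range("2012-01-01", periods=2, tz="CET"))
naive_utc = ser.dt.tz_convert("UTC").dt.tz_localize(None)
print(naive_utc.dtype)  # datetime64[ns]
```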


@jreback
Contributor

jreback commented Mar 19, 2017

we are leaving the 1-d case alone; please focus on this specific issue

if you want to propose making Series.values for datetime w/tz return an object array (so we preserve the tz), that would be fine, but that is separate

(I mean you could do it in the PR, but it should not be the primary goal)

@sdementen
Contributor Author

sdementen commented Mar 19, 2017

I just noticed that my branch name was very misleading, as I wrote "for series with datetime64" (which explains some of your comments...) while I have only worked on the DataFrame part.
I have renamed the branch to make this clear.

@sdementen
Contributor Author

I think the main issue (the rabbit-hole effect) is that we want to change the signature of the function: not only the type of the returned object (DatetimeIndex to ndarray) but also its shape, as shown below. The following example

df = DataFrame({"A": date_range("20120101", periods=2),
                "B": date_range("20120101", periods=2, tz="CET"),
                })

print(df[["A"]].values.shape)
print(df[["B"]].values.shape)

outputs in pandas 0.19.2

(2, 1)
(2,)

and if we fix the issue, it will output

(2, 1)
(2, 1)

so any code, within or outside pandas, that relies on the shape being (2,) will break.

Am I correct in stating this? Or did I misunderstand something?

@sdementen
Contributor Author

@jreback any feedback on the comment above about the shape of the results?

@jreback
Contributor

jreback commented Mar 20, 2017

I commented on your PR; it is easier to discuss actual code.

Yes, the shape will change; it was wrong before. (That will be noted in the whatsnew entry.)

@mroeschke
Member

I believe this is fixed in master now that we enforce that datetime-with-tz dtypes are converted to object arrays of Timestamps

In [7]: df[["A", "B"]].values[:5]
Out[7]:
array([[Timestamp('2010-01-01 00:00:00+0000', tz='UTC', freq='D'),
        Timestamp('2010-01-01 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-02 00:00:00+0000', tz='UTC', freq='D'),
        Timestamp('2010-01-02 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-03 00:00:00+0000', tz='UTC', freq='D'),
        Timestamp('2010-01-03 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-04 00:00:00+0000', tz='UTC', freq='D'),
        Timestamp('2010-01-04 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-05 00:00:00+0000', tz='UTC', freq='D'),
        Timestamp('2010-01-05 00:00:00+0000', tz='UTC', freq='D')]],
      dtype=object)

In [8]: df[['B']].values[:5]
Out[8]:
array([[Timestamp('2010-01-01 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-02 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-03 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-04 00:00:00+0000', tz='UTC', freq='D')],
       [Timestamp('2010-01-05 00:00:00+0000', tz='UTC', freq='D')]],
      dtype=object)

In [9]: pd.__version__
Out[9]: '0.25.0rc0'
