-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Inconsistent .values for tz-aware columns #14052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
When you so
Here you are getting the actual type
|
@wesm prob need some tests cases to nail things like this down. @sdementen would you like to submit a PR's with some tests for this? (just to nail down the current behavior; we will mark this as a change though). |
If I understand well (correct me if wrong:
But then, how would you see then consistency in pandas 2.0 if numpy does not support tz ? @jreback, I can write tests in a PR. some questions:
|
not relevant here, we are not converting to UTC, rather using
no, should instead return an there are tests for datetime w/tz in tests/frame just add them in the same file. |
@sdementen can you follow up here? |
With respect to the discussion in pandas-dev#14052, the case ["B","B"] (multiple columns of datetime64[ns, tz] with a common tz) could be improved to return an array with the same dtype as the dataframe
@jreback is that a good test for the case ? I hope I got the issue right finally ... |
yep your example from the top replicates |
So I start a PR? |
yes, would be appreciated. see docs here create a test, then see if you can track down where to put the fix. |
I was just thinking about the PR to add the test to pandas (with a todo item) as you wrote it would be difficult before pandas 2.0. OK for you ? |
what I wrote about is that its difficult to change But would take a bug fix for making |
ok, now I understand it better. But I guess you mean
Side question, can a DatetimeTZBlock be multidimensional like DatetimeBlock ? or will it always have only one dimension (and therefore, very difficult to fix the case for df[['B','B']] to return an array of datetime64[ns,US/Eastern]) ? |
for the case |
the problem is basically here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3427 this needs a bit of logic rework.
no these are single column only (sparse, categorical, datetime w/tz). |
would the following fit the bill to replace https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals.py#L3427 ? it passes my adapted test
|
do not bypass _interleave() if single block but of type DatetimeTZBlock or CategoricalBlock
I would like to eliminate whole if bizness and instead have this done in interleave (with a single block not copying) |
Ok, i'll continue tomorrow/later. |
But on fact, is it worth to do it in a cleaner way if pandas 2.0 is on the horizon ? |
we are at least 1 yr away so bug fixes will continue to be expected for pandas 1 for the foreseable future (and even after pandas 2 is production ready). |
I have made some adaptations. Are they going in the right direction ? |
your best bet is to create a PR. of course before you push you should have a test that fails before the change and passes afterwards. then you can iterate on the code. |
I am making progress but I am having failures in other tests. I have temporarily removed the shortcut for df.values that returns an ndarray without copy when the dtype is unique ==> so df.values always returns a new ndarray (never a 'view') But now I see a failure in https://github.com/pandas-dev/pandas/blob/master/pandas/tests/test_generic.py#L1469 due to pct_change assuming df.values is a view https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5786 and using np.putmask to does inplace changes in the df via df.values (https://docs.scipy.org/doc/numpy/reference/generated/numpy.putmask.html). Is this an abusive use of df.values ? or is some code assuming rightfully that sometimes df.values returns a view and sometimes a copy ? |
I guess https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5786 should be replaced by |
@sdementen yes this certainly could show some incorrect usages |
I feel like going down the rabbit hole ... Currently, we have the following
that returns
The DatetimeIndex is the problematic one. But should this return an array of object (as I understood from our thread) or be consistent with the other two (series and dti) and return a datetime64[ns] but with tz scrapped ? |
you are confusing a We are only changing a DataFrame conversion, e.g. |
sorry to be not that clear in my previous post I was illustrating indeed what Series, a DatetimeIndex and a DataFrame were giving as a result of using their property Hence, I am a bit far from the previous goal of df[['B']] returning an ndarray of dtype object with tz-aware Timestamp inside. To make everything even clearer, the following script shows the 3 cases (["A"], ["B"], ["A", "B"]) for which values return 3 different kind of objects, as well as 2 scenarios to return consistent values.
with output
As this can lead to backward incompatible changes, I am not sure which option is the best but probably the |
@jreback FYI, the branch I am using to work on the issue is https://github.com/sdementen/pandas/tree/value-returns-ndarray-for-series-with-datetime64-tz |
we r leaving the 1d case alone pls focus on this specific issue if you want to propose making a Series.values for dt w/tz aware return an object based (so we preserve the tz) that would be fine but is separate (i mean u could do it in the pr but that should not be he primary goal) |
I just noticed that my branch name was very misleading as I wrote "for series with datetime64" (and I understand now some of your comments...) but I have only worked on the dataframe part. |
I think the main issue (rabbit hole effect) with this issue is that we want to change the signature of the function, not only on the type of object (DateTimeIndex to ndarray) but also the shape of it as show hereunder. The following example
outputs in pandas 0.19.2
and if we fix the issue, it will output
so any code within and outside pandas that relies on the fact that the shape is (2,) will break. Am I correct when stating this ? Or did I misunderstand something ? |
@jreback any feedback on comment here above about shape of results ? |
I commented on your PR. easier to discuss actual code. Yes the shape will change, it was wrong before. (that will be part of the whatsnew showing that.) |
I believe this is fixed in master when we enforced datetime with tz dtypes to be stored as object arrays with
|
When getting the values of a dataframe with multiple columns with a single type datetime64[ns,UTC], it returns a numpy array with dtype=object instead of dtype=datetime64[ns,UTC].
When returning a single column with dtype=datetime64[ns,UTC] or returning a multiple columns of dtype=datetime64[ns] (ie without tz), it works.
Looking at the code, it appears that SingleBlockManager.is_mixed_type returns False in the first case (datetime64[ns,UTC]) and True in the two other cases (single datetime64[ns,UTC] or multiple datetime64[ns]).
Code Sample, a copy-pastable example if possible
Actual Output
Expected Output
output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)
jinja2: 2.8
boto: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: