-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
KeyError when comparing DataFrame with tz-aware DatetimeIndex on columns with DST change #19970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
If you're familiar with |
Checked out git bisect-- it appears the issue is with #18618 (CC: @jbrockmendel) [c0e3767] handle DST appropriately in Timestamp.replace (#18618) I'm not familiar with Cython and it's unclear to me why this commit would have this effect. This was the unit test I wrote for the git bisect, if that helps: def test_transpose_dst_changes(self):
dstchgidx = date_range('20160101', '20161130', freq='4H',
tz='America/New_York')
df = DataFrame({'a':np.arange(len(dstchgidx))},
dstchgidx)
assert_frame_equal(df, df)
assert_frame_equal(df.T, df.T) |
@jbrockmendel @TomAugspurger This regression from 0.22 is persisting in 0.23.0rc2. |
Yep, still open. Have you had a chance to look into a fix?
…On Wed, May 2, 2018 at 2:24 PM, David Krych ***@***.***> wrote:
@jbrockmendel <https://github.com/jbrockmendel> @TomAugspurger
<https://github.com/TomAugspurger> This regression from 0.22 is
persisting in 0.23.0rc2.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19970 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIk50NpiSRE6zQmjbfGv6jOzLxtZVks5tugfSgaJpZM4SaR7E>
.
|
@TomAugspurger Mostly bumping because of the request for bug reports against the RC. I've never written any Cython at all... was sort of hoping someone who understood #18618 would have an idea what was going on here. That said, I'll take another look, since it might be fixable outside the Cython if we think the Cython logic is correct. |
I'm looking into this now. It looks like the issue is in |
I semi-take it back. The problem is this:
1 AM on 2016-11-06 happens twice in America/New York, and calling the Timestamp constructor here is changing which one we're looking at. |
Thanks @jbrockmendel! In that case would this be a sufficient fix?
Idea here being that if the Timestamp comes in tz-localized we don't subsequently try to re-localize it (and lose the ambiguity resolution that has already taken place)? I notice that pattern occurs in other places in datetimes.py-- get_value, get_value_maybe_box, etc. Those should presumably change as well? |
Tracking it down a little further, I'm now thinking the fix is in
It looks like the suggested solution leaves |
To be honest, I don't know why, but this does seem to work with the above fix:
That said, my understanding of how any of this works is extremely hazy (I've never really looked at any of this code before) so I defer to you as to the best solution. |
@jbrockmendel xref #20854 (comment) I think the problem is in this branch: pandas/pandas/_libs/tslibs/conversion.pyx Lines 352 to 355 in b02c69a
I think |
@mroeschke that's a part of the conversion code that I've been uncomfortable with (see above it the comment "# sort of a temporary hack"). In particular, the thing we may need @jreback to weigh in on is if the |
here's a repro
would expected 68==69 [True] so the round-trip works if we have a datetime but not a Timestamp. here's the doc-string for
I think that the 'hack' here (should remove that line!). is that you have to do this. I think this may have been working around a construction bug elsewhere in pandas, but not really sure. Since this routine can take datetimes or Timestamps, then we prob could just coerce to datetimes at the beginning leave this logic. |
so @jbrockmendel is correct, we need a |
Code Sample, a copy-pastable example if possible
Problem description
On current master, if you have a DataFrame with a tz-aware DatetimeIndex in the columns, comparison can fail with a KeyError. Oddly, it appears this occurs only if the DatetimeIndex is the columns and not on the index.
Based on the KeyError it appears that for some reason .loc is looking for the pre-DST change value when it should be looking at the post-DST value. Reproducing the bug requires at least 2 DST switches in the original DataFrame. DataFrames with just a single DST switch do not seem to exhibit the behavior unless they are views on a larger DataFrame with two switches.
This is new on master; behavior doesn't occur in 0.22.
Expected Output
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: e3b87c1
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0.dev0+422.ge3b87c1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: