-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Can't restore index from parquet with offset-specified timezone #35997 #36004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks @alippai, however the test doesn't seem to be passing? I think we would first need to fix the issue before committing the test. |
Like with TDD, you write the expectation as a testcase and it fails, then the code is changed and the untouched test passes. Isn't this the workflow for fixing a bug? I didn't work on pandas codebase before, but I can give it a try. |
BTW Setting up the environment works like a charm, but the thing I did is in the official docs too: https://pandas.pydata.org/docs/dev/development/contributing.html#test-driven-development-code-writing
|
It's a good practice to write tests first, but a PR should include both the test and the fix that makes the test pass (unless the test is already passing of course) |
@dsaxton @jbrockmendel I've pushed a naive fix, what do you think? Should I skip regex or is it OK? |
ddac669
to
00993d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where else is maybe_get_tz used? Curious why this wouldn't have broken anything previously. Also I think this change likely requires more tests in /tests/tslibs/test_timezones.py.
88369b4
to
c410146
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you run asv's associated with timezones and see how this performs.
@jreback Run it, there are no significant changes within 10% (beside the fact that |
|
55b6914
to
94c0763
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of the arrow/parquet bug (which I also fixed in pyarrow now), I think this change is probably good, because it basically means that we accept the string repr of datetime.timezone
fixed offsets.
Adding it to the fixture ensures that it is tested in a variety of tests, but are we sure there is somewhere a test that specifically asserts the correctness of the conversion (the code added in maybe_get_tz)? I would maybe add an explicit test for this (not relying on that fixture).
pandas/tests/io/test_parquet.py
Outdated
def test_timezone_aware_index(self, pa, timezone_aware_date_list): | ||
idx = 5 * [timezone_aware_date_list] | ||
|
||
df = pd.DataFrame(index=idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add it as a column (can use the same values) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, actually this assertion fails:
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="index_as_col") are different
Attribute "dtype" are different
[left]: datetime64[ns, UTC-02:15]
[right]: datetime64[ns, pytz.FixedOffset(-135)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche now that pyarrow does something unexpected I'm not sure how to proceed with this PR. Can you advice, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessarily unexpected from pyarrow (it just always uses pytz (for better or worse) for fixed offsets, it only stores the actual fixed offset, not what class was used on the python side, but will answer about that on the Arrow PR). So I would just test the current behaviour: you can create the expected dataframe (with the different timezone) and pass this to check_round_trip
.
Separately from this, I am wondering if we should treat fixed offset datetime.timezone(..)
and pytz.FixedOffset(..)
as equal when comparing dtypes (@jbrockmendel ?). Or at least have an utility or keyword in the assert_
to indicate they can be considered equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I exposed the check_dtype
parameter for now, this way it passes. What do you think?
9cbfb97
to
ee8281b
Compare
@@ -724,6 +744,13 @@ def test_timestamp_nanoseconds(self, pa): | |||
df = pd.DataFrame({"a": pd.date_range("2017-01-01", freq="1n", periods=10)}) | |||
check_round_trip(df, pa, write_kwargs={"version": "2.0"}) | |||
|
|||
def test_timezone_aware_index(self, pa, timezone_aware_date_list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you change this so that on future versions of pyarrow it will set check_dtype=True (IOW when the bug is fixed). otherwise I am worried this will perpetuate forever.
alternatively, could remove the check_dtype and just xfail this (and when new pyarrow fixes this test will then start to xpass which will make it fail as we have strict=True). so again could make this a conditional on an older / newer version of pyarrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback based on the comment above by @jorisvandenbossche , I don't expect this to be "fixed" in Arrow. pytz.FixedOffset
is a wrapper around timedelta
like datetime.timezone
is and they both implement datetime.tzinfo
. ofc I'm happy to do any change you ask
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are no active plans in pyarrow to change this. Although it makes probably sense to use datetime.timezone
, but if we do that, it's pytz.FixedOffset
that won't be preserved, in which case another test probably needs a check_dtype=False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I added a long description. I know this won't help a follow up improvement in the future, but at this point it's unlikely to change.
There are two scenarios:
- arrow/parquet persists the initial class information
- the assert_ introduces a sub-feature of check_dtype=True (
check_timezone_dtype=False
by default) acceptingpytz.FixedOffset
anddatetime.timezone
equality
I don't see any of them on the short or mid-term roadmap.
Do you need any changes? What else needed for the merge? |
ok this lgtm. if you'd add a whatsnew note in 1.2 under I/O section can get this in . ping on green. |
@jreback The what's new entry is added, the CI failures are not related. Thanks! |
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -392,6 +392,7 @@ I/O | |||
- Bug in :meth:`read_csv` with ``engine='python'`` truncating data if multiple items present in first row and first element started with BOM (:issue:`36343`) | |||
- Removed ``private_key`` and ``verbose`` from :func:`read_gbq` as they are no longer supported in ``pandas-gbq`` (:issue:`34654`, :issue:`30200`) | |||
- Bumped minimum pytables version to 3.5.1 to avoid a ``ValueError`` in :meth:`read_hdf` (:issue:`24839`) | |||
- String representation for fixed offset timezones were not recognized (:issue:`35997`, :issue:`36004`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry can u in particular call out parquet roundtrioping here as this is where this is likely to affect t the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, how about now?
thanks for hanging in there @alippai nice work! |
This was my first contribution to pandas, thank you @dsaxton, @jbrockmendel, @jorisvandenbossche and @jreback for the help. I really enjoyed working on this and I appreciate the fact that I had close to zero technical issues setting up the dev environment. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff