BUG: Can't restore index from parquet with offset-specified timezone #35997 #36004

alippai · 2020-08-31T08:30:23Z

closes BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index #35997
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-08-31T08:30:28Z

Hello @alippai! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-07 00:09:08 UTC

dsaxton · 2020-08-31T19:18:53Z

Thanks @alippai, however the test doesn't seem to be passing? I think we would first need to fix the issue before committing the test.

alippai · 2020-08-31T20:16:06Z

Like with TDD, you write the expectation as a test case and it fails, then the code is changed and the untouched test passes. Isn't this the workflow for fixing a bug? I haven't worked on the pandas codebase before, but I can give it a try.

alippai · 2020-08-31T20:38:30Z

BTW, setting up the environment worked like a charm, and what I did is described in the official docs too: https://pandas.pydata.org/docs/dev/development/contributing.html#test-driven-development-code-writing

So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue.

dsaxton · 2020-08-31T21:32:57Z

It's a good practice to write tests first, but a PR should include both the test and the fix that makes the test pass (unless the test is already passing of course)

alippai · 2020-08-31T21:53:15Z

@dsaxton @jbrockmendel I've pushed a naive fix, what do you think? Should I skip regex or is it OK?

pandas/_libs/tslibs/timezones.pyx

dsaxton

Where else is maybe_get_tz used? Curious why this wouldn't have broken anything previously. Also I think this change likely requires more tests in /tests/tslibs/test_timezones.py.

pandas/_libs/tslibs/timezones.pyx

jreback

Can you run the asv benchmarks associated with timezones and see how this performs?

alippai · 2020-09-01T15:44:02Z

@jreback I ran it; there are no significant changes (everything is within 10%), apart from index_cached_properties.IndexCache.*, which looks totally random.

alippai · 2020-09-01T17:06:00Z

~~I'm giving this up for now, feel free to grab & fix it~~

jorisvandenbossche

Regardless of the arrow/parquet bug (which I also fixed in pyarrow now), I think this change is probably good, because it basically means that we accept the string repr of datetime.timezone fixed offsets.

Adding it to the fixture ensures that it is exercised in a variety of tests, but are we sure there is a test somewhere that specifically asserts the correctness of the conversion (the code added in maybe_get_tz)? I would add an explicit test for this, one that does not rely on that fixture.
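
To illustrate the kind of conversion being discussed, here is a minimal standard-library sketch that turns the string repr of a `datetime.timezone` fixed offset (for example `UTC-02:15`) back into a `datetime.timezone`. The function name `parse_fixed_offset` and the regular expression are assumptions made for this illustration; this is not the code added to `maybe_get_tz`.

```python
import re
from datetime import timedelta, timezone

# Matches "+01:00", "-02:15", "UTC+01:00", "UTC-02:15".
# Hypothetical pattern for illustration only.
_FIXED_OFFSET = re.compile(r"^(?:UTC)?([+-])(\d{2}):(\d{2})$")


def parse_fixed_offset(name: str) -> timezone:
    """Convert a fixed-offset string such as 'UTC-02:15' to a timezone."""
    match = _FIXED_OFFSET.match(name)
    if match is None:
        raise ValueError(f"not a fixed-offset timezone string: {name!r}")
    sign, hours, minutes = match.groups()
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    # Apply the sign to the whole offset, so "-02:15" means minus 135 minutes.
    return timezone(-delta if sign == "-" else delta)


# Round trip: the string repr of a fixed offset parses back to an equal object.
original = timezone(timedelta(hours=-2, minutes=-15))
restored = parse_fixed_offset(str(original))
```

An explicit test of the conversion could assert this round trip directly, independent of any parquet I/O.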

jorisvandenbossche · 2020-09-10T14:22:58Z

pandas/tests/io/test_parquet.py

+    def test_timezone_aware_index(self, pa, timezone_aware_date_list):
+        idx = 5 * [timezone_aware_date_list]
+
+        df = pd.DataFrame(index=idx)


Can you also add it as a column (it can use the same values)?

Nice catch, actually this assertion fails:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="index_as_col") are different Attribute "dtype" are different [left]: datetime64[ns, UTC-02:15] [right]: datetime64[ns, pytz.FixedOffset(-135)]

@jorisvandenbossche now that pyarrow does something unexpected, I'm not sure how to proceed with this PR. Can you advise, please?

It's not necessarily unexpected from pyarrow. For better or worse, it always uses pytz for fixed offsets: it only stores the actual fixed offset, not which class was used on the Python side (I will answer about that on the Arrow PR). So I would just test the current behaviour: you can create the expected dataframe (with the different timezone) and pass it to check_round_trip.

Separately from this, I am wondering if we should treat fixed offset datetime.timezone(..) and pytz.FixedOffset(..) as equal when comparing dtypes (@jbrockmendel ?). Or at least have a utility or keyword in the assert_ to indicate they can be considered equal.
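
The idea of treating two fixed-offset classes as equivalent can be sketched with the standard library alone. In the sketch below, `MinuteOffset` is a hypothetical stand-in for a third-party fixed-offset class (it is not pytz), and `same_fixed_offset` is an assumed helper name, not an existing pandas utility. It only makes sense for zones whose `utcoffset(None)` returns their constant offset, as `datetime.timezone` does.

```python
from datetime import timedelta, timezone, tzinfo


class MinuteOffset(tzinfo):
    """Hypothetical stand-in for a third-party fixed-offset tzinfo class."""

    def __init__(self, minutes: int) -> None:
        self._offset = timedelta(minutes=minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return None


def same_fixed_offset(left: tzinfo, right: tzinfo) -> bool:
    """Compare two fixed-offset zones by offset, ignoring their classes."""
    return left.utcoffset(None) == right.utcoffset(None)


std_tz = timezone(timedelta(minutes=-135))
other_tz = MinuteOffset(-135)

# Plain equality is class-sensitive, while the helper looks only at the offset.
plain_equal = std_tz == other_tz
offset_equal = same_fixed_offset(std_tz, other_tz)
```

Here `plain_equal` is `False` and `offset_equal` is `True`, which mirrors the dtype mismatch reported in the assertion error above.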

@jorisvandenbossche I exposed the check_dtype parameter for now, this way it passes. What do you think?

jreback · 2020-09-13T22:38:28Z

pandas/tests/io/test_parquet.py

@@ -724,6 +744,13 @@ def test_timestamp_nanoseconds(self, pa):
        df = pd.DataFrame({"a": pd.date_range("2017-01-01", freq="1n", periods=10)})
        check_round_trip(df, pa, write_kwargs={"version": "2.0"})

+    def test_timezone_aware_index(self, pa, timezone_aware_date_list):


Can you change this so that on future versions of pyarrow it sets check_dtype=True (in other words, once the bug is fixed)? Otherwise I am worried this will persist forever.

Alternatively, you could remove check_dtype and just xfail this. When a new pyarrow fixes this, the test will start to xpass, which will make it fail since we have strict=True. So again, this could be made conditional on an older or newer version of pyarrow.
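
A version gate of the kind suggested here could look roughly like the following sketch. The names `version_tuple`, `should_check_dtype`, and `FIRST_FIXED_VERSION` are hypothetical, and the threshold is a deliberate placeholder because no pyarrow release is known to change this behaviour. In a real test, the installed version string would come from `pyarrow.__version__`.

```python
import re

# Placeholder threshold (assumption): no release is known to change the behaviour.
FIRST_FIXED_VERSION = "999.0.0"


def version_tuple(version: str) -> tuple:
    """Turn '2.0.0' or '3.0.0.dev1' into a comparable 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = re.match(r"\d+", piece)
        if digits is None:
            break
        parts.append(int(digits.group()))
    # Pad short versions such as '2.0' so comparisons stay consistent.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def should_check_dtype(installed: str, fixed_in: str = FIRST_FIXED_VERSION) -> bool:
    """Return True once the installed version reaches the fixed release."""
    return version_tuple(installed) >= version_tuple(fixed_in)
```

The result could then be passed as the `check_dtype` argument, so the relaxed comparison switches itself off once a fixed version is installed.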

@jreback Based on the comment above by @jorisvandenbossche, I don't expect this to be "fixed" in Arrow. pytz.FixedOffset is a wrapper around timedelta, like datetime.timezone is, and they both implement datetime.tzinfo. Of course, I'm happy to make any change you ask for.

Yes, there are no active plans in pyarrow to change this. It would probably make sense to use datetime.timezone, but if we do that, it's pytz.FixedOffset that won't be preserved, in which case another test would probably need check_dtype=False.

@jreback I added a long description. I know this won't help a follow-up improvement in the future, but at this point the behaviour is unlikely to change.

There are two scenarios:

1. arrow/parquet persists the initial class information
2. the assert_ introduces a sub-feature of check_dtype=True (check_timezone_dtype=False by default) accepting pytz.FixedOffset and datetime.timezone equality

I don't see either of them on the short- or mid-term roadmap.

alippai · 2020-10-06T21:59:35Z

Do you need any changes? What else is needed for the merge?

jreback · 2020-10-06T22:09:45Z

> Do you need any changes? What else is needed for the merge?

OK, this LGTM. If you add a whatsnew note for 1.2 under the I/O section, we can get this in. Ping on green.

alippai · 2020-10-06T23:20:35Z

@jreback The whatsnew entry is added; the CI failures are unrelated. Thanks!

jreback · 2020-10-06T23:59:40Z

doc/source/whatsnew/v1.2.0.rst

@@ -392,6 +392,7 @@ I/O
 - Bug in :meth:`read_csv` with ``engine='python'`` truncating data if multiple items present in first row and first element started with BOM (:issue:`36343`)
 - Removed ``private_key`` and ``verbose`` from :func:`read_gbq` as they are no longer supported in ``pandas-gbq`` (:issue:`34654`, :issue:`30200`)
 - Bumped minimum pytables version to 3.5.1 to avoid a ``ValueError`` in :meth:`read_hdf` (:issue:`24839`)
+- String representation for fixed offset timezones were not recognized (:issue:`35997`, :issue:`36004`)


Sorry, can you call out parquet round-tripping in particular here, as this is where it is most likely to affect the user?

You are right, how about now?

jreback · 2020-10-07T01:45:03Z

thanks for hanging in there @alippai nice work!

alippai · 2020-10-07T09:23:21Z

This was my first contribution to pandas, thank you @dsaxton, @jbrockmendel, @jorisvandenbossche and @jreback for the help. I really enjoyed working on this and I appreciate the fact that I had close to zero technical issues setting up the dev environment.

dsaxton added the IO Parquet and Testing labels Aug 31, 2020

alippai force-pushed the patch-1 branch from 6acadb4 to 037975c Compare August 31, 2020 21:24

alippai force-pushed the patch-1 branch from 037975c to 96fcc89 Compare August 31, 2020 21:48

jbrockmendel reviewed Aug 31, 2020

alippai changed the title ~~Test for #35997~~ BUG: Can't restore index from parquet with offset-specified timezone #35997 Aug 31, 2020

alippai force-pushed the patch-1 branch 2 times, most recently from ddac669 to 00993d6 Compare August 31, 2020 22:26

dsaxton reviewed Aug 31, 2020

alippai force-pushed the patch-1 branch 2 times, most recently from 88369b4 to c410146 Compare September 1, 2020 00:41

dsaxton reviewed Sep 1, 2020

dsaxton added the Bug label and removed the Testing label Sep 1, 2020

jreback requested changes Sep 1, 2020

alippai force-pushed the patch-1 branch from c410146 to 5195db2 Compare September 1, 2020 10:11

alippai force-pushed the patch-1 branch 2 times, most recently from 55b6914 to 94c0763 Compare September 6, 2020 11:51

jreback requested changes Sep 6, 2020

jreback added the Timezones label Sep 6, 2020

dsaxton reviewed Sep 6, 2020

alippai force-pushed the patch-1 branch from 94c0763 to 098fe56 Compare September 7, 2020 11:23

jorisvandenbossche mentioned this pull request Sep 10, 2020

BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index #35997

Closed

2 tasks

alippai requested a review from jbrockmendel September 10, 2020 14:21

jorisvandenbossche reviewed Sep 10, 2020

alippai force-pushed the patch-1 branch from 6c1ddc9 to 2297cf0 Compare September 11, 2020 09:55

alippai requested a review from jorisvandenbossche September 12, 2020 00:51

alippai force-pushed the patch-1 branch 3 times, most recently from 9cbfb97 to ee8281b Compare September 12, 2020 10:09

jreback requested changes Sep 13, 2020

jorisvandenbossche reviewed Sep 14, 2020

jorisvandenbossche added this to the 1.2 milestone Sep 14, 2020

alippai force-pushed the patch-1 branch from ee8281b to 0a3a622 Compare September 14, 2020 09:47

alippai requested review from jreback and jorisvandenbossche September 14, 2020 09:56

alippai force-pushed the patch-1 branch from 0a3a622 to cb1cb9d Compare September 16, 2020 08:58

alippai force-pushed the patch-1 branch from cb1cb9d to 44eea1f Compare October 6, 2020 22:25

jreback requested changes Oct 7, 2020

alippai force-pushed the patch-1 branch from 44eea1f to cbbf5cf Compare October 7, 2020 00:07

BUG: Pandas can't restore index from parquet with offset-specified ti…

1772840

…mezone pandas-dev#35997

alippai force-pushed the patch-1 branch from cbbf5cf to 1772840 Compare October 7, 2020 00:09

jreback approved these changes Oct 7, 2020

jreback merged commit a27c32a into pandas-dev:master Oct 7, 2020

jorisvandenbossche mentioned this pull request Oct 21, 2020

BUG: Pyarrow 2.0.0 broke test_timezone_aware_index 6/7 tests #37286

Open

simonjayhawkins mentioned this pull request Oct 21, 2020

CI: temporary skip parquet tz test for pyarrow>=2.0.0 #37303

Merged

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: Pandas can't restore index from parquet with offset-specified ti…

e6321ec

…mezone pandas-dev#35997 (pandas-dev#36004)
