Skip to content

BUG: Can't restore index from parquet with offset-specified timezone #35997 #36004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 7, 2020

Conversation

alippai
Copy link
Contributor

@alippai alippai commented Aug 31, 2020

@pep8speaks
Copy link

pep8speaks commented Aug 31, 2020

Hello @alippai! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-07 00:09:08 UTC

@dsaxton dsaxton added IO Parquet parquet, feather Testing pandas testing functions or related to the test suite labels Aug 31, 2020
@dsaxton
Copy link
Member

dsaxton commented Aug 31, 2020

Thanks @alippai, however the test doesn't seem to be passing? I think we would first need to fix the issue before committing the test.

@alippai
Copy link
Contributor Author

alippai commented Aug 31, 2020

Like with TDD, you write the expectation as a testcase and it fails, then the code is changed and the untouched test passes. Isn't this the workflow for fixing a bug? I didn't work on pandas codebase before, but I can give it a try.

@alippai
Copy link
Contributor Author

alippai commented Aug 31, 2020

BTW Setting up the environment works like a charm, but the thing I did is in the official docs too: https://pandas.pydata.org/docs/dev/development/contributing.html#test-driven-development-code-writing

So, before actually writing any code, you should write your tests. Often the test can be taken from the original GitHub issue.

@dsaxton
Copy link
Member

dsaxton commented Aug 31, 2020

It's a good practice to write tests first, but a PR should include both the test and the fix that makes the test pass (unless the test is already passing of course)

@alippai
Copy link
Contributor Author

alippai commented Aug 31, 2020

@dsaxton @jbrockmendel I've pushed a naive fix, what do you think? Should I skip regex or is it OK?

@alippai alippai changed the title Test for #35997 BUG: Can't restore index from parquet with offset-specified timezone #35997 Aug 31, 2020
@alippai alippai force-pushed the patch-1 branch 2 times, most recently from ddac669 to 00993d6 Compare August 31, 2020 22:26
Copy link
Member

@dsaxton dsaxton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where else is maybe_get_tz used? Curious why this wouldn't have broken anything previously. Also I think this change likely requires more tests in /tests/tslibs/test_timezones.py.

@alippai alippai force-pushed the patch-1 branch 2 times, most recently from 88369b4 to c410146 Compare September 1, 2020 00:41
@dsaxton dsaxton added Bug and removed Testing pandas testing functions or related to the test suite labels Sep 1, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run asv's associated with timezones and see how this performs.

@alippai
Copy link
Contributor Author

alippai commented Sep 1, 2020

@jreback Run it, there are no significant changes within 10% (beside the fact that index_cached_properties.IndexCache.* looks totally random)

@alippai
Copy link
Contributor Author

alippai commented Sep 1, 2020

I'm giving this up for now, feel free to grab & fix it

@alippai alippai force-pushed the patch-1 branch 2 times, most recently from 55b6914 to 94c0763 Compare September 6, 2020 11:51
@jreback jreback added the Timezones Timezone data dtype label Sep 6, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the arrow/parquet bug (which I also fixed in pyarrow now), I think this change is probably good, because it basically means that we accept the string repr of datetime.timezone fixed offsets.

Adding it to the fixture ensures that it is tested in a variety of tests, but are we sure there is somewhere a test that specifically asserts the correctness of the conversion (the code added in maybe_get_tz)? I would maybe add an explicit test for this (not relying on that fixture).

def test_timezone_aware_index(self, pa, timezone_aware_date_list):
idx = 5 * [timezone_aware_date_list]

df = pd.DataFrame(index=idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add it as a column (can use the same values) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, actually this assertion fails:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="index_as_col") are different

Attribute "dtype" are different
[left]:  datetime64[ns, UTC-02:15]
[right]: datetime64[ns, pytz.FixedOffset(-135)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche now that pyarrow does something unexpected I'm not sure how to proceed with this PR. Can you advice, please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessarily unexpected from pyarrow (it just always uses pytz (for better or worse) for fixed offsets, it only stores the actual fixed offset, not what class was used on the python side, but will answer about that on the Arrow PR). So I would just test the current behaviour: you can create the expected dataframe (with the different timezone) and pass this to check_round_trip.

Separately from this, I am wondering if we should treat fixed offset datetime.timezone(..) and pytz.FixedOffset(..) as equal when comparing dtypes (@jbrockmendel ?). Or at least have an utility or keyword in the assert_ to indicate they can be considered equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I exposed the check_dtype parameter for now, this way it passes. What do you think?

@@ -724,6 +744,13 @@ def test_timestamp_nanoseconds(self, pa):
df = pd.DataFrame({"a": pd.date_range("2017-01-01", freq="1n", periods=10)})
check_round_trip(df, pa, write_kwargs={"version": "2.0"})

def test_timezone_aware_index(self, pa, timezone_aware_date_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change this so that on future versions of pyarrow it will set check_dtype=True (IOW when the bug is fixed). otherwise I am worried this will perpetuate forever.

alternatively, could remove the check_dtype and just xfail this (and when new pyarrow fixes this test will then start to xpass which will make it fail as we have strict=True). so again could make this a conditional on an older / newer version of pyarrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback based on the comment above by @jorisvandenbossche , I don't expect this to be "fixed" in Arrow. pytz.FixedOffset is a wrapper around timedelta like datetime.timezone is and they both implement datetime.tzinfo. ofc I'm happy to do any change you ask

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are no active plans in pyarrow to change this. Although it makes probably sense to use datetime.timezone, but if we do that, it's pytz.FixedOffset that won't be preserved, in which case another test probably needs a check_dtype=False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I added a long description. I know this won't help a follow up improvement in the future, but at this point it's unlikely to change.

There are two scenarios:

  1. arrow/parquet persists the initial class information
  2. the assert_ introduces a sub-feature of check_dtype=True (check_timezone_dtype=False by default) accepting pytz.FixedOffset and datetime.timezone equality

I don't see any of them on the short or mid-term roadmap.

@alippai
Copy link
Contributor Author

alippai commented Oct 6, 2020

Do you need any changes? What else needed for the merge?

@jreback
Copy link
Contributor

jreback commented Oct 6, 2020

Do you need any changes? What else needed for the merge?

ok this lgtm. if you'd add a whatsnew note in 1.2 under I/O section can get this in . ping on green.

@alippai
Copy link
Contributor Author

alippai commented Oct 6, 2020

@jreback The what's new entry is added, the CI failures are not related. Thanks!

@@ -392,6 +392,7 @@ I/O
- Bug in :meth:`read_csv` with ``engine='python'`` truncating data if multiple items present in first row and first element started with BOM (:issue:`36343`)
- Removed ``private_key`` and ``verbose`` from :func:`read_gbq` as they are no longer supported in ``pandas-gbq`` (:issue:`34654`, :issue:`30200`)
- Bumped minimum pytables version to 3.5.1 to avoid a ``ValueError`` in :meth:`read_hdf` (:issue:`24839`)
- String representation for fixed offset timezones were not recognized (:issue:`35997`, :issue:`36004`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry can u in particular call out parquet roundtrioping here as this is where this is likely to affect t the user

Copy link
Contributor Author

@alippai alippai Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, how about now?

@jreback jreback merged commit a27c32a into pandas-dev:master Oct 7, 2020
@jreback
Copy link
Contributor

jreback commented Oct 7, 2020

thanks for hanging in there @alippai nice work!

@alippai
Copy link
Contributor Author

alippai commented Oct 7, 2020

This was my first contribution to pandas, thank you @dsaxton, @jbrockmendel, @jorisvandenbossche and @jreback for the help. I really enjoyed working on this and I appreciate the fact that I had close to zero technical issues setting up the dev environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Parquet parquet, feather Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index
6 participants