Skip to content

BUG: Pyarrow 2.0.0 broke test_timezone_aware_index 6/7 tests #37286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
phofl opened this issue Oct 20, 2020 · 7 comments
Open

BUG: Pyarrow 2.0.0 broke test_timezone_aware_index 6/7 tests #37286

phofl opened this issue Oct 20, 2020 · 7 comments
Labels
Dependencies Required and optional dependencies IO Parquet parquet, feather

Comments

@phofl
Copy link
Member

phofl commented Oct 20, 2020

Pyarrow 2.0.0 was released yesterday and broke the test test_timezone_aware_index in 6 of 7 cases. Can be seen through our CI. Location of test is pandas/tests/io/test_parquet.py

Link to ci:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=45611&view=logs&j=b1c7b65e-b3ce-541a-7fd5-29b4ba56ce18&t=46ecc253-e38f-5abc-2ea7-addca6b44d0a&l=22

cc @jreback

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2020
@jreback jreback added this to the 1.2 milestone Oct 20, 2020
@jreback jreback added Dependencies Required and optional dependencies IO Parquet parquet, feather and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2020
@jreback
Copy link
Contributor

jreback commented Oct 20, 2020

thanks @phofl
yeah i think we need to fix these tests and then pinpyarrow <=1.0 on most branches (let this float on as many branches as you can though).

hopefuly its just our tests and not an actual break in pyarrow.

cc @jorisvandenbossche

@jorisvandenbossche
Copy link
Member

Yes, this is only a problem in our testing machinery, not an actual bug (there is a small change in behaviour in pyarrow, but not one that should be breaking for users, I think).
I put up #37303 to temporary skip the test until we fix it properly.

@jorisvandenbossche
Copy link
Member

So first, to explain what changed in pyarrow 2.0.

If you have a DataFrame using datetime.timezone fixed offset timezone, and write this to parquet, and read it back, this timezone might get converted into a pytz.FixedOffset:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import datetime

dt = datetime.datetime.now(datetime.timezone.min)
values = 5 * [dt]
df = pd.DataFrame({"column": values}, index=values)
df.to_parquet("data.parquet")

With pyarrow 1.0, this only happens for columns, and not for the index (but note that until recently, it actually errored for the index, which was fixed in #36004):

In [7]: pa.__version__
Out[7]: '1.0.1'

In [8]: result = pd.read_parquet("data.parquet")

In [9]: result.index.dtype
Out[9]: datetime64[ns, UTC-23:59]

In [10]: result.column.dtype
Out[10]: datetime64[ns, pytz.FixedOffset(-1439)]

What we changed in pyarrow 2.0, is that we now consistently use pytz.FixedOffset:

In [2]: pa.__version__
Out[2]: '2.0.0.dev528+gaec1f97e7'

In [3]: result = pd.read_parquet("data.parquet")

In [4]: result.index.dtype
Out[4]: datetime64[ns, pytz.FixedOffset(-1439)]

In [5]: result.column.dtype
Out[5]: datetime64[ns, pytz.FixedOffset(-1439)]

(this could also be consistently datetime.timezone, but then pytz fixed timezone would not be preserved. So there will always be some mismatch, as long as pyarrow does not record the original timezone class, which I am not sure we should do).

In practice, those datetime values are equal, there timezone have the offset, etc, but it's our testing machinery that cannot handle those different but equivalent timezones.

@jorisvandenbossche
Copy link
Member

To focus on the pandas part, let's illustrate the problem.

We create two Series objects with the same values but datetime.timezone vs pytz.FixedOffset fixed offset timezone:

import datetime
import pytz

dt1 = datetime.datetime.now(datetime.timezone.min)
dt2 = dt1.astimezone(pytz.FixedOffset(dt1.utcoffset().total_seconds() / 60))
s1 = pd.Series([dt1])
s2 = pd.Series([dt2])

In [2]: s1
Out[2]: 
0   2020-10-20 09:49:37.628339-23:59
dtype: datetime64[ns, UTC-23:59]

In [3]: s2
Out[3]: 
0   2020-10-20 09:49:37.628339-23:59
dtype: datetime64[ns, pytz.FixedOffset(-1439)]

Those evaluate as unequal, but with check_dtype=False, we can overcome this:

In [4]: tm.assert_series_equal(s1, s2)
...
AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  datetime64[ns, UTC-23:59]
[right]: datetime64[ns, pytz.FixedOffset(-1439)]

In [5]: tm.assert_series_equal(s1, s2, check_dtype=False)

However, for Index objects, there is no check_dtype keyword that works in a similar way:

In [6]: tm.assert_index_equal(pd.Index(s1), pd.Index(s2))
...
~/scipy/pandas/pandas/_testing.py in _check_types(l, r, obj)
    717             # Skip exact dtype checking when `check_categorical` is False
    718             if check_categorical:
--> 719                 assert_attr_equal("dtype", l, r, obj=obj)
    720 
    721             # allow string-like to have different inferred_types

    [... skipping hidden 2 frame]

AssertionError: Index are different

Attribute "dtype" are different
[left]:  datetime64[ns, UTC-23:59]
[right]: datetime64[ns, pytz.FixedOffset(-1439)]

In [8]: tm.assert_index_equal(pd.Index(s1), pd.Index(s2), exact=False)
...
AssertionError: Index are different

Index values are different (0.0 %)
[left]:  DatetimeIndex(['2020-10-20 09:49:37.628339-23:59'], dtype='datetime64[ns, UTC-23:59]', freq=None)
[right]: DatetimeIndex(['2020-10-20 09:49:37.628339-23:59'], dtype='datetime64[ns, pytz.FixedOffset(-1439)]', freq=None)

(and you can see here why check_categorical indirectly can help, as what @jbrockmendel does in #37296

@jbrockmendel
Copy link
Member

this is no longer failing on the CI, has it been resolved @jorisvandenbossche ?

@jorisvandenbossche
Copy link
Member

CI is no longer failing because #37303 temporarily skipped the test, but so the actual cause has not yet been resolved. See my comment just above for the more detailed explanation of the underlying issue.

@jreback jreback modified the milestones: 1.2, Contributions Welcome Nov 28, 2020
@jreback
Copy link
Contributor

jreback commented Nov 28, 2020

we should fix this but moving off 1.2

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dependencies Required and optional dependencies IO Parquet parquet, feather
Projects
None yet
Development

No branches or pull requests

5 participants