BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index #35997

alippai · 2020-08-30T18:19:25Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.

import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame([[datetime.now(timezone.utc)]], columns=['date']).set_index('date')
df.to_parquet('out.parquet')
pd.read_parquet('out.parquet')

Problem description

The bug above happens with pandas 1.1.1 and pyarrow 1.0.1.
The timezone-aware date in the index should survive the parquet round trip.
If date is not the index, or if I pass ignore_metadata=True to pyarrow.Table.to_pandas(), it works (but date is then not restored as the index automatically).

Expected Output

A correct DataFrame

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : f2ca0a2
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104-microsoft-standard
Version : #1 SMP Wed Feb 19 06:37:35 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

dsaxton · 2020-08-31T01:00:08Z

Thanks @alippai, looks like something may have changed for 1.1.0. Previously (1.0.5 and 1.0.0 at least) writing with the pyarrow engine would actually raise an ArrowInvalid exception, whereas now you can write but can't read. If you have the dependencies installed then an apparent workaround is to write using fastparquet instead:

In [1]: import pandas as pd
   ...: from datetime import datetime, timezone
   ...:
   ...: print(pd.__version__)
   ...:
   ...: idx = 5 * [datetime.now(timezone.utc)]
   ...: df = pd.DataFrame(index=idx)
   ...: df.to_parquet("out.parquet", engine="fastparquet")
   ...: pd.read_parquet("out.parquet", engine="pyarrow")
   ...:
1.1.1
Out[1]:
Empty DataFrame
Columns: []
Index: [2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00, 2020-08-31 00:50:41.828864+00:00]

cc @jorisvandenbossche who may know what's happening

dsaxton · 2020-08-31T01:58:45Z

Likely relevant:

commit be9ee6d
Author: Joris Van den Bossche
Date: Wed Feb 5 02:08:59 2020 +0100

BUG: avoid specifying default coerce_timestamps in to_parquet (#31652)

doc/source/whatsnew/v1.1.0.rst | 4 +++-
pandas/io/parquet.py | 8 +-------
pandas/tests/io/test_parquet.py | 7 +++++++
3 files changed, 11 insertions(+), 8 deletions(-)

alippai · 2020-08-31T08:19:30Z

Note that pyarrow version="2.0" doesn't help; that was the first setup I got the error with. I made the example more minimal later :)

alippai · 2020-08-31T08:51:49Z

I've added a test, maybe it helps: https://travis-ci.org/github/pandas-dev/pandas/jobs/722666704

alippai · 2020-08-31T21:14:26Z

One more interesting thing. Using timezone from datetime fails, but using timezone from pytz works:
Code:

import pandas as pd
from datetime import datetime, timezone
import pyarrow.parquet as pq
from pytz import timezone as pytztimezone

normal_date = datetime(2011, 8, 15, 8, 15, 12, 0, timezone.utc)
pytz_date = datetime(2011, 8, 15, 8, 15, 12, 0, pytztimezone('UTC'))

print(f'Normal date: {normal_date.tzinfo}')
print(f'pytz: {pytz_date.tzinfo}')
print(f'Normal date: {normal_date.tzname()}')
print(f'pytz: {pytz_date.tzname()}')

pd.DataFrame(index=[pytz_date]).to_parquet('pytz.parquet')
pd.DataFrame(index=[normal_date]).to_parquet('normal.parquet')
print(pq.read_table('pytz.parquet').schema.metadata)
print(pq.read_table('normal.parquet').schema.metadata)
print(pq.read_table('pytz.parquet').to_pandas())
print(pq.read_table('normal.parquet').to_pandas())

Output:

Normal date: UTC
pytz: UTC
Normal date: UTC
pytz: UTC
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "columns": [{"name": null, "field_name": "__index_level_0__", "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", "metadata": {"timezone": "UTC"}}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.2.0.dev0+182.g1e1e942e7"}'}
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], "columns": [{"name": null, "field_name": "__index_level_0__", "pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", "metadata": {"timezone": "+00:00"}}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "pandas_version": "1.2.0.dev0+182.g1e1e942e7"}'}
Empty DataFrame
Columns: []
Index: [2011-08-15 08:15:12+00:00]
Traceback (most recent call last):
  File "app.py", line 19, in <module>
    print(pq.read_table('normal.parquet').to_pandas())
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 769, in table_to_blockmanager
    table, index = _reconstruct_index(table, index_descriptors,
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 916, in _reconstruct_index
    result_table, index_level, index_name = _extract_index_level(
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 973, in _extract_index_level
    index_level = (pd.Series(values).dt.tz_localize('utc')
  File "/home/alippai/repositories/pandas/pandas/core/accessor.py", line 99, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/accessors.py", line 104, in _delegate_method
    result = method(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/datetimes.py", line 233, in tz_convert
    arr = self._data.tz_convert(tz)
  File "/home/alippai/repositories/pandas/pandas/core/arrays/datetimes.py", line 797, in tz_convert
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 91, in pandas._libs.tslibs.timezones.maybe_get_tz
    cpdef inline tzinfo maybe_get_tz(object tz):
  File "pandas/_libs/tslibs/timezones.pyx", line 106, in pandas._libs.tslibs.timezones.maybe_get_tz
    tz = pytz.timezone(tz)
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pytz/__init__.py", line 181, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: '+00:00'
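The only difference between the two metadata dumps above is the "timezone" field. A minimal sketch for pulling that field out with the standard library follows; the two dictionaries are abbreviated copies of the printed metadata, and `index_timezones` is a hypothetical helper name, not part of pandas or pyarrow.

```python
import json

# Abbreviated copies of the two schema-metadata dumps printed above.
pytz_meta = {
    b"pandas": b'{"index_columns": ["__index_level_0__"], "columns": '
               b'[{"name": null, "field_name": "__index_level_0__", '
               b'"pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", '
               b'"metadata": {"timezone": "UTC"}}]}'
}
stdlib_meta = {
    b"pandas": b'{"index_columns": ["__index_level_0__"], "columns": '
               b'[{"name": null, "field_name": "__index_level_0__", '
               b'"pandas_type": "datetimetz", "numpy_type": "datetime64[ns]", '
               b'"metadata": {"timezone": "+00:00"}}]}'
}


def index_timezones(schema_metadata):
    """Hypothetical helper: map field name to the stored timezone string."""
    pandas_meta = json.loads(schema_metadata[b"pandas"])
    return {
        col["field_name"]: col["metadata"]["timezone"]
        for col in pandas_meta["columns"]
        if col["pandas_type"] == "datetimetz"
    }


print(index_timezones(pytz_meta))    # {'__index_level_0__': 'UTC'}
print(index_timezones(stdlib_meta))  # {'__index_level_0__': '+00:00'}
```

With a real file, the same helper could be applied to `pq.read_table(path).schema.metadata`, as in the script above.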

alippai · 2020-08-31T21:28:38Z

Fastparquet fails to write a date that has a UTC offset but no timezone name:

import pandas as pd
from datetime import datetime, timezone
idx = [datetime.strptime('2019-01-04T16:41:24+0200', "%Y-%m-%dT%H:%M:%S%z")]
df = pd.DataFrame(index=idx)
df.to_parquet("out.parquet", engine="fastparquet")

Raises:

Traceback (most recent call last):
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/util.py", line 234, in get_column_metadata
    pd.Series([pd.to_datetime('now')]).dt.tz_localize(str(dtype.tz))
  File "/home/alippai/repositories/pandas/pandas/core/accessor.py", line 99, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/accessors.py", line 104, in _delegate_method
    result = method(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/indexes/datetimes.py", line 240, in tz_localize
    arr = self._data.tz_localize(tz, ambiguous, nonexistent)
  File "/home/alippai/repositories/pandas/pandas/core/arrays/datetimes.py", line 967, in tz_localize
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 91, in pandas._libs.tslibs.timezones.maybe_get_tz
    cpdef inline tzinfo maybe_get_tz(object tz):
  File "pandas/_libs/tslibs/timezones.pyx", line 106, in pandas._libs.tslibs.timezones.maybe_get_tz
    tz = pytz.timezone(tz)
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pytz/__init__.py", line 181, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'UTC+02:00'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "app.py", line 12, in <module>
    pd.DataFrame(index=[normal_date]).to_parquet('normal.parquet', engine='fastparquet')
  File "/home/alippai/repositories/pandas/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/home/alippai/repositories/pandas/pandas/core/frame.py", line 2396, in to_parquet
    to_parquet(
  File "/home/alippai/repositories/pandas/pandas/io/parquet.py", line 303, in to_parquet
    return impl.write(
  File "/home/alippai/repositories/pandas/pandas/io/parquet.py", line 211, in write
    self.api.write(
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/writer.py", line 875, in write
    fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/writer.py", line 697, in make_metadata
    get_column_metadata(data[column], column))
  File "/home/alippai/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/fastparquet/util.py", line 237, in get_column_metadata
    raise ValueError("Time-zone information could not be serialised: "
ValueError: Time-zone information could not be serialised: UTC+02:00, please use another
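The 'UTC+02:00' string in the error looks like the plain `str()` of a standard-library fixed-offset timezone, which is not a name pytz can look up. A small stdlib-only sketch of that, assuming the tzinfo reaching fastparquet is the `datetime.timezone` produced by `strptime` with `%z`:

```python
from datetime import datetime, timezone

# Same parsing call as in the failing example above.
parsed = datetime.strptime("2019-01-04T16:41:24+0200", "%Y-%m-%dT%H:%M:%S%z")
fixed = parsed.tzinfo

print(type(fixed).__name__)    # timezone
print(str(fixed))              # UTC+02:00
print(fixed.utcoffset(None))   # 2:00:00

# The UTC singleton stringifies to a name that pytz does recognize.
print(str(timezone.utc))       # UTC
```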

alippai · 2020-09-06T13:02:01Z

@jreback The PR is ready to review now

jorisvandenbossche · 2020-09-07T10:56:51Z

@alippai will take a look tomorrow (will also look into whether this is actually something we should solve long term in pyarrow, but I need to better understand the actual issue first)

alippai · 2020-09-07T11:05:01Z

@jorisvandenbossche regarding pyarrow, the behavior in #35997 (comment) is the most interesting part. With two identical / similar timezones, pyarrow serializes the metadata differently. This is something you will probably want to change eventually.

The PR doesn't fix/improve the serialization, but it helps reading back the already written metadata.

jorisvandenbossche · 2020-09-10T13:06:19Z

> looks like something may have changed for 1.1.0. Previously (1.0.5 and 1.0.0 at least) writing with the pyarrow engine would actually raise an ArrowInvalid exception, whereas now you can write but can't read.

@dsaxton that's actually because of the commit you linked to (#31652). If you pass coerce_timestamps=None to get pyarrow's default, you get the same error with pandas 1.0 as with 1.1.

So it seems this bug has already been present for some time. With pyarrow 0.17 I also get a similar error when roundtripping a pandas DataFrame with a tz-aware index to a pyarrow table.

@alippai the issue is indeed with how pyarrow stores the timezone in the schema metadata. For pytz it uses "UTC" (which is correctly recognized by pandas afterwards), but for datetime.timezone.utc it uses "+00:00" in the schema's pandas_metadata. So your PR to ensure pandas recognizes such format should indeed fix the issue.

I am still wondering why it doesn't fail for normal columns, though. And pyarrow should probably also recognize datetime.timezone.utc properly as "UTC".

jorisvandenbossche · 2020-09-10T14:01:08Z

So the reason is that for normal columns pyarrow first converted the string "+01:00" to a python timezone with an internal utility (pa.lib.string_to_tzinfo), and for index columns we didn't do this. Fixing this on the pyarrow side with apache/arrow#8162

I think we should still recognize datetime.timezone.utc as "UTC" as well (regardless of that, the above fix is needed in general for "fixed offset" timezones). For that I opened https://issues.apache.org/jira/browse/ARROW-9963

Independently from my fix in arrow, I think we can certainly still try to support "+01:00"-like strings in pandas as well (-> #36004)
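To illustrate what recognizing "+01:00"-like strings could look like, here is a stdlib-only sketch. It is not the code from #36004 or from pyarrow's `string_to_tzinfo`; `parse_fixed_offset` is a hypothetical helper, and it assumes the stored string always has the form sign, two-digit hours, colon, two-digit minutes.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed format: "+HH:MM" or "-HH:MM", as seen in the schema metadata.
_OFFSET_RE = re.compile(r"([+-])(\d{2}):(\d{2})")


def parse_fixed_offset(tz_string):
    """Hypothetical helper: turn '+01:00' into a datetime.timezone."""
    match = _OFFSET_RE.fullmatch(tz_string)
    if match is None:
        raise ValueError(f"not a fixed-offset string: {tz_string!r}")
    sign, hours, minutes = match.groups()
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        delta = -delta
    return timezone(delta)


tz = parse_fixed_offset("+01:00")
print(tz.utcoffset(None))  # 1:00:00

# "+00:00" compares equal to the UTC singleton.
print(parse_fixed_offset("+00:00") == timezone.utc)  # True

stamp = datetime(2011, 8, 15, 8, 15, 12, tzinfo=timezone.utc)
print(stamp.astimezone(tz).isoformat())  # 2011-08-15T09:15:12+01:00
```

Named zones such as "UTC" or "Europe/Brussels" are deliberately rejected here; a real implementation would presumably fall back to the normal timezone lookup for those.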

alippai added the Bug and Needs Triage labels Aug 30, 2020

dsaxton added the IO Parquet and Datetime labels and removed the Needs Triage label Aug 31, 2020

dsaxton changed the title from "BUG: Reading from parque throws UnknownTimeZoneError using timezone-aware date in index" to "BUG: Reading from parquet throws UnknownTimeZoneError using timezone-aware date in index" Aug 31, 2020

alippai mentioned this issue Aug 31, 2020

BUG: Can't restore index from parquet with offset-specified timezone #35997 #36004

Merged

5 tasks

jreback added this to the Contributions Welcome milestone Sep 5, 2020

jreback modified the milestones: Contributions Welcome, 1.2 Oct 7, 2020

jreback closed this as completed in #36004 Oct 7, 2020

jreback pushed a commit that referenced this issue Oct 7, 2020

BUG: Pandas can't restore index from parquet with offset-specified timezone #35997 (#36004)

a27c32a

torfsen mentioned this issue May 27, 2021

BUG: pytz.exceptions.UnknownTimeZoneError when loading psycopg2 timezone from Parquet #41690

Open

3 tasks

asfimport mentioned this issue Oct 10, 2020

[Python] Conversion to pandas with index column using fixed timezone fails apache/arrow#25988

Closed
