BUG: pytz.exceptions.UnknownTimeZoneError when loading psycopg2 timezone from Parquet #41690

Open
2 of 3 tasks
torfsen opened this issue May 27, 2021 · 2 comments
Labels: Bug, IO Parquet (parquet, feather), Timezones (Timezone data dtype)

Comments

@torfsen
torfsen commented May 27, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import datetime as dt
import pandas as pd
from psycopg2.tz import FixedOffsetTimezone

# Simulate getting a timestamp from the DB
psycopg2_utc = FixedOffsetTimezone()
timestamp = dt.datetime.now(psycopg2_utc)

df = pd.DataFrame({'foo': [timestamp]})

df.to_parquet('test1.parquet')
try:
    pd.read_parquet('test1.parquet')
except Exception as e:
    print(f'First load failed: {e}')

# Fix by converting to pytz UTC
df['foo'] = df['foo'].dt.tz_convert('UTC')

df.to_parquet('test2.parquet')
pd.read_parquet('test2.parquet')
print('Second load succeeded')

Problem description

When a DataFrame that contains a datetime with a psycopg2 timezone object is stored in a Parquet file, loading that file raises a pytz.exceptions.UnknownTimeZoneError. Converting the column to pytz UTC via .dt.tz_convert before writing fixes the issue.

This is similar to but different from #25423 and #35997 as far as I understand.

See also this StackOverflow discussion.
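
The workaround from the problem description can be applied to a whole DataFrame before writing. The following is only a minimal sketch: `normalize_tz_columns` is a hypothetical helper, not part of pandas, and a stdlib fixed-offset timezone stands in for the psycopg2 `FixedOffsetTimezone` so the snippet runs without a database driver.

```python
import datetime as dt

import pandas as pd


def normalize_tz_columns(df: pd.DataFrame, tz: str = "UTC") -> pd.DataFrame:
    """Return a copy of df with every tz-aware datetime column converted to tz.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    out = df.copy()
    for name in out.columns:
        # Only tz-aware datetime columns carry a tzinfo object that a
        # Parquet engine has to serialise by name.
        if isinstance(out[name].dtype, pd.DatetimeTZDtype):
            out[name] = out[name].dt.tz_convert(tz)
    return out


# Assumption: a stdlib fixed +02:00 offset stands in for a driver-specific tzinfo.
stamp = dt.datetime(2021, 5, 27, 12, 0, tzinfo=dt.timezone(dt.timedelta(hours=2)))
df = pd.DataFrame({"foo": [stamp], "bar": [1]})

clean = normalize_tz_columns(df)
print(clean.dtypes)
```

Calling `clean.to_parquet(...)` instead of `df.to_parquet(...)` should then store a timezone name that can be resolved on load, as in the second half of the code sample above. The conversion keeps the instant in time but discards the original offset.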

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 2cb9652
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-73-generic
Version : #82~18.04.1-Ubuntu SMP Fri Apr 16 15:10:02 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

torfsen added the Bug and Needs Triage labels on May 27, 2021
mroeschke added the IO Parquet and Timezones labels and removed the Needs Triage label on Aug 21, 2021
@mysterious-ben
I've just had exactly the same issue when I loaded PostgreSQL data with UTC timestamps and saved it as Parquet.
pandas==1.3.3 + pyarrow==5.0.0 + psycopg2==2.8.6

@ManuelZ
ManuelZ commented Nov 4, 2021

I'm having the same problem when I use a datetime with a timezone parsed with dateutil:

Minimal reproducible example:

import pandas as pd
import dateutil.parser  # import the submodule explicitly; a bare `import dateutil` may not load it

dt = "2021-10-07T07:24:06.126+02:00"
df = pd.DataFrame({
    "a" : [1,1,2],
    "timestamp" : [dateutil.parser.parse(dt)] * 3
})

df.to_parquet("temp.parquet")

Traceback (most recent call last):
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\fastparquet\util.py", line 306, in get_column_metadata
    pd.Series([pd.to_datetime('now')]).dt.tz_localize(stz)
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\core\accessor.py", line 92, in f
    return self._delegate_method(name, *args, **kwargs)
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\core\indexes\accessors.py", line 109, in _delegate_method
    result = method(*args, **kwargs)
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\core\indexes\datetimes.py", line 250, in tz_localize
    arr = self._data.tz_localize(tz, ambiguous, nonexistent)
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\core\arrays\datetimes.py", line 980, in tz_localize
    tz = timezones.maybe_get_tz(tz)
  File "pandas\_libs\tslibs\timezones.pyx", line 91, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "pandas\_libs\tslibs\timezones.pyx", line 114, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pytz\__init__.py", line 188, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'tzoffset(None, 7200)'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\core\frame.py", line 2455, in to_parquet
    return to_parquet(
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\io\parquet.py", line 390, in to_parquet
    impl.write(
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\pandas\io\parquet.py", line 279, in write
    self.api.write(
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\fastparquet\writer.py", line 943, in write
    fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\fastparquet\writer.py", line 739, in make_metadata
    get_column_metadata(data[column], column))
  File "C:\MY_HOME_DIR\anaconda3\lib\site-packages\fastparquet\util.py", line 314, in get_column_metadata
    raise ValueError("Time-zone information could not be serialised: "
ValueError: Time-zone information could not be serialised: tzoffset(None, 7200), please use another
>>> dateutil.__version__
'2.8.1'
>>> pd.__version__
'1.2.4'
>>> pytz.__version__
'2021.1'
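
One possible way around the `tzoffset(None, 7200)` failure shown in the traceback is to normalise the parsed value to UTC before building the DataFrame and, if the original offset matters, keep it in a separate column. This is only a sketch under those assumptions: the `utc_offset_min` column name is made up for illustration, and it has not been checked against every Parquet engine version.

```python
import datetime as dt

import pandas as pd
from dateutil import parser

raw = "2021-10-07T07:24:06.126+02:00"

# dateutil attaches its own tzoffset object, whose string form is not a
# timezone name that pytz can look up (see the traceback above).
parsed = parser.parse(raw)

# Keep the original offset as plain minutes so the local time can be rebuilt.
offset_minutes = int(parsed.utcoffset().total_seconds() // 60)

# Same instant, expressed in the stdlib UTC timezone.
as_utc = parsed.astimezone(dt.timezone.utc)

df = pd.DataFrame({
    "a": [1, 1, 2],
    "timestamp": [as_utc] * 3,
    "utc_offset_min": [offset_minutes] * 3,  # hypothetical column name
})
print(df.dtypes)
```

With the column stored as UTC, `df.to_parquet("temp.parquet")` should no longer have to serialise a dateutil offset object. The local wall-clock time can be recovered later by adding `utc_offset_min` minutes to the UTC timestamp.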


4 participants