BUG: round-trip of tz in an index using fixed-format for HDF5 #8165

Closed
colbrac opened this issue Sep 3, 2014 · 10 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Timezones Timezone data dtype

Comments

@colbrac

colbrac commented Sep 3, 2014

xref #9270

With Python 2.7.6, pandas 0.13.1 and numpy 1.8.1, pytables 3.1.1, both 32 and 64bit (Python x,y and WinPython), I can load my hdf5 file.
With Python 3.4.1 64bit, pandas 0.14.1, numpy 1.8.2, pytables 3.1.1 (Anaconda3 2.0.1) I get the following error:

Traceback (most recent call last):

File "", line 1, in
test = pd.read_hdf('datafile.h5', 'data')

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 330, in read_hdf
return f(store, True)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 322, in
key, auto_close=auto_close, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 669, in select
auto_close=auto_close).get_values()

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1335, in get_values
results = self.func(self.start, self.stop)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2658, in read
ax = self.read_index('axis%d' % i)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2257, in read_index
_, index = self.read_index_node(getattr(self.group, key))

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2385, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\core\index.py", line 125, in __new__
result = DatetimeIndex(data, copy=copy, name=name, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 301, in __new__
infer_dst=infer_dst)

File "tslib.pyx", line 2165, in pandas.tslib.tz_localize_to_utc (pandas\tslib.c:33574)

File "tslib.pyx", line 2082, in pandas.tslib._get_deltas (pandas\tslib.c:32187)

File "tslib.pyx", line 872, in pandas.tslib._get_utcoffset (pandas\tslib.c:16036)

AttributeError: 'numpy.bytes_' object has no attribute 'utcoffset'

The index in question is:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-04-03 00:00:00+02:00, ..., 2013-04-04 00:00:00+02:00]
Length: 8641, Freq: 10S, Timezone: Europe/Amsterdam

@colbrac
Author

colbrac commented Sep 3, 2014

[Image: difference between storing in py2 and py3]

After regenerating the h5 from the source files, I get the differences in group / leaf properties as shown through ViTables above.

Note: the h5 file generated with Pandas 0.14.1 in Python 3 opens with Pandas 0.13.1 in Python 2 but not vice versa.

@jreback
Contributor

jreback commented Sep 3, 2014

pls show pd.show_versions() in each session you are trying
show all code (writing & reading)
show df.info() of the data

@jreback jreback added the HDF5 label Sep 3, 2014
@colbrac
Author

colbrac commented Sep 3, 2014

Great (not), while generating the output both datafiles fail to open in Py3:

############### Py2 ####################

import pandas as pd
pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: nl_NL

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.3
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.9.4
lxml: 3.3.5
bs4: 4.3.2
html5lib: 0.999
bq: None
apiclient: None

data = pd.read_hdf('datafile-py2.h5', 'received')
data_py3 = pd.read_hdf('datafile-py3.h5', 'received')

No errors.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10081 entries, 2014-08-11 00:00:00+02:00 to 2014-08-18 00:00:00+02:00
Freq: T
Data columns (total 1085 columns):
001e:5e09:0200:1b50 int32
001e:5e09:0200:1b51 int32
001e:5e09:0200:1b52 int32
(...)
dtypes: int32(1085)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10081 entries, 2014-08-25 00:00:00+02:00 to 2014-09-01 00:00:00+02:00
Freq: T
Data columns (total 1082 columns):
001e:5e09:0200:1b50 int32
001e:5e09:0200:1b51 int32
001e:5e09:0200:1b52 int32
(...)
dtypes: int32(1082)

############### Py3 ####################
import pandas as pd

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: nl_NL

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: None
IPython: 2.2.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

data = pd.read_hdf('datafile-py2.h5', 'received')
Traceback (most recent call last):

File "", line 1, in
data = pd.read_hdf('datafile-py2.h5', 'received')

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 330, in read_hdf
return f(store, True)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 322, in
key, auto_close=auto_close, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 669, in select
auto_close=auto_close).get_values()

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1335, in get_values
results = self.func(self.start, self.stop)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2658, in read
ax = self.read_index('axis%d' % i)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2257, in read_index
_, index = self.read_index_node(getattr(self.group, key))

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2385, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\core\index.py", line 125, in __new__
result = DatetimeIndex(data, copy=copy, name=name, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 301, in __new__
infer_dst=infer_dst)

File "tslib.pyx", line 2165, in pandas.tslib.tz_localize_to_utc (pandas\tslib.c:33574)

File "tslib.pyx", line 2082, in pandas.tslib._get_deltas (pandas\tslib.c:32187)

File "tslib.pyx", line 872, in pandas.tslib._get_utcoffset (pandas\tslib.c:16036)

AttributeError: 'numpy.bytes_' object has no attribute 'utcoffset'

data_py3 = pd.read_hdf('datafile-py3.h5', 'received')
Traceback (most recent call last):

File "", line 1, in
data_py3 = pd.read_hdf('datafile-py3.h5', 'received')

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 330, in read_hdf
return f(store, True)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 322, in
key, auto_close=auto_close, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 669, in select
auto_close=auto_close).get_values()

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 1335, in get_values
results = self.func(self.start, self.stop)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2658, in read
ax = self.read_index('axis%d' % i)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2257, in read_index
_, index = self.read_index_node(getattr(self.group, key))

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2385, in read_index_node
_unconvert_index(data, kind, encoding=self.encoding), **kwargs)

File "C:\Anaconda3\lib\site-packages\pandas\io\pytables.py", line 2182, in f
tz=tz)

File "C:\Anaconda3\lib\site-packages\pandas\tseries\index.py", line 480, in _simple_new
result.tz = tools._maybe_get_tz(tz)

File "C:\Anaconda3\lib\site-packages\pandas\tseries\tools.py", line 60, in _maybe_get_tz
tz = tslib.maybe_get_tz(tz)

File "tslib.pyx", line 1060, in pandas.tslib.maybe_get_tz (pandas\tslib.c:18462)

File "tslib.pyx", line 1073, in pandas.tslib.maybe_get_tz (pandas\tslib.c:18380)

File "C:\Anaconda3\lib\site-packages\pytz\__init__.py", line 171, in timezone
zone = _unmunge_zone(zone)

File "C:\Anaconda3\lib\site-packages\pytz\__init__.py", line 187, in _unmunge_zone
return zone.replace('plus', '+').replace('minus', '-')

TypeError: expected bytes, bytearray or buffer compatible object

The generating function contains the following (it munges a row-oriented csv file):

First, I extract a list of time strings from the csv readlines output, plus a 2D rawdata list of lists of numbers (as str).

# Convert the time strings to datetime objects
timestamps = pd.to_datetime(pd.Series([time[:10]+' '+time[-8:] for time in rawdata[0]]),
                            format="%Y-%m-%d %H:%M:%S")

# Convert the measurements to a dataframe, use 32bit integers
# to keep memory use low and add the timestamps as index
measurements = pd.DataFrame(np.array(rawdata[1:], dtype=np.int64).astype(np.int32).T)
measurements = measurements.set_index(timestamps).tz_localize('Europe/Amsterdam', infer_dst=True)


Then the 2000+ columns of measurements are distributed over resultd and resultr dataframes, no changes to the index or data in the columns. Afterwards stored in h5 file.


store = pd.HDFStore(resultfilename, 'w', complib='zlib', complevel=7)
store['delivered']=resultd
store['received']=resultr
store.close()

@jreback
Contributor

jreback commented Sep 3, 2014

This is related to #7777 and might be fixed there.

Can you provide a publicly accessible file? I will try this on master.

@colbrac
Author

colbrac commented Sep 3, 2014

Generated py2 and py3 versions with this script:
import pandas as pd

datafilename_py2 = 'datafile-py2.h5'
datafilename_py3 = 'datafile-py3.h5'

drange = pd.date_range('2014-09-01', '2014-09-02', freq='1h')
drange_loc = drange.tz_localize('Europe/Amsterdam')
test = pd.Series(range(25), index=drange_loc)
test.to_hdf(datafilename_py2, 'data', complib='zlib', complevel=7)
test_py2 = pd.read_hdf(datafilename_py2, 'data')
test_py3 = pd.read_hdf(datafilename_py3, 'data')

The files can be found here: https://www.mediafire.com/?cad8fn21r4eukf0,1njl9sz9s3d7d5g,5y8nkrbklfd0d3j

Using new consoles each time, the Py3 fails to open the Py2 datafile but now again successfully opens the Py3 version. The Py2 console opens both files successfully. Could this be related to the format of the time zone with the / in it?

Also:

  • the py2 version has not set the encoding user attribute.
  • both the py2 and py3 datafiles report pandas_version 0.10.1, which I think is incorrect.
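
One possible reading of the second traceback, offered as a guess and not a confirmed diagnosis: the slash in the zone name may not be the issue. `_unmunge_zone` calls `zone.replace('plus', '+')` with `str` arguments, and on Python 3 that call raises a `TypeError` when `zone` arrives as `bytes`. The sketch below reproduces only that mechanism with the standard library; `unmunge` is a hypothetical stand-in copied from the line shown in the traceback, not the pytz function itself.

```python
def unmunge(zone):
    """Mimic the str-based replace chain shown in the pytz traceback."""
    return zone.replace('plus', '+').replace('minus', '-')


# A text zone name passes through unchanged, slash included.
text_result = unmunge('Europe/Amsterdam')

# The same name as bytes fails on Python 3, because bytes.replace()
# does not accept str arguments.
try:
    unmunge(b'Europe/Amsterdam')
    bytes_failed = False
except TypeError:
    bytes_failed = True

print(text_result, bytes_failed)
```

If this reading is right, the difference between the files would lie in whether the tz attribute is read back as text or as bytes, which would also fit the missing encoding attribute noted above.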

@jreback
Contributor

jreback commented Sep 3, 2014

This might be a bug in saving it using 'fixed' format. This should work properly using table format,
e.g. to_hdf(......,format='table').
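
A minimal sketch of the table-format suggestion, reusing the Series from the test script earlier in this thread. It assumes pandas and PyTables are installed; the file name `datafile-table.h5` and the helper `roundtrip_table` are made up for illustration. It only shows a round trip within a single interpreter, so it does not by itself demonstrate the Py2 to Py3 case.

```python
import os
import tempfile

import pandas as pd


def roundtrip_table(path):
    """Write a tz-aware Series in table format and read it back."""
    index = pd.date_range('2014-09-01', '2014-09-02', freq='1h')
    series = pd.Series(range(25), index=index.tz_localize('Europe/Amsterdam'))
    # format='table' is the suggestion above; the default format is 'fixed'.
    series.to_hdf(path, key='data', format='table', complib='zlib', complevel=7)
    return series, pd.read_hdf(path, 'data')


with tempfile.TemporaryDirectory() as tmpdir:
    original, restored = roundtrip_table(os.path.join(tmpdir, 'datafile-table.h5'))

print(restored.index.tz)
```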

@jreback jreback added the Bug label Sep 3, 2014
@jreback jreback added this to the 0.15.1 milestone Sep 3, 2014
@jreback jreback changed the title from "Error loading Py2 to_hdf created h5 file in Py3 due to utc offset in datetimeindex" to "BUG: round-trip of tz in an index using fixed-format for HDF5" Sep 3, 2014
@colbrac
Author

colbrac commented Sep 4, 2014

I have confirmed that hdf5 files created in Python 2 with format='table' open successfully in Python 3.

Do you think it is possible to adapt the read_hdf() to cope with the existing hdf5 files? I have over 900 hdf5 files, not all with tz index but still enough for me to reconsider the big Py2->Py3 jump.

@jreback
Contributor

jreback commented Sep 4, 2014

not sure when this will get looked at
you are welcome to take a look

@colbrac
Author

colbrac commented Sep 4, 2014

I have taken a look and played around in the debugger, but I got lost in the DatetimeIndex __new__ method in tseries index.py.
Py34 with the Py2 h5 file: passes through this method twice. The first time tz is None; the second time tz is b'Europe/Amsterdam', but data.tz is None and the subarr already prints tz-aware strings.
Py34 with the Py3 h5 file: passes through this method only once.
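
The `b'Europe/Amsterdam'` value seen in the debugger suggests that the stored tz attribute comes back as bytes under Python 3. A minimal sketch of the kind of normalization one could apply before handing the value to a timezone lookup; `ensure_text_tz` is a hypothetical helper, not pandas code, and the UTF-8 default is an assumption.

```python
def ensure_text_tz(tz, encoding='utf-8'):
    """Return tz as str if it is bytes-like; pass other values through."""
    # numpy.bytes_ subclasses bytes, so this check covers it as well.
    if isinstance(tz, bytes):
        return tz.decode(encoding)
    return tz


print(ensure_text_tz(b'Europe/Amsterdam'))
```

Values that are already text, tzinfo objects, or None are returned untouched, so the helper is safe to call on whatever the reader finds in the attribute.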

@jreback
Contributor

jreback commented Oct 22, 2015

closing in favor of #11411

@jreback jreback closed this as completed Oct 22, 2015