BUG: PeriodIndex inconsistent deserialization with HDF5 - PyTables #41978

ra1nty · 2021-06-13T03:52:39Z

I have checked that this issue has not already been reported.
An issue from about 5 years ago reported that .to_hdf() behaves inconsistently across Python 2 and 3 for a PeriodIndex in fixed format:
DataFrame with PeriodIndex written in Python2 gets an Int64Index when read back in Python3 #16781
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
The bug exists on master, but the behavior is different; see the next comment.

I noticed that deserializing a pandas Series/DataFrame with a PeriodIndex from an HDF5 file is inconsistent when using the PyTables table format (format='table'): the index of the retrieved Series/DataFrame comes back as an Int64Index instead of a PeriodIndex. See the code below for an example.

import pandas as pd
store = pd.HDFStore('test.h5')
series = pd.Series(index=pd.date_range(start='2015-01', end='2016-01', freq='M'), data=0).to_period('M')
df = pd.DataFrame(index=pd.date_range(start='2015-01', end='2016-01', freq='M'), data=0, columns=['a']).to_period('M')
store.put('/a/a', series, format='table')
store.put('/a/b', df, format='table')

store.select('/a/a')

Output:

540    0
541    0
542    0
543    0
544    0
545    0
546    0
547    0
548    0
549    0
550    0
551    0
dtype: int64

store.select('/a/b').index

Output:

Int64Index([540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551], dtype='int64')
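
The integers in the output above look like period ordinals (months since 1970-01 for freq='M'), so until the bug is fixed the index can be rebuilt by hand after reading. The following is only a minimal sketch of such a workaround: the helper name `restore_period_index` is hypothetical, the frequency has to be known by the caller because it is not recovered from the store, and the DataFrame below merely simulates what `store.select` returned.

```python
import pandas as pd


def restore_period_index(ordinals, freq):
    """Rebuild a PeriodIndex from integer period ordinals (hypothetical helper)."""
    # Build each Period from its ordinal; the caller must supply the frequency.
    return pd.PeriodIndex([pd.Period(ordinal=int(o), freq=freq) for o in ordinals])


# Simulated result of store.select('/a/b'): the index holds ordinals, not periods.
read_back = pd.DataFrame({"a": [0, 0, 0]}, index=[540, 541, 542])

# Replace the integer index with the reconstructed PeriodIndex.
read_back.index = restore_period_index(read_back.index, freq="M")

# What the index was expected to be after the round trip.
expected = pd.period_range("2015-01", periods=3, freq="M")
```

Building one `Period` per element is slow for large indexes; it is used here only because it relies on a long-standing public constructor.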

Problem description

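With the PyTables table format, a PeriodIndex written to an HDF5 file is read back as an Int64Index of period ordinals, so the round trip does not preserve the index type.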

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2cb9652
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 20.3.1
setuptools : 51.0.0.post20201207
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.4.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None


ra1nty · 2021-06-13T06:28:36Z

So I have figured out the issue:
_get_data_and_dtype_name in
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L5070
uses Index.asi8 to store the int64 values of the PeriodIndex,

but that case is not handled on read in DataCol.convert and IndexCol.convert:
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L2400
https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/pytables.py#L3644

On the master branch the issue still exists, but it raises a TypeError instead, because DataCol.convert and IndexCol.convert do not use the correct index factory:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L2077

The fixed format has no problem with PeriodIndex in either master or v1.2.4 and handles the conversion correctly.
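
To illustrate the mechanism described above without touching an HDF5 file, here is a small sketch that assumes only the public `asi8` attribute: it shows that the int64 values carry no frequency, so a generic `Index` built from them cannot come back as a PeriodIndex. It does not reproduce the actual code path in pandas/io/pytables.py.

```python
import pandas as pd

idx = pd.period_range("2015-01", periods=3, freq="M")

# What gets written: the raw int64 ordinals, with the frequency stripped.
stored = idx.asi8

# What a generic index factory gives back from those values alone.
plain = pd.Index(stored)

print(stored.dtype, type(plain).__name__)
```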

ra1nty · 2021-06-13T19:35:26Z

E.g., a simple but not clean fix would be to add a special case in IndexCol.convert when choosing the index factory:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/pytables.py#L2077

factory = Index
if is_datetime64_dtype(values.dtype) or is_datetime64tz_dtype(values.dtype):
    factory = DatetimeIndex
elif "freq" in kwargs:
    # workaround for PeriodIndex
    def f(values, freq=None, **kwargs):
        parr = PeriodArray._simple_new(values, freq=freq)
        return PeriodIndex._simple_new(parr, **kwargs)
    factory = f

From my understanding, TimedeltaIndex and DatetimeIndex are covered by the first if branch because the correct dtype is implemented for them. If 'freq' is still in kwargs, the index is a PeriodIndex. The workaround works on my local machine for now, but I haven't had a chance to look into the pandas codebase in depth.

ra1nty · 2021-06-13T19:57:02Z

I also noticed that neither the fixed nor the table format can store values whose underlying array is a PeriodArray: the fixed format raises a readable TypeError, while the table format raises a TypeError without clear information. I think this should be fixed as well.
Code to reproduce:

series_p = pd.Series(data=pd.date_range(start='2015-01', end='2016-01', freq='M').to_period('M'))
store.put('/a/c', series_p, format='fixed')
store.put('/a/d', series_p, format='table')

Output (master & v1.2.4):
Fixed

TypeError: objects of type ``PeriodArray`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes

PyTables

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Period'
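
Until Period values are supported, one possible way around both errors is to convert the column to timestamps before storing and back to periods after reading. This is only a sketch of the conversion step, assuming the frequency ('M' here) is known when reading back; the HDF5 write and read in between are left out.

```python
import pandas as pd

series_p = pd.Series(pd.period_range("2015-01", periods=3, freq="M"))

# Before store.put(...): convert Period values to datetime64 timestamps.
as_timestamps = series_p.dt.to_timestamp()

# After store.select(...): convert back, supplying the known frequency.
restored = as_timestamps.dt.to_period("M")
```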

ra1nty · 2022-01-20T23:46:30Z

@mroeschke Is it OK if I start working on this since it's confirmed? I was able to patch my local pandas last year but haven't had time to get back to it since then.

mroeschke · 2022-01-20T23:51:27Z

Sure go for it @ra1nty

ra1nty added the Bug and Needs Triage labels Jun 13, 2021

ra1nty changed the title ~~BUG: DataFrame/Series with PeriodIndex inconsistent deserialization with HDF5 - PyTables~~ BUG: PeriodIndex inconsistent deserialization with HDF5 - PyTables Jun 13, 2021

mroeschke added the IO HDF5 and Period labels and removed the Needs Triage label Aug 21, 2021

rickecon mentioned this issue Nov 18, 2022: Stop skipping tests in test_against_taxsim.py TheCGO/fiscalsim-us#17 (open)
