RangeIndex is converted to Int64Index on save to HDF5 (to_hdf) #19997

Closed
vfilimonov opened this issue Mar 5, 2018 · 3 comments
Labels
Duplicate Report (duplicate issue or pull request), IO HDF5 (read_hdf, HDFStore)

Comments

@vfilimonov
Contributor

Hello, I'm not sure whether this is intended behavior, and I could not find any mention of it in the documentation or in the GitHub issue tracker. I'm filing it in case it was not planned to work this way.

Problem description

When a pandas.DataFrame is saved to an HDF5 file, its RangeIndex is converted to an Int64Index, which can add noticeably to the stored size for long tables.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 2))
df.index

results in RangeIndex(start=0, stop=1000, step=1)

Then

df.to_hdf('tmp.h5', 'df')
df = pd.read_hdf('tmp.h5', 'df')
df.index

results in Int64Index([ 0, 1, ..., 999], dtype='int64', length=1000)
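Until the round trip preserves the index type, one possible workaround is to rebuild the RangeIndex after reading. The sketch below is only an illustration: `restore_range_index` is a hypothetical helper, not part of pandas, and it assumes the index that comes back holds evenly spaced signed integers.

```python
import numpy as np
import pandas as pd


def restore_range_index(index):
    """Return an equivalent RangeIndex when `index` is an evenly spaced
    signed-integer index; otherwise return `index` unchanged.

    Hypothetical helper for illustration, not a pandas API.
    """
    if isinstance(index, pd.RangeIndex):
        return index
    # Only plain signed-integer NumPy dtypes are considered here.
    if not (isinstance(index.dtype, np.dtype) and index.dtype.kind == "i"):
        return index
    if len(index) == 0:
        return index
    values = index.to_numpy()
    start = int(values[0])
    if len(values) == 1:
        return pd.RangeIndex(start, start + 1, 1, name=index.name)
    step = int(values[1]) - start
    # A zero step or uneven spacing cannot be expressed as a range.
    if step == 0 or not (np.diff(values) == step).all():
        return index
    stop = int(values[-1]) + step
    return pd.RangeIndex(start, stop, step, name=index.name)


# Stand-in for the integer index that comes back from read_hdf.
materialized = pd.Index(np.arange(1000, dtype="int64"))
restored = restore_range_index(materialized)
```

After `pd.read_hdf`, this would be applied as `df.index = restore_range_index(df.index)`. The check costs one pass over the index, and it leaves any index that is not a plain integer range untouched.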

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@max-sixty
Contributor

Is there a more efficient way of representing a range in HDF5?
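For context on this question: a RangeIndex is fully described by three integers (start, stop, step), whereas a materialized int64 index needs 8 bytes per label. The sketch below only compares the in-memory representations. It says nothing about how HDFStore lays data out on disk, and storing the three parameters instead of the values is an assumption about what a compact format could look like, not something pandas currently does.

```python
import numpy as np
import pandas as pd

n = 1000
range_index = pd.RangeIndex(n)
int_index = pd.Index(np.arange(n, dtype="int64"))

# Three integers are enough to describe the whole range.
params = (range_index.start, range_index.stop, range_index.step)

# The range can be rebuilt losslessly from those parameters.
rebuilt = pd.RangeIndex(*params)

# The materialized index stores every label: 8 bytes per int64 value.
materialized_bytes = int_index.nbytes
print(params, materialized_bytes)
```

The materialized size grows linearly with the table length, while the three parameters stay constant.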

@jreback
Contributor

jreback commented Mar 5, 2018

Duplicate of #8319.

It's not worth trying to finesse this; rather, just have an option to turn it off.

@jreback jreback closed this as completed Mar 5, 2018
@jreback jreback added the IO HDF5 (read_hdf, HDFStore) and Duplicate Report (duplicate issue or pull request) labels Mar 5, 2018
@jreback jreback added this to the No action milestone Mar 5, 2018
@jreback
Contributor

jreback commented Mar 5, 2018

PRs to fix this are welcome!
