HDFStore __unicode__ very slow for many keys #16503

Closed
Kiv opened this issue May 25, 2017 · 5 comments
Labels
IO Data (IO issues that don't fit into a more specific label), IO HDF5 (read_hdf, HDFStore), Performance (Memory or execution speed performance)

@Kiv
Contributor

Kiv commented May 25, 2017

Code Sample, a copy-pastable example if possible

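import pandas as pd

# Create a store holding 5000 single-row tables.
store = pd.HDFStore('test.h5', 'w')
for i in range(5000):
    store.put('table_{}'.format(i), pd.DataFrame([i]))

# Time the string representation (IPython magic; its output follows).
%time str(store)
CPU times: user 26.1 s, sys: 156 ms, total: 26.2 s
Wall time: 26.2 s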

Problem description

The __unicode__ method of HDFStore iterates over all the keys in the file to create the string representation. For files with many keys this operation becomes extremely slow and dumps an excessive amount of output to the console.

Worse, this completely bogs down PyCharm's debugger because it calls str(store) for every store that's in scope, on every step.
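
To make the cost concrete, here is a minimal stand-in sketch. It is not pandas code: CountingStore and simulate_debugger are hypothetical names, and the per-key work is only counted rather than read from an HDF5 file. It assumes, as described above, that the string form touches every key and that a debugger renders the object once per step.

```python
class CountingStore:
    """Hypothetical stand-in for a store whose string form lists every key."""

    def __init__(self, n_keys):
        self.keys = ["table_{}".format(i) for i in range(n_keys)]
        self.key_visits = 0  # number of per-key lookups triggered by str()

    def _describe(self, key):
        # A real store would read node metadata here; this sketch only counts.
        self.key_visits += 1
        return "/{}    frame".format(key)

    def __str__(self):
        lines = ["<CountingStore>"]
        lines.extend(self._describe(k) for k in self.keys)
        return "\n".join(lines)


def simulate_debugger(store, steps):
    """Mimic a debugger that renders the object once per step."""
    for _ in range(steps):
        str(store)
    return store.key_visits


visits = simulate_debugger(CountingStore(5000), steps=10)
print(visits)  # 50000 per-key lookups for only 10 debugger steps
```

The total work is steps times keys, so even a handful of debugger steps repeats the full pass over the file many times.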

Expected Output

__unicode__ should be a fast operation; just showing the file path would be sufficient. Detailed info on all the keys could be a separate method, if it is needed at all.
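
A minimal sketch of that split, using a hypothetical PathOnlyStore class rather than the real HDFStore: the string form reports only the type and file path, and a separate info() method builds the per-key listing on request. The class, its attributes, and the output layout are assumptions for illustration only.

```python
class PathOnlyStore:
    """Hypothetical store: cheap string form, per-key detail on request."""

    def __init__(self, path, keys):
        self.path = path
        self._keys = list(keys)

    def __repr__(self):
        # Never touches the keys, so the cost does not grow with the file.
        return "{}\nFile path: {}".format(type(self), self.path)

    def info(self):
        # Linear in the number of keys, but only runs when explicitly called.
        lines = [repr(self)]
        lines.extend("/{}".format(k) for k in self._keys)
        return "\n".join(lines)


store = PathOnlyStore("test.h5", ("table_{}".format(i) for i in range(5000)))
print(store)                            # two lines, regardless of key count
print(len(store.info().splitlines()))   # 5002
```

With this shape, a debugger or console that calls str() or repr() automatically only ever pays the cheap path, while the full listing stays available to anyone who asks for it.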

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented May 25, 2017

Please show an example with timings.

@Kiv
Contributor Author

Kiv commented May 25, 2017

Updated description with timings. It appears to be linear in the number of keys: at 500 keys it takes 2.6 seconds, which is already too slow for PyCharm.
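
As a rough check of the linear claim, the sketch below uses only the two timings reported in this thread (2.6 seconds at 500 keys, 26.2 seconds at 5000 keys). Two data points cannot prove linearity, so treat this as a sanity check rather than a measurement.

```python
# Timings quoted in this thread: (number of keys, seconds for str(store)).
measurements = [(500, 2.6), (5000, 26.2)]

for n_keys, seconds in measurements:
    per_key_ms = seconds / n_keys * 1000
    print("{} keys: {:.2f} ms per key".format(n_keys, per_key_ms))

# If the cost is linear, the ratio of times should track the ratio of key counts.
(n_small, t_small), (n_large, t_large) = measurements
time_ratio = t_large / t_small
key_ratio = n_large / n_small
print("time ratio {:.2f} vs key ratio {:.2f}".format(time_ratio, key_ratio))
```

The per-key cost comes out at roughly 5 ms in both cases, which is consistent with a cost proportional to the number of keys.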

@jreback
Contributor

jreback commented May 25, 2017

I think there is another issue about this (but I can't find it at a quick glance). Sure, we could change __unicode__ to show just the info about the filepath (the first line), and add an .info() method to provide the detailed listing.

Want to do a PR?

jreback added the IO Data, Difficulty Intermediate, IO HDF5, and Performance labels May 25, 2017
jreback added this to the Next Major Release milestone May 25, 2017
@Kiv
Contributor Author

Kiv commented May 26, 2017

Sure, this will be my first one but I can give it a try.

jreback modified the milestones: 0.21.0, Next Major Release May 26, 2017
@TomAugspurger
Contributor

Closed by #16666

3 participants