HDFStore __unicode__ very slow for many keys #16503

Kiv · 2017-05-25T20:43:48Z

Code Sample, a copy-pastable example if possible

import pandas as pd
store = pd.HDFStore('test.h5', 'w')
for i in range(5000):
    store.put('table_{}'.format(i), pd.DataFrame([i]))

%time str(store)
CPU times: user 26.1 s, sys: 156 ms, total: 26.2 s
Wall time: 26.2 s

Problem description

The __unicode__ method of HDFStore iterates over all the keys in the file to create the string representation. For files with many keys this operation becomes extremely slow and dumps an excessive amount of output to the console.

Worse, this completely bogs down PyCharm's debugger because it calls str(store) for every store that's in scope, on every step.

Expected Output

__unicode__ should be a fast operation - just showing the file path would be sufficient. Detailed info on all the keys could be a separate method, if it is needed at all.
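As a rough illustration of the behaviour requested above, here is a minimal pure-Python sketch. The `KeyedStore` class and its `info()` method are hypothetical stand-ins, not pandas code: the string representation only reports the file path, while the per-key listing lives in a separate method that runs only when asked for.

```python
class KeyedStore:
    """Hypothetical stand-in for a file-backed store with many keys."""

    def __init__(self, path, keys):
        self.path = path
        self._keys = list(keys)

    def __repr__(self):
        # Cheap: never touches the keys, so a debugger can call this on every step.
        return "{}\nFile path: {}".format(type(self).__name__, self.path)

    def info(self):
        # Linear in the number of keys; only runs when explicitly requested.
        lines = [repr(self)]
        lines.extend("/{}".format(key) for key in sorted(self._keys))
        return "\n".join(lines)


store = KeyedStore("test.h5", ("table_{}".format(i) for i in range(5000)))
summary = str(store)   # falls back to __repr__, independent of the key count
detail = store.info()  # walks all 5000 keys
```

With this split, tools that call `str()` implicitly pay a fixed cost, and the expensive listing becomes opt-in.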

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-78-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
boto: None
pandas_datareader: None

jreback · 2017-05-25T20:54:34Z

please show an example with timings.

Kiv · 2017-05-25T21:01:57Z

Updated description with timings. It appears to be linear in the number of keys - at 500 keys it's 2.6 seconds which is already too slow for PyCharm.

jreback · 2017-05-25T21:19:10Z

I think there is another issue about this (but can't find it at a quick glance). Sure could change __unicode__ to just the info about the filepath (the first line). Can add .info() to do this.

want to do a PR?

Kiv · 2017-05-26T14:25:18Z

Sure, this will be my first one but I can give it a try.

TomAugspurger · 2017-09-19T18:23:19Z

Closed by #16666

jreback added the IO Data, Difficulty Intermediate, IO HDF5, and Performance labels May 25, 2017

jreback added this to the Next Major Release milestone May 25, 2017

Kiv mentioned this issue May 26, 2017: PERF: HDFStore __unicode__ method #16514 (closed)

jreback modified the milestones: 0.21.0, Next Major Release May 26, 2017

TomAugspurger closed this as completed Sep 19, 2017

TomAugspurger mentioned this issue Sep 19, 2017: Performance pd.HDFStore().keys() slow #17593 (closed)
