Performance pd.HDFStore().keys() slow #17593
Comments
Yeah, I looked into this a while back (there may be an open issue, #16503 is related but different) but didn't come up with a solution.
I'd say check whether PyTables exposes an API to get the same information without the overhead. If not, then raise an issue there.
Closing as out of scope. This is an issue with PyTables and cannot be directly fixed in pandas.
I ran into similar problems recently. If you just want a dict-like collection {'name1': ndarray1, 'name2': ndarray2, ...} saved to disk for later use, with good performance for len() and .keys(), h5py may be enough (0.005 s vs 15 s for .keys()). You can use ViTables to see how the data is stored and how complex the structure is.
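The h5py alternative the commenter describes can be sketched as follows (file and dataset names are illustrative): a flat HDF5 file of named arrays, where listing keys only reads group member names rather than loading every node.

```python
import h5py
import numpy as np

# Sketch of the h5py alternative: store plain named arrays in one file.
# Listing keys is cheap because h5py enumerates group members without
# instantiating the underlying datasets.
with h5py.File("arrays.h5", "w") as f:
    f["name1"] = np.arange(10)
    f["name2"] = np.random.randn(5, 5)

with h5py.File("arrays.h5", "r") as f:
    print(list(f.keys()))  # ['name1', 'name2']
    print(len(f))          # 2
```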
Found a temporary solution (at least for Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with that version anyway), then use the store's root attribute and the built-in dir() method:
Voilà! Good luck.
improved the performance of
Can you see if we have open issues for those slowdowns?
Code Sample, a copy-pastable example if possible
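The original code sample was lost in the page capture; a hypothetical reconstruction along the lines the report describes (the store name, frame count, and frame sizes are all made up) might look like this:

```python
import time

import numpy as np
import pandas as pd

# Illustrative reproduction: put many dataframes into one HDFStore,
# then time how long keys() takes.
with pd.HDFStore("big_store.h5", mode="w") as store:
    for i in range(100):
        store.put(f"df_{i}", pd.DataFrame(np.random.randn(1000, 10)))

with pd.HDFStore("big_store.h5", mode="r") as store:
    t0 = time.perf_counter()
    keys = store.keys()
    print(f"{len(keys)} keys in {time.perf_counter() - t0:.3f}s")
```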
Problem description
The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: it takes 10.6 seconds for the above code just to get a list of the keys in the store.
It appears the issue is related to path_walk in PyTables requiring every single node to be loaded in order to check whether it is a group.
/tables/file.py
Therefore, if the dataframes are large and there are many of them in one store, this can take a very long time (my real-life code takes about a minute). My version of pandas is older, but I don't think this has been fixed in subsequent versions.
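One way to illustrate the point is to walk only *group* nodes through the underlying PyTables handle, skipping the per-leaf loading described above. This is a sketch, not a supported API: `store._handle` is a private attribute, the filename is made up, and whether this is actually faster depends on the file layout.

```python
import pandas as pd

# Hypothetical workaround sketch: list pandas-style keys by walking only
# group nodes via the underlying PyTables file handle (private attribute
# _handle), keeping groups that carry a pandas_type attribute.
def fast_keys(store: pd.HDFStore) -> list:
    return [
        group._v_pathname
        for group in store._handle.walk_groups()
        if getattr(group._v_attrs, "pandas_type", None) is not None
    ]

with pd.HDFStore("demo_fast.h5", mode="w") as store:
    store.put("df", pd.DataFrame({"a": [1, 2, 3]}))
    print(fast_keys(store))  # e.g. ['/df']
```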
I'm also not sure whether to raise this against pandas or PyTables.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None