Performance pd.HDFStore().keys() slow #17593


Closed
exrich opened this issue Sep 19, 2017 · 5 comments
Labels
IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance

Comments

@exrich

exrich commented Sep 19, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)
%timeit pd.HDFStore(path).keys()

Problem description

The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: 10.6 s for the code above just to get the list of keys in the store.

It appears the issue is related to the node walk in PyTables, which loads every single node just to check whether it is a group.

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""

    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)      # Is it a group?

    return group._f_iter_nodes(classname)

Profiling HDFStore.keys() confirms that almost all the time goes into get_node:

%lprun -f store._handle.iter_nodes store.keys()
%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                               def iter_nodes(self, where, classname=None):
  1999                                                   """Iterate over children nodes hanging from where.
  2000                                           
  2001                                                   Parameters
  2002                                                   ----------
  2003                                                   where
  2004                                                       This argument works as in :meth:`File.get_node`, referencing the
  2005                                                       node to be acted upon.
  2006                                                   classname
  2007                                                       If the name of a class derived from
  2008                                                       Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                       that class (or subclasses of it) will be returned.
  2010                                           
  2011                                                   Notes
  2012                                                   -----
  2013                                                   The returned nodes are alphanumerically sorted by their name.
  2014                                                   This is an iterator version of :meth:`File.list_nodes`.
  2015                                           
  2016                                                   """
  2017                                           
  2018      6001       125237     20.9     75.4          group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0          self._check_group(group)  # Is it a group?
  2020                                           
  2021      6001        14216      2.4      8.6          return group._f_iter_nodes(classname)

So if the dataframes are large and a single store holds many of them, this can take a very long time (my real-world code takes about a minute). My pandas version is old, but I don't think this has been fixed in later releases.

Also, I'm not sure whether to raise this in pandas or PyTables.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@TomAugspurger
Contributor

Yeah, I looked into this a while back (there may be an open issue; #16503 is related but different) but didn't come up with a solution.

> Also not sure whether to raise this in pandas or tables.

I'd say see if PyTables exposes an API to get the same info without the same overhead. If not, then raise an issue there.
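One way to get the same information without the per-node overhead is to bypass PyTables entirely and read the group names straight from the HDF5 file with h5py. This is a hedged sketch, not an official pandas API; the helper name is made up, and it only covers flat stores whose keys all live at the top level:

```python
import h5py

def fast_keys(path):
    """List top-level group names in an HDF5 file.

    h5py only reads the group's link names here, so this stays fast
    even for stores holding thousands of dataframes, unlike
    HDFStore.keys(), which instantiates a node object per key.
    """
    with h5py.File(path, "r") as f:
        return ["/" + name for name in f.keys()]
```

For a flat store written with store.put('test0', df), store.put('test1', df), etc., this returns the same names as HDFStore.keys(); nested keys would need f.visit() instead.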

@TomAugspurger TomAugspurger added IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance Difficulty Intermediate labels Sep 19, 2017
@jreback
Contributor

jreback commented Sep 20, 2017

closing as out of scope. this is an issue with PyTables and cannot be directly fixed in pandas.

@jreback jreback closed this as completed Sep 20, 2017
@jreback jreback added this to the No action milestone Sep 20, 2017
@MXS2514

MXS2514 commented Sep 22, 2017

I ran into similar problems recently.
Besides PyTables, it may also be caused by how rich pandas' on-disk structures are (many levels, tags, and properties). A Series should be faster than a DataFrame, but even then .keys() and len() on a saved HDF file are not fast enough.

If all you want is a dict-like {'name1': ndarray1, 'name2': ndarray2, ...} saved on disk with good len()/.keys() performance, maybe h5py is enough (0.005 s vs. 15 s for .keys()).

You can use ViTables to see how the data is saved and how complex the structure is.
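The suggestion above can be sketched as follows. The helper names are hypothetical, and it assumes the data really is just named arrays, with none of pandas' index/column metadata:

```python
import h5py
import numpy as np

def save_arrays(path, arrays):
    """Save a {name: ndarray} mapping to an HDF5 file via h5py."""
    with h5py.File(path, "w") as f:
        for name, arr in arrays.items():
            f.create_dataset(name, data=arr)

def array_names(path):
    """len() and keys() on an h5py File only touch the HDF5 link
    table, so they stay fast no matter how many arrays are stored."""
    with h5py.File(path, "r") as f:
        return len(f), sorted(f.keys())
```

The trade-off is that you lose pandas' indexing and dtype metadata, which is exactly the structure that makes HDFStore slow to enumerate.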

@ben-daghir

Found a temporary solution (at least for Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with that version anyway), then use the root attribute and the built-in dir() method:

store = pandas.HDFStore(file)
keys = dir(store.root)

Voilà! Good luck.
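Note that dir(store.root) also returns PyTables methods and private attributes alongside the stored keys. A filtered variant comes closer to what HDFStore.keys() returns for a flat store. This is a sketch relying on _v_children, a PyTables-internal mapping of child names, so it may change between versions:

```python
import pandas as pd

def keys_via_root(path):
    """List a flat HDFStore's keys from the root group's child names.

    NOTE: _v_children is a PyTables internal, not a public API.
    Listing its keys avoids HDFStore.keys(), which walks every node.
    """
    with pd.HDFStore(path, mode="r") as store:
        return ["/" + name for name in store.root._v_children]
```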

@TomAugspurger
Contributor

#21543 improved the performance of .groups. Is .keys still slow?

> I've had better overall performance with this version anyway

Can you see if we have open issues for those slowdowns?


5 participants