Performance pd.HDFStore().keys() slow #17593


Closed
exrich opened this issue Sep 19, 2017 · 5 comments
Labels
IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance

Comments

@exrich

exrich commented Sep 19, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)
%timeit pd.HDFStore(path).keys()

Problem description

The performance of pd.HDFStore().keys() is incredibly slow for a large store containing many dataframes: 10.6 s for the code above just to get the list of keys in the store.

It appears the issue is related to the node walk in PyTables, which loads every single node just to check whether it is a group.

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""

    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)      # Is it a group?

    return group._f_iter_nodes(classname)

Profiling HDFStore.keys() confirms that almost all the time goes into get_node:

%lprun -f store._handle.iter_nodes store.keys()
%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                               def iter_nodes(self, where, classname=None):
  1999                                                   """Iterate over children nodes hanging from where.
  2000                                           
  2001                                                   Parameters
  2002                                                   ----------
  2003                                                   where
  2004                                                       This argument works as in :meth:`File.get_node`, referencing the
  2005                                                       node to be acted upon.
  2006                                                   classname
  2007                                                       If the name of a class derived from
  2008                                                       Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                       that class (or subclasses of it) will be returned.
  2010                                           
  2011                                                   Notes
  2012                                                   -----
  2013                                                   The returned nodes are alphanumerically sorted by their name.
  2014                                                   This is an iterator version of :meth:`File.list_nodes`.
  2015                                           
  2016                                                   """
  2017                                           
  2018      6001       125237     20.9     75.4          group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0          self._check_group(group)  # Is it a group?
  2020                                           
  2021      6001        14216      2.4      8.6          return group._f_iter_nodes(classname)

So if the dataframes are large and a single store holds many of them, this can take a very long time (my real-world code takes about a minute). My pandas version is old, but I don't think this has been fixed in later releases.

Also, I'm not sure whether to raise this in pandas or PyTables.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

@TomAugspurger
Contributor

Yeah, I looked into this a while back (there may be an open issue; #16503 is related but different) but didn't come up with a solution.

> Also not sure whether to raise this in pandas or tables.

I'd say see if PyTables exposes an API to get the same info without the same overhead. If not, then raise an issue there.
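One way to get the same information without the per-node overhead is to bypass PyTables entirely and read the group names straight from the HDF5 file with h5py. This is a hedged sketch, not an official pandas API; the helper name is made up, and it only covers flat stores whose keys all live at the top level:

```python
import h5py

def fast_keys(path):
    """List top-level group names in an HDF5 file.

    h5py only reads the group's link names here, so this stays fast
    even for stores holding thousands of dataframes, unlike
    HDFStore.keys(), which instantiates a node object per key.
    """
    with h5py.File(path, "r") as f:
        return ["/" + name for name in f.keys()]
```

For a flat store written with store.put('test0', df), store.put('test1', df), etc., this returns the same names as HDFStore.keys(); nested keys would need f.visit() instead.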

@TomAugspurger TomAugspurger added IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance Difficulty Intermediate labels Sep 19, 2017
@jreback
Contributor

jreback commented Sep 20, 2017

closing as out of scope. this is an issue with PyTables and cannot be directly fixed in pandas.

@jreback jreback closed this as completed Sep 20, 2017
@jreback jreback added this to the No action milestone Sep 20, 2017
@MXS2514

MXS2514 commented Sep 22, 2017

I ran into similar problems recently.
Besides PyTables, it may also be caused by how rich pandas' on-disk structures are (many levels, tags, and properties). A Series should be faster than a DataFrame, but even then .keys() and len() on a saved HDF file are not fast enough.

If all you want is a dict-like {'name1': ndarray1, 'name2': ndarray2, ...} saved on disk with good len()/.keys() performance, maybe h5py is enough (0.005 s vs. 15 s for .keys()).

You can use ViTables to see how the data is saved and how complex the structure is.
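The suggestion above can be sketched as follows. The helper names are hypothetical, and it assumes the data really is just named arrays, with none of pandas' index/column metadata:

```python
import h5py
import numpy as np

def save_arrays(path, arrays):
    """Save a {name: ndarray} mapping to an HDF5 file via h5py."""
    with h5py.File(path, "w") as f:
        for name, arr in arrays.items():
            f.create_dataset(name, data=arr)

def array_names(path):
    """len() and keys() on an h5py File only touch the HDF5 link
    table, so they stay fast no matter how many arrays are stored."""
    with h5py.File(path, "r") as f:
        return len(f), sorted(f.keys())
```

The trade-off is that you lose pandas' indexing and dtype metadata, which is exactly the structure that makes HDFStore slow to enumerate.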

@ben-daghir

Found a temporary solution (at least for Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with that version anyway), then use the root attribute and the built-in dir() method:

store = pandas.HDFStore(file)
keys = dir(store.root)

Voilà! Good luck.
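Note that dir(store.root) also returns PyTables methods and private attributes alongside the stored keys. A filtered variant comes closer to what HDFStore.keys() returns for a flat store. This is a sketch relying on _v_children, a PyTables-internal mapping of child names, so it may change between versions:

```python
import pandas as pd

def keys_via_root(path):
    """List a flat HDFStore's keys from the root group's child names.

    NOTE: _v_children is a PyTables internal, not a public API.
    Listing its keys avoids HDFStore.keys(), which walks every node.
    """
    with pd.HDFStore(path, mode="r") as store:
        return ["/" + name for name in store.root._v_children]
```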

@TomAugspurger
Contributor

#21543 improved the performance of .groups. Is .keys still slow?

> I've had better overall performance with this version anyway

Can you see if we have open issues for those slowdowns?


5 participants