HDF5: empty groups and keys #29916
Comments
I get the impression that there are two distinct users of the HDFStore API. User #1 is probably the one that was hit by the performance bug of tickets #21543 and #21372.

we are not going to add methods, but wouldn't object to a filter keyword that defaults to the current behavior
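Not anyone's actual patch, just a sketch of what such a filter keyword could look like, written as a free function against HDFStore rather than a method change; the name hdf_keys, the include parameter, and the 'native' option are assumptions for illustration only:

```python
from typing import List

import pandas as pd


def hdf_keys(store: pd.HDFStore, include: str = "pandas") -> List[str]:
    """Sketch of a filterable keys(): 'pandas' keeps the current behavior,
    'native' also lists PyTables Table nodes that pandas did not write."""
    if include == "pandas":
        # default: exactly what HDFStore.keys() returns today
        return store.keys()
    if include == "native":
        handle = store._handle  # underlying tables.File (private attribute)
        return [node._v_pathname for node in handle.walk_nodes("/", classname="Table")]
    raise ValueError(f"include must be 'pandas' or 'native', got {include!r}")
```

Defaulting the keyword to the pandas-only listing would leave existing callers unaffected, which seems to be the constraint expressed above.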
Hi,

Shouldn't no. 2 be the majority of use cases? It could also just be a documentation issue; people should be aware that .keys() only lists pandas objects.

The current behavior of …

What about adding a keyword for this?
I can't speak for the original author, and the code seems to have evolved gradually, but for the last few years it was based on .walk_nodes(). Maybe the pytables version is not the most efficient, but it would be nice to have some way of getting all the nodes in an hdf5 file with pandas (as was possible before), regardless of whether they can be converted or not.

@jreback What do you think about my proposal of adding an optional keyword argument?

it's ok, kind of -0 on it, as intermixing other non-pandas tables is an anti-pattern; probably wouldn't object to a PR though

But isn't reading data from hdf5 files produced by other software one of the ways of getting data into pandas? I'm a little confused here.

I am working on a small patch, but I am struggling a little to get my pandas test environment up and running (I was going the non-anaconda route and it does not really work as documented).

Hi, all the data sets behind the keys listed in the "expected output" (OP) are perfectly fine to import with pandas. They are just not listed anymore, which leads to downstream problems with e.g. dask. Right now I am using a hand-patched version of pandas to have at least a working setup.
Thanks, yes that was my impression too and I use it in this way. Instead of adding a keyword, one could fall back to .walk_nodes() when .walk_groups() finds nothing, something like:

    groups = [
        ...  # nodes from self._handle.walk_groups()
    ]
    if not groups:
        groups = [
            ...  # nodes from self._handle.walk_nodes()
        ]
    return groups
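A slightly fuller, hedged version of that fallback, again as a free function rather than a patch to HDFStore.groups(); the pandas_type check is only a rough stand-in for the real filtering pandas applies to the walk_groups() results:

```python
import pandas as pd


def all_groups(store: pd.HDFStore) -> list:
    """Sketch: prefer the fast walk_groups() listing, but fall back to
    walking every node if nothing pandas-flavored is found."""
    handle = store._handle  # underlying tables.File (private attribute)

    # roughly the current behavior: only groups carrying pandas metadata
    groups = [
        g
        for g in handle.walk_groups()
        if getattr(g._v_attrs, "pandas_type", None) is not None
    ]

    # assumed fallback: nothing pandas-made was found, so list every node
    # (groups, tables, arrays) so files written by other software show up too
    if not groups:
        groups = list(handle.walk_nodes("/"))

    return groups
```

This keeps the fast path for pure pandas files and only pays for the full walk when the cheap listing comes back empty.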
@jreback Could you have a look at my pull request? It passes all tests except for the typing validation, but I don't see what is actually wrong with the code or how to fix it.

Please could a moderator review what's happening with this? If @roberthdevries' solution is merged, I can still use pandas for my project (which is processing data from a non-pandas system).

Hi, this issue has been fixed downstream in dask. As long as the files contain hdf5 "tables" they should work by accessing the hdf5 path directly; at least this worked and still works for me. The problem in the current issue is not that pandas could not import the files, but that the previous change in the API led to downstream problems, i.e., they could no longer be imported with dask.

Btw, re-reading your comment @roberthdevries, I noticed that point 1 is not quite true. I can only assume that it was a design decision to list only pandas-native dataframes, and to mitigate the performance issues.
Hi,

With some of the hdf5 files I have, pandas.HDFStore.groups() returns an empty list (as does .keys(), which iterates over the groups). However, the data are accessible via .get() or .get_node().

This is related to #21543 and #21372, where the .groups() logic was changed, in particular to use self._handle.walk_groups() instead of self._handle.walk_nodes(); the relevant line is now pandas/pandas/io/pytables.py, line 1212 at ea2e26a.
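For readers less familiar with PyTables, a minimal sketch (not pandas code) of how the two walkers differ; the file name and layout are made up for the example:

```python
import numpy as np
import tables

# Write one dataset with PyTables directly, i.e. not through pandas.
with tables.open_file("example.h5", mode="w") as h5:
    h5.create_array("/data", "raw", np.arange(10), createparents=True)

with tables.open_file("example.h5", mode="r") as h5:
    # walk_groups() yields only Group objects: '/', '/data'
    print([g._v_pathname for g in h5.walk_groups()])
    # walk_nodes() yields groups plus leaves, e.g. '/data/raw' as well
    print([n._v_pathname for n in h5.walk_nodes("/")])
```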
Current Output

Expected Output

List of groups and keys as visible with e.g. h5dump.

Note: Changing the aforementioned line back to use .walk_nodes() fixes the issue and lists the groups and keys properly.

Fix
One solution would be (I guess) to revert #21543, another to fix at least .keys() to use ._handle.walk_nodes() instead of .groups() in pandas/pandas/io/pytables.py, line 562 at ea2e26a.
Could also be that it is a bug in pytables.
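For context, a minimal repro sketch of the behavior described above, assuming a file whose only dataset was written by PyTables directly rather than by pandas; the names are made up, and the empty list is what the report describes for pandas 0.25:

```python
import numpy as np
import pandas as pd
import tables

# Create a file that pandas did not write.
with tables.open_file("native.h5", mode="w") as h5:
    h5.create_array("/data", "raw", np.arange(10), createparents=True)

with pd.HDFStore("native.h5", mode="r") as store:
    print(store.keys())                 # reported as [] since the walk_groups() change
    print(store.get_node("/data/raw"))  # ...yet the node is still reachable
```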
Problem background
I was trying to figure out why some hdf5 files open fine with pandas but fail with dask. The reason is that dask allows wildcards and iterates over the keys to find valid ones. If .keys() is empty, reading the files with dask fails.

Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.27.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None