COMPAT: reading generic PyTables Table format fails with sub-selection #11188

Closed
rabernat opened this issue Sep 25, 2015 · 14 comments · Fixed by #26818
Labels
Bug IO HDF5 read_hdf, HDFStore
Comments

@rabernat

I created a file using pytables, and now I would like to read it into pandas. My (naive) expectation was that these two tools were compatible. But I am getting an error

pd.read_hdf(output_fname, '/floats/trajectories', start=0, stop=10)

gives

ValueError: Shape of passed values is (1, 10), indices imply (1, 753664023)

The file is 44 GB, so I can't really post it. I would be happy to post the h5dump --head metadata if that would help.

(Cross post with dask/dask#747.)

@jreback
Contributor

jreback commented Sep 25, 2015

@rabernat

pandas can read a Table from PyTables. There is simply not enough metadata to generically read other HDF5 constructs. Sure, it could be done in some way, but it is not a very easy/nice use case. The pandas-written formats can be read by PyTables, but might not be semantically readable.

@jreback jreback closed this as completed Sep 25, 2015
@jreback jreback added Usage Question IO HDF5 read_hdf, HDFStore labels Sep 25, 2015
@rabernat
Author

pandas can read a Table from PyTables

My file was created by pytables. That's why I was surprised it didn't work.

@jreback
Contributor

jreback commented Sep 26, 2015

well you would have to show a specific case then

@rabernat
Author

The example below reproduces the error. The problem is with the start and stop kwargs. Without those, it works fine. But they are necessary if dask is going to be able to chunk the file.

import tables
import pandas as pd
import numpy as np

output_fname = 'test.h5'
class LFloat(tables.IsDescription):
    npart   = tables.Int32Col(pos=1)   # float id number, starts at 1
    time    = tables.Float32Col(pos=2)  # time of the datapoint
    x       = tables.Float32Col(pos=3)  # x position
    y       = tables.Float32Col(pos=4)  # y position
    z       = tables.Float32Col(pos=5)  # z position

dtype = tables.description.dtype_from_descr(LFloat)

nrecs = 10
# note: the original report used the camelCase PyTables API (openFile,
# createGroup, createTable, createIndex), which is deprecated/removed in
# modern PyTables; the snake_case equivalents are used here
with tables.open_file(output_fname, mode='w', title='Float Data') as h5file:
    group = h5file.create_group("/", 'floats', 'Float Data')
    table = h5file.create_table(group, 'trajectories', LFloat,
                                "Float Trajectories", expectedrows=nrecs)
    for n in range(nrecs):
        d = np.zeros(1, dtype)  # zero-filled so every column has a defined value
        d['npart'] = n
        table.append(d)

    table.cols.npart.create_index()
    table.flush()

df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)

@jreback
Contributor

jreback commented Sep 26, 2015

ok, this is a bug in reading these tables. I never had a need to do this, so it's untested. If you want to submit a pull request with a fix, that would be great. The issue is here, I think.

@jreback jreback reopened this Sep 26, 2015
@jreback jreback added this to the Next Major Release milestone Sep 26, 2015
@jreback
Contributor

jreback commented Sep 26, 2015

Note that with simple floats, ints, and strings you would be ok. But any other data types will simply fail (e.g. datetimes). You are almost certainly better off saving this using the pandas format (which is readable by PyTables).
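For illustration, writing the same kind of data with the pandas `format='table'` writer makes the `start`/`stop` sub-selection work and round-trips datetimes correctly. This is a minimal sketch; the file and key names are made up, and PyTables must be installed:

```python
import numpy as np
import pandas as pd

# hypothetical file/key names, chosen for this sketch
df = pd.DataFrame({
    "time": pd.date_range("2015-09-26", periods=10, freq="h"),
    "x": np.arange(10, dtype="float32"),
})
df.to_hdf("pandas_table.h5", key="traj", format="table")

# start/stop sub-selection works on the pandas-written table,
# and the datetime column survives the round trip
subset = pd.read_hdf("pandas_table.h5", "traj", start=0, stop=5)
print(len(subset), subset["time"].dtype)
```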

@jreback jreback changed the title can't read pytables hdf COMPAT: reading generic PyTables Table format fails with sub-selection Sep 26, 2015
@rabernat
Author

Thanks @jreback! I'll see what I can do.

I am not married to pytables for my application. I chose it because it allows me to write hdf files incrementally and thus scales to out-of-core file sizes. If pandas can do that directly, then I would rather use pure pandas. But my impression was that you have to first create the whole pandas dataframe in memory and then serialize to hdf. Correct?

@jreback
Contributor

jreback commented Sep 26, 2015

certainly not. You can create what you need, then stream to hdf, very much like PyTables. This was in fact the reason it was built. You can chunk writes/reads.

Just create what you need, serialize, repeat. This is `format='table'`, which allows querying and appending. pandas also supports `format='fixed'`, which is a fixed-size PyTables structure that is much faster for IO (almost like pure numpy arrays), but allows neither appending nor querying.

See docs here.
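The create/serialize/repeat workflow described above can be sketched as follows (file and key names are hypothetical; PyTables must be installed). Each chunk is appended to the store without ever holding the full dataset in memory:

```python
import numpy as np
import pandas as pd

# hypothetical names; write the table in chunks of 25 rows
with pd.HDFStore("chunked.h5", mode="w") as store:
    for lo in range(0, 100, 25):
        chunk = pd.DataFrame({
            "npart": np.arange(lo, lo + 25),
            "x": np.zeros(25, dtype="float32"),
        })
        # append() writes format='table', which is appendable and queryable
        store.append("trajectories", chunk, expectedrows=100)

# chunked reads via start/stop work on the pandas-written table
part = pd.read_hdf("chunked.h5", "trajectories", start=0, stop=10)
print(len(part))
```

This is the pattern that lets dask (or any out-of-core consumer) process the file piece by piece.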

@rabernat
Author

Just tried your suggestion. I ended up with segfaults when trying to append to my HDFStore. I think this is related to #10672. I guess I need to update pytables to 3.2.1.

@jreback
Contributor

jreback commented Sep 26, 2015

3.2 is buggy yep

@mchwalisz
Contributor

mchwalisz commented Apr 7, 2017

@jreback Could you point me again to the source of the issue? I think the link you provided earlier is no longer valid.

I'm also encountering this issue and looking for a solution. In my use case I'm using PyTables to collect measurement data into an hdf5 file in the table format. I can easily collect a 20 GB (blosc-compressed) file and am currently not able to process it.

I'm also encountering a MemoryError if the index is bigger than RAM, but will prepare a new bug report for that.

@jreback
Contributor

jreback commented Apr 7, 2017

#13267 fixed this for format='fixed'.
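For illustration, the `format='fixed'` case that #13267 addressed: sub-selection with `start`/`stop` on a pandas-written fixed-format file now works. This is a sketch with made-up file and key names, requiring PyTables:

```python
import numpy as np
import pandas as pd

# hypothetical names; format='fixed' is fast but not appendable/queryable
df = pd.DataFrame({"x": np.arange(100.0)})
df.to_hdf("fixed_demo.h5", key="data", format="fixed")

# start/stop sub-selection on the fixed-format store
part = pd.read_hdf("fixed_demo.h5", "data", start=0, stop=10)
print(part["x"].tolist())
```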

@jgehrcke
Contributor

#13267 fixed this for format='fixed'.

I do not quite understand what this means. @rabernat's comment above shows a nice minimal working example demonstrating the problem that he found, that @mchwalisz also found (and that I am hitting right now). Can we modify the example (ideally just the reading part, not the data generation part) so that it works?

@jgehrcke
Contributor

jgehrcke commented Jun 12, 2019

@jreback Could you point me again to the source of the issue? I think link you provided earlier is not valid anymore.

0.17.0 was released shortly after @jreback's comment (Sep 2015) and I think this is the line @jreback was pointing us to: https://github.com/pandas-dev/pandas/blob/v0.17.0/pandas/io/pytables.py#L1660

Now here:

self.values = Int64Index(np.arange(self.table.nrows))
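That line builds an index spanning all `table.nrows` rows even when `start`/`stop` select only a subset, so the sliced values and the full-length index disagree. A small self-contained illustration of the resulting mismatch (this is not pandas internals, just the analogous shape error, with made-up sizes):

```python
import numpy as np
import pandas as pd

nrows = 20          # pretend the table on disk has 20 rows
start, stop = 0, 5  # sub-selection requested by the caller

values = np.arange(nrows)[start:stop]  # only 5 values are actually read...
index = np.arange(nrows)               # ...but the index covers all 20 rows

raised = False
try:
    pd.DataFrame(values, index=index)
except ValueError as exc:
    raised = True
    print(exc)  # shape mismatch, analogous to the reported error
```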

jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 12, 2019
jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 12, 2019
jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 18, 2019
jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 18, 2019
jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 21, 2019
jgehrcke added a commit to jgehrcke/pandas that referenced this issue Jun 21, 2019
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Jun 21, 2019