Skip to content

BUG: Column-major DataFrames stored in HDFStore are returned as row-major #22073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wesm opened this issue Jul 27, 2018 · 7 comments · Fixed by #47616
Closed

BUG: Column-major DataFrames stored in HDFStore are returned as row-major #22073

wesm opened this issue Jul 27, 2018 · 7 comments · Fixed by #47616
Labels
good first issue IO HDF5 read_hdf, HDFStore Needs Tests Unit test(s) needed to prevent regressions

Comments

@wesm
Copy link
Member

wesm commented Jul 27, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': ['foo', 'bar', 'baz', 'qux'],
    'c': [5, 6, 7, 8]
})

print(df['a'].values.strides)

store = pd.HDFStore('example.h5')
store['df'] = df

print(store['df']['a'].values.strides)
## -- End pasted text --
(8,)
(16,)

Problem description

I ran across this when doing some benchmarking. This has some rather serious performance implications for large DataFrames. Is this the result of an underlying limitation in HDF5?

@gfyoung gfyoung added the IO HDF5 read_hdf, HDFStore label Jul 27, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 27, 2018

Is this the result of an underlying limitation in HDF5?

I wonder if it has to deal with how we store the values. Tracing the code leads me somewhere to here:

pandas/pandas/io/pytables.py

Lines 4163 to 4165 in f433061

block = make_block(values, placement=np.arange(len(cols_)))
mgr = BlockManager([block], [cols_, index_])
frames.append(DataFrame(mgr))

@TannhauserGate42
Copy link

TannhauserGate42 commented May 23, 2019

I think this problem could even be more fundamental ...

Pandas copy and groupby-sum aggregations (and maybe other operations) change the major-order on the underlying data of the returned object.

This has a huge impact on aggregation performance.

Pandas should not do that implicitly.

@mroeschke mroeschke added the Bug label May 16, 2020
@mroeschke
Copy link
Member

This looks correct on master. I suppose it could use a test

In [12]: import pandas as pd
    ...:
    ...: df = pd.DataFrame({
    ...:     'a': [1, 2, 3, 4],
    ...:     'b': ['foo', 'bar', 'baz', 'qux'],
    ...:     'c': [5, 6, 7, 8]
    ...: })
    ...:
    ...: print(df['a'].values.strides)
    ...:
    ...: store = pd.HDFStore('example.h5')
    ...: store['df'] = df
    ...:
    ...: print(store['df']['a'].values.strides)
(8,)
(8,)

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug IO HDF5 read_hdf, HDFStore labels Jun 21, 2021
@jbrockmendel jbrockmendel added the IO HDF5 read_hdf, HDFStore label Dec 21, 2021
@johnmantios
Copy link
Contributor

are we looking for unit test or performance test here?

@mroeschke
Copy link
Member

are we looking for unit test or performance test here?

Unit test

@johnmantios
Copy link
Contributor

johnmantios commented Jul 5, 2022

take

@jorisvandenbossche
Copy link
Member

I think the core issue isn't actually solved. The only reason we now get back a proper column-major dataframe is because the read() makes a copy of the data (this happens in concat, by default it makes a copy of the input):

pandas/pandas/io/pytables.py

Lines 3207 to 3212 in a0071f9

if len(dfs) > 0:
out = concat(dfs, axis=1)
out = out.reindex(columns=items, copy=False)
return out
return DataFrame(columns=axes[0], index=axes[1])

But at this point, before the concat call, the data is still in row-major format. So if we want to avoid making this additional copy, while preserving column-major layout, we need to investigate why a column-major dataframe data gets stored / returned as row-major.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue IO HDF5 read_hdf, HDFStore Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants