BUG: MemoryError on reading big HDF5 files #15937

Open
mchwalisz opened this issue Apr 7, 2017 · 17 comments
Labels: Bug, IO HDF5 (read_hdf, HDFStore)

@mchwalisz (Contributor)

Code Sample, a copy-pastable example if possible

import pandas as pd

store = pd.get_store('big1.h5')  # pd.get_store is an alias for pd.HDFStore
i = 0
for df in store.select('/MeasurementSamples', chunksize=100):
    i += 1  # count the chunks
print(i)
store.close()

Result:

Traceback (most recent call last):
  File "memerror.py", line 6, in <module>
    for df in store.select(experiment, chunksize=100):
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 721, in select
    return it.get_result()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 1409, in get_result
    self.coordinates = self.s.read_coordinates(where=self.where)
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 3652, in read_coordinates
    coords = self.selection.select_coords()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 4718, in select_coords
    return np.arange(start, stop)
MemoryError
Closing remaining open files:big1.h5...done

Problem description

I'm not able to iterate over the chunks of the file when the index array is too big to fit into memory. Note that I am able to view the data with ViTables (which uses PyTables internally to load the data).

I used roughly the following code to create the file (writing to it long enough to accumulate 20 GB of data).

import tables as tb

fname = 'big1.h5'       # assumed: matches the file read in the example above
title = 'Measurements'  # assumed title

class FreqSample(tb.IsDescription):
    tsf = tb.Int64Col(dflt=-1)  # [us] TSF value, ticks in microseconds
    timestamp = tb.Int64Col()   # [ns] Epoch time
    frequency = tb.Float64Col()
    power = tb.Float64Col()

h5filters = tb.Filters(complib='blosc', complevel=5)
h5file = tb.open_file(fname, mode="a",
                      title=title,
                      filters=h5filters)
tab = h5file.create_table('/Measurement', 'a', FreqSample,
                          createparents=True)  # create the parent group too

try:
    while True:
        row = tab.row
        row['tsf'] = 1
        row['timestamp'] = 2
        row['frequency'] = 3
        row['power'] = 4
        row.append()
except KeyboardInterrupt:  # run until interrupted (original used a bare except)
    pass
tab.autoindex = True
tab.flush()
h5file.close()

Expected Output

I would expect the above code to print the number of chunks.

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback (Contributor) commented Apr 7, 2017

HDFStore is based on PyTables, see the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5

You are trying to read an essentially incompatible format. HDFStore can read a plain PyTables format table (i.e. one without any of the metadata that HDFStore uses), but h5py is not supported.

@jreback jreback closed this as completed Apr 7, 2017
@jreback jreback added this to the won't fix milestone Apr 7, 2017
@jreback jreback added the IO HDF5 (read_hdf, HDFStore) and Usage Question labels Apr 7, 2017
@mchwalisz (Contributor, Author)

Wait, I'm not using h5py at all. I'm also able to import datasets created with exactly the same method without problems using pd.read_hdf(), with the limitation that everything has to fit into memory.

I run into problems as soon as either the data doesn't fit into memory (see #11188) or even the index (this bug) doesn't fit.

Do you have any guidelines on what I should change in my process to make it work?

@jreback (Contributor) commented Apr 7, 2017

Sorry, I only looked quickly.

Yes, create things using HDFStore and then it will work. Directly creating tables in PyTables is barely supported (and will likely never be).

@mchwalisz (Contributor, Author)

I still have the problem using pure pandas. I created an example file using:

import os
import pandas as pd
import numpy as np
import tqdm


def sizeof_fmt(num, suffix='B'):
    """Format a byte count as a human-readable string."""
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


path = 'test.h5'
set_size = 10**6
if os.path.exists(path):
    print('Old size: {}'.format(sizeof_fmt(os.stat(path).st_size)))
    os.remove(path)

with pd.get_store(path) as store:
    for _ in tqdm.trange(10**3):  # 1000 appends of 10**6 rows -> 10**9 rows total
        df = pd.DataFrame(np.random.randn(set_size, 3), columns=list('ABC'))
        try:
            nrows = store.get_storer('foo').nrows  # rows already in the table
        except AttributeError:
            nrows = 0  # table does not exist yet

        df.index = pd.Series(df.index) + nrows  # keep the index monotonically increasing
        store.put('/foo', df, format='table', append=True, complib='blosc')

print(sizeof_fmt(os.stat(path).st_size))

which produces a 24 GB test file. Then, trying to read it with:

import pandas as pd
from tqdm import tqdm

path = 'test.h5'
experiment = '/foo'

i = 0
with pd.get_store(path) as store:
    for df in tqdm(store.select(experiment, chunksize=100)):
        i += 1
    print(i)

I get:

Traceback (most recent call last):
  File "/home/chwalisz/Code/temp/memerror.py", line 9, in <module>
    for df in tqdm(store.select(experiment, chunksize=100)):
  File "/home/chwalisz/Code/miniconda3/envs/wishful/lib/python3.6/site-packages/pandas/io/pytables.py", line 724, in select
    return it.get_result()
  File "/home/chwalisz/Code/miniconda3/envs/wishful/lib/python3.6/site-packages/pandas/io/pytables.py", line 1412, in get_result
    self.coordinates = self.s.read_coordinates(where=self.where)
  File "/home/chwalisz/Code/miniconda3/envs/wishful/lib/python3.6/site-packages/pandas/io/pytables.py", line 3643, in read_coordinates
    coords = self.selection.select_coords()
  File "/home/chwalisz/Code/miniconda3/envs/wishful/lib/python3.6/site-packages/pandas/io/pytables.py", line 4709, in select_coords
    return np.arange(start, stop)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Linux
OS-release: 4.4.0-71-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_EN.utf8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

@mchwalisz (Contributor, Author)

Should I file this as a new issue?

@jreback (Contributor) commented Apr 27, 2017

Yep, that looks odd.

@mchwalisz (Contributor, Author)

Shouldn't the issue be reopened? What are the next steps?

@jreback (Contributor) commented May 5, 2017

You can open a new issue.
Please show a very minimal example.

@jorisvandenbossche jorisvandenbossche removed this from the won't fix milestone May 6, 2017
@jreback (Contributor) commented May 6, 2017

@mchwalisz it would be helpful to have a complete (but simple) copy-pastable example, i.e. one that generates the data and reproduces the error without anything external. You do create the file above, but we need a minimal example (without tqdm), just a simple repro.

@jreback jreback added this to the Next Major Release milestone May 6, 2017
@p-himik commented May 7, 2017

I've encountered a similar problem, but when using a where similar to 'index in 200_items_list' rather than chunksize. As far as I can see, when where is not a list of row indices and its condition consists of more than 31 items (see computation.pytables.BinOp._max_selectors), the whole data is read into memory.

I have no idea where this limitation of 31 items comes from, but I think it would make sense (maybe only when some flag is set) to read and filter the data in chunks or, if the filter involves only the index, to extract the needed row numbers and only then select the data.

Does this make sense?

@jreback (Contributor) commented May 7, 2017

The 31-selector limit comes from numexpr. You wouldn't normally hit it if you follow http://pandas.pydata.org/pandas-docs/stable/io.html#advanced-queries
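
For illustration, a minimal sketch (not from this thread) of one way to stay under that limit: split a long membership list into sub-queries of at most 31 values and concatenate the results. The names path, key, and items are hypothetical; variables referenced in a where string are resolved from the local scope.

import pandas as pd

path, key = 'test.h5', '/foo'
items = list(range(0, 2000, 10))  # hypothetical long membership list
max_selectors = 31                # numexpr limit mentioned above

parts = []
with pd.HDFStore(path, mode='r') as store:
    for i in range(0, len(items), max_selectors):
        sub = items[i:i + max_selectors]  # at most 31 selectors per query
        parts.append(store.select(key, where='index in sub'))
result = pd.concat(parts)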

@p-himik commented May 7, 2017

Ah, thanks for pointing me to the select_as_coordinates method! I couldn't find a way to extract only the index, but it does what I need.

I just tried to use it and found that if a user needs to extract rows from some window (a set of indices that doesn't span the whole data, just some part of it), whether contiguous or not, using store.select(path, ['index >= window.min()', 'index <= window.max()', 'index in window']) may be orders of magnitude faster than using select_as_coordinates. Would it make sense to include the >= and <= conditions automatically, based on the values provided with in?

@jreback (Contributor) commented May 7, 2017

Yes, selecting a contiguous slice and then doing an in-memory subselection is way more performant. In combination with chunking this handles most out-of-core cases (though it does need some intelligence, because you can end up selecting a huge range).
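
A minimal sketch of that pattern (not from the thread; path, key, and window are hypothetical): select the contiguous slice bounded by the window in one query, then subselect the exact rows in memory.

import numpy as np
import pandas as pd

path, key = 'test.h5', '/foo'
window = np.array([5, 17, 42, 99])  # hypothetical sparse set of index values

lo, hi = int(window.min()), int(window.max())
with pd.HDFStore(path, mode='r') as store:
    # one contiguous range query covering the window...
    df = store.select(key, where='index >= lo & index <= hi')
# ...then the exact in-memory subselection
df = df[df.index.isin(window)]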

@mchwalisz (Contributor, Author)

Here is a short example of the bug I'm encountering:

import pandas as pd
import numpy as np

path = 'test.h5'
set_size = 10**5

with pd.HDFStore(path) as store:
    for _ in range(10**5):  # 10**5 appends of 10**5 rows each -> 10**10 rows total
        df = pd.DataFrame(np.random.randn(set_size, 3), columns=list('ABC'))
        try:
            nrows = store.get_storer('foo').nrows  # rows already in the table
        except AttributeError:
            nrows = 0  # table does not exist yet

        df.index = pd.Series(df.index) + nrows  # keep the index monotonically increasing
        store.put('/foo', df, format='table', append=True, complib='blosc')
print('Finished creating file')

i = 0
with pd.HDFStore(path, mode='r') as store:
    for df in store.select('/foo', chunksize=1000):
        i = i + 1
print('finished, {}'.format(i))

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/chwalisz/.miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/io/pytables.py", line 724, in select
    return it.get_result()
  File "/home/chwalisz/.miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/io/pytables.py", line 1412, in get_result
    self.coordinates = self.s.read_coordinates(where=self.where)
  File "/home/chwalisz/.miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/io/pytables.py", line 3643, in read_coordinates
    coords = self.selection.select_coords()
  File "/home/chwalisz/.miniconda3/envs/pandas-dev/lib/python3.6/site-packages/pandas/io/pytables.py", line 4709, in select_coords
    return np.arange(start, stop)
MemoryError
pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_EN.utf8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback (Contributor) commented May 8, 2017

so you have 10B rows?

@mchwalisz (Contributor, Author)

In this example, yes. I expect the failure point to depend on the amount of RAM.

In this case it fails whenever the number of rows times 8 bytes per row (the int64 coordinate array from np.arange) exceeds the system's RAM.
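
For the example above, that is a back-of-the-envelope calculation one can check directly:

rows = 10**5 * 10**5        # 10**5 appends of 10**5 rows each = 10**10 rows
index_bytes = rows * 8      # np.arange materializes one int64 per row
print(index_bytes / 2**30)  # ~74.5 GiB for the coordinate array alone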

@sajaddarabi

Are there any solutions to this?
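
One possible workaround, sketched here under the assumption that reading with explicit start/stop bounds (rather than chunksize) sidesteps the full coordinate array in this pandas version; the key '/foo' and the chunk size are hypothetical:

import pandas as pd

path, key = 'test.h5', '/foo'
chunksize = 100000

with pd.HDFStore(path, mode='r') as store:
    nrows = store.get_storer(key).nrows  # row count from table metadata
    n_chunks = 0
    for start in range(0, nrows, chunksize):
        # read rows [start, start + chunksize) directly
        chunk = store.select(key, start=start, stop=start + chunksize)
        n_chunks += 1
print(n_chunks)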

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022