BUG: MemoryError on reading big HDF5 files #15937
Comments
You are trying to read what is essentially an incompatible format.
Wait. I'm not using [...]. I run into problems as soon as either the data doesn't fit in memory (see #11188) or even just the index (this bug) doesn't fit. Do you have any guidelines on what I should change in my process to make it work?
sorry, I looked quickly. yes, create things using [...]
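A minimal sketch of that suggestion as I read it (my own illustration, not from the thread; it assumes the advice is simply to write the file through pandas in `table` format so it can later be queried and read in chunks; the file name `example.h5` and key `foo` are placeholders):

```python
import numpy as np
import pandas as pd

# Write through pandas in 'table' format so the store supports select/chunked reads.
df = pd.DataFrame(np.random.randn(10**6, 3), columns=list('ABC'))
df.to_hdf('example.h5', 'foo', format='table', append=True, complib='blosc')

# Read it back in bounded chunks instead of loading everything at once.
for chunk in pd.read_hdf('example.h5', 'foo', chunksize=10**5):
    pass  # process each chunk here
```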
I still have that problem using pure pandas. I created the example file using:

```python
import os

import pandas as pd
import numpy as np
import tqdm


def sizeof_fmt(num, suffix='B'):
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


path = 'test.h5'
set_size = 10**6

if os.path.exists(path):
    print('Old size: {}'.format(sizeof_fmt(os.stat(path).st_size)))
    os.remove(path)

with pd.get_store(path) as store:
    for _ in tqdm.trange(10**3 * 1):
        df = pd.DataFrame(np.random.randn(set_size, 3), columns=list('ABC'))
        try:
            nrows = store.get_storer('foo').nrows
        except AttributeError:
            nrows = 0
        df.index = pd.Series(df.index) + nrows
        store.put('/foo', df, format='table', append=True, complib='blosc')

print(sizeof_fmt(os.stat(path).st_size))
```

which produces a 24GB test file. Then trying to read it with:

```python
import pandas as pd
from tqdm import tqdm

path = 'test.h5'
experiment = '/foo'
i = 0

with pd.get_store(path) as store:
    for df in tqdm(store.select(experiment, chunksize=100)):
        i += 1

print(i)
```

I have:

>>> pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.19.2
Should I put it as a new issue?
yep that looks odd
Shouldn't the issue be reopened? What are the next steps?
you can open a new issue
@mchwalisz it would be helpful to have a complete (but simple) copy-pastable example, IOW generate the data, then repro the error without anything external. (You create the file above), but it needs to be a minimal example (without tqdm), and just a simple repro.
I've encountered a similar problem, only when using [...]. I have no idea where this limitation of 31 items comes from, but I think it would make sense to (maybe only when some flag is set) read and filter the data in chunks, or, if the filter involves only the index, extract the needed row numbers and only then select the data. Does that make sense?
the 31 limit is because of [...]
Ah, thanks for pointing me to the [...]. I just tried to use it and found that if a user needs to extract rows from some window (a set of indices that doesn't span the whole data, just some part of it), no matter whether contiguous or not, using [...]
yes, selecting a contiguous slice, then doing an in-memory subselection is way more performant. In combination with chunking this handles most out-of-core cases (though it does need some intelligence, because you can end up selecting a huge range).
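As an illustration of that pattern (my own sketch, not from the thread; it assumes a table written as in the examples here, with key `foo` and an integer index, and a set of wanted labels far larger than the 31-item limit mentioned above):

```python
import pandas as pd

path = 'test.h5'   # assumed: table created as in the examples in this thread
key = 'foo'
wanted = set(range(0, 10**7, 997))   # many more than 31 labels

pieces = []
with pd.HDFStore(path, mode='r') as store:
    # Pull contiguous chunks, then subselect in memory instead of pushing
    # a huge `where=` expression down to PyTables/numexpr.
    for chunk in store.select(key, chunksize=10**6):
        pieces.append(chunk[chunk.index.isin(wanted)])

result = pd.concat(pieces)
```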
Here is a short example of the bug I'm encountering:

```python
import pandas as pd
import numpy as np

path = 'test.h5'
set_size = 10**5

with pd.HDFStore(path) as store:
    for _ in range(10**5):
        df = pd.DataFrame(np.random.randn(set_size, 3), columns=list('ABC'))
        try:
            nrows = store.get_storer('foo').nrows
        except AttributeError:
            nrows = 0
        df.index = pd.Series(df.index) + nrows
        store.put('/foo', df, format='table', append=True, complib='blosc')

print('Finished creating file')

i = 0
with pd.HDFStore(path, mode='r') as store:
    for df in store.select('/foo', chunksize=1000):
        i = i + 1

print('finished, {}'.format(i))
```

Traceback: [...]

pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.19.2
so you have 10B rows?
In this example, yes. I expect it will be dependent on the amount of RAM. In this case it will fail if the number of rows * 8 bytes per row ([...]) is larger than the available RAM.
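As a rough illustration of that arithmetic (my own numbers; it assumes one 8-byte int64 coordinate per row, as suggested above):

```python
rows_per_chunk = 10**5      # set_size in the example above
n_chunks = 10**5            # loop iterations in the example above
bytes_per_row = 8           # one int64 row coordinate (assumption)

total_rows = rows_per_chunk * n_chunks        # 10,000,000,000 rows
index_bytes = total_rows * bytes_per_row      # 80,000,000,000 bytes
print(index_bytes / 1024**3)                  # ~74.5 GiB just for the row coordinates
```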
Are there any solutions to this?
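One possible workaround, sketched below (my own illustration, not confirmed in the thread; it relies on the `start`/`stop` arguments of `HDFStore.select`, which read a bounded row range and should avoid materializing a coordinate array for the whole table):

```python
import pandas as pd

path = 'test.h5'   # assumed: table created as in the examples above
key = 'foo'
chunk = 10**6      # rows per chunk; tune to available memory

n_chunks = 0
with pd.HDFStore(path, mode='r') as store:
    nrows = store.get_storer(key).nrows
    for start in range(0, nrows, chunk):
        # Only `chunk` rows (and their index) are held in memory at a time.
        df = store.select(key, start=start, stop=start + chunk)
        # ... process df here ...
        n_chunks += 1

print(n_chunks)
```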
Code Sample, a copy-pastable example if possible
Result:
Problem description
I'm not able to iterate over the chunks of the file when the index array is too big and cannot fit into memory. I can also mention that I'm able to view the data with ViTables (which uses PyTables internally to load the data). I'm using more or less the following code to create the file (writing to it long enough to have 20GB of data).
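For what it's worth, the data can also be inspected directly with PyTables in bounded row ranges (my own sketch; it assumes the layout pandas uses for `format='table'`, where the data sits in a `table` node under the group, here `/foo/table`):

```python
import tables

with tables.open_file('test.h5', mode='r') as h5:
    node = h5.get_node('/foo/table')
    print(node.nrows)                  # total number of rows on disk
    rows = node.read(start=0, stop=5)  # structured array with just these rows
    print(rows)
```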
Expected Output
I would expect the above code to print the number of chunks.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None