Error in storing large dataframe to HDFStore #2773
Can you post the DataFrame summary (str(df))?
Sure, it takes a while to build so I'll try and get it done in the next couple of days.
@jostheim is this still an issue?
Closing b/c I am not sure, and I put a CSV-style way around this in our other thread.
Hi guys,
Well @z00b2008, without any detail it's impossible to know.
Good point, some details:

-- pd.show_versions():
commit: None
pandas: 0.13.1

-- df.info():

-- my code:
data.to_hdf('/tmp/truc.h5', 'df')

-- error:
/usr/lib/python2.7/dist-packages/pandas/io/pytables.pyc in (store)
/usr/lib/python2.7/dist-packages/pandas/io/pytables.pyc in put(self, key, value, format, append, **kwargs)
/usr/lib/python2.7/dist-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
/usr/lib/python2.7/dist-packages/pandas/io/pytables.pyc in write(self, obj, **kwargs)
/usr/lib/python2.7/dist-packages/pandas/io/pytables.pyc in write_array(self, key, value, items)
/usr/lib/python2.7/dist-packages/tables/vlarray.pyc in append(self, sequence)
/usr/lib/python2.7/dist-packages/tables/hdf5extension.so in tables.hdf5extension.VLArray._append (tables/hdf5extension.c:18234)()
OverflowError: value too large to convert to int

Notes:
works fine with only the following warning:
/usr/lib/python2.7/dist-packages/pandas/io/pytables.py:2446: PerformanceWarning: warnings.warn(ws, PerformanceWarning)
By the way, does this kind of warning suggest that I'm storing the values of these columns wrongly?
Some of your object columns actually contain integers AND strings. This is not allowed (and hence the error). Aside from being very inefficient, this probably confuses the writer, which expects uniform types (or strings), and NOT actual objects. It ends up pickling them. So clean/eliminate that data and you should be good. You probably always want to store in table format.
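To make that concrete, here is a minimal sketch (mine, not from the thread) of how such mixed-type object columns can be found and cleaned; coercing everything to str is just one possible fix:

```python
import pandas as pd

# A column mixing ints AND strings ends up as dtype 'object', which the
# HDF writer can only pickle rather than map to a uniform c-type.
df = pd.DataFrame({'ok': ['a', 'b', 'c'], 'mixed': ['a', 1, 2.5]})

for col in df.select_dtypes(include=['object']).columns:
    kinds = df[col].map(type).unique()
    if len(kinds) > 1:
        print(col, kinds)              # flags the 'mixed' column
        df[col] = df[col].astype(str)  # coerce everything to str
```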
Hi! It seems to work, as I can do something like:

However, I still cannot save the whole dataframe to one file.
So do something like this to figure out what is causing the error, then post it here:
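The exact snippet did not survive here, but the per-column idea presumably looks something like this (illustrative; 'debug.h5' is a placeholder and df is the offending frame):

```python
import pandas as pd

# Write each column under its own key; the failing column identifies
# itself in the printed output.
store = pd.HDFStore('debug.h5')
for col in df.columns:
    try:
        store.put('col_{0}'.format(col), df[[col]])
    except Exception as e:
        print(col, type(e), e)
store.close()
```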
Hi!
This ran fine with no warning, no error.
Hi guys!
Hi, I also get the same error. Any news on this? Why is this issue closed?
This issue is closed because the OP and subsequent poster didn't provide enough info to even see if there was a problem. If you are having an issue, provide: the output of pd.show_versions(), the output of df.info(), and a small reproducible example, as in the snippet below.
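For reference, that boils down to running something like this on the offending frame (df here stands in for your data):

```python
import pandas as pd

pd.show_versions()   # pandas / python / library versions
df.info()            # dtypes, non-null counts, memory usage
print(df.head())     # a small sample, if the data can be shared
```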
I believe I gave all the info you asked for except maybe the real data.
@z00b2008 I still have no idea what your data looks like. I get that sometimes you cannot provide data and so must ask obliquely.
@z00b2008 and you didn't report whether using table format fixed the issue.
Well, that's right. My last comment suggests that storing chunks of the data (having corrected the mixed types as you suggested) runs fine, but storing the whole dataset still gives the same error. I believe I got the same issue with the table format, but it's been a while so I could be wrong. Let me check that too and get back to you.
Any solution for this issue so far?

Code resulting in the error:
df.to_hdf('df_1M_sorted_with_datetime.h5','table')

In [52]: df.info()

INSTALLED VERSIONS
commit: None
pandas: 0.15.2

Thanks,
I am having the same problem as well on the most recent version. The data are at https://www.kaggle.com/c/avito-context-ad-clicks/download/AdsInfo.tsv.7z, but you need to join and agree to the competition rules in order to download. (BTW, I really like the following comment:

Here is a dump of the relevant info:

In [30]: pd.show_versions()

INSTALLED VERSIONS
commit: None
pandas: 0.16.2
@macd the
Thanks. I tried to parse as strings (not this example), but was still getting the error.
no, strings are stored as
try |
Thanks for the tip, but it didn't work. There are two columns that I want to store.
Strings are just fine. You are trying to store an object. This simply won't work, and is in general not recommended for use inside a dataframe. Use base types (floats, ints, datetimes, strings). Storing a dict should be done in another way, e.g. in another table.
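One way to handle a dict column "in another way" (a sketch with placeholder names, not the thread's own code) is to serialize the dicts to JSON strings before writing and parse them after reading:

```python
import json
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'params': [{'a': 1}, {'b': 2}]})

# dicts -> plain strings, which HDFStore handles fine
df['params'] = df['params'].map(json.dumps)
df.to_hdf('out.h5', 'df', format='table')

# strings -> dicts on the way back
back = pd.read_hdf('out.h5', 'df')
back['params'] = back['params'].map(json.loads)
```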
Hey guys, I've got the same problem. I am thinking this might be caused by the fact that my data has missing values in some columns, and a missing value would make this column an 'object' type (this might depend on your data source: SQL, CSV, etc.). Did anyone try specifying dtypes for all columns? It might work; at least it seems to work in my case. Thanks.
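A sketch of that idea (the file and column names are placeholders): pin the dtypes at read time, or coerce leftover object columns afterwards:

```python
import pandas as pd

# Pin dtypes up front so missing values don't leave columns as 'object'.
df = pd.read_csv('data.csv', dtype={'name': str, 'count': float})

# Or coerce any remaining object columns to plain strings afterwards.
# (Careful: astype(str) turns NaN into the literal string 'nan'.)
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].astype(str)
```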
I also just ran into this issue today trying to save a df I'd just fetched from Redshift. I'd run this same
I was able to find the column causing the issue! It's a column with JSON objects, in my case. Sorry I can't give too many more details; this is for my job using the company's proprietary data. But if there's any info I can provide that helps with debugging, I'm happy to provide as much as I can. To start, here are the results of
@ClimbsRocks you will get one of two effects, as seen below. If it's a fixed format, your data will be pickled. If table, then it will raise.
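A small sketch of those two effects (file names are placeholders; the exact message varies by version):

```python
import pandas as pd

# An object column holding real python objects (dicts here).
df = pd.DataFrame({'payload': [{'a': 1}, {'b': 2}]})

# Fixed format: writes, but with a PerformanceWarning, because the
# object block can only be pickled rather than mapped to a c-type.
df.to_hdf('fixed.h5', 'df', format='fixed')

# Table format: raises instead (a TypeError about the unserializable
# object column), since tables need uniform, serializable types.
df.to_hdf('table.h5', 'df', format='table')
```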
Thanks for the comment @jreback! The odd part is, I am using

Here's the full stack trace:
@ClimbsRocks you are in charge of your dtypes. Not really sure what your issue is.
I tried just deleting columns until it eventually wrote to file, and found that the final column I deleted before it stopped throwing that OverflowError was the JSON column mentioned above.
To follow up here: I do still think this is a bug in pandas. The column is marked as dtype object (automatically determined by pandas as a result of the sql query). The way I was eventually able to write this to file was to split this column out into its own df, and separately save that df to file. Note that I made no modifications to the data, simply saving that column into its own df. This could be caused by having quite a few dtype object columns. The one I ended up saving as its own df was simply the largest of all the dtype object columns, so it might be something as simple as just trying to save too many characters at a time. It's also worth noting for anyone trying to debug this in the future that the column I eventually saved as its own df is a JSON object. Best of luck to anyone trying to fix this!
For anyone else facing this issue, it's likely caused by a JSON field. I just ran into this issue again, and fixed it by removing a different JSON field.
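The split-it-out workaround described above might look like this (illustrative; 'json_col', the file name, and df are placeholders):

```python
import pandas as pd

# Peel the JSON column off into its own frame and store the two pieces
# under separate keys in the same file.
json_part = df[['json_col']]
rest = df.drop('json_col', axis=1)

store = pd.HDFStore('out.h5')
store.put('rest', rest, format='table')
store.put('json_part', json_part)  # fixed format; the strings get pickled
store.close()
```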
I have an example of this problem which is definitely NOT coming from writing a complex python object that can't be matched up with a ctype (which is what Jeff Reback was saying is the root of the problems above). I have a large dataset which includes a categorical variable where every level of that variable is a string. Removing that categorical variable allows .to_hdf() to function as expected; when the variable is present, it results in an error that's consistent with the various examples shown above.

pd.show_versions()
allData.info()
# problem seems to be GroupDayofWeek, a categorical variable combining two separate levels in the data
print('The problem variable is composed of str type only!')
for abit in allData['GroupDayofWeek'].unique():
print abit, type(abit)
jj = allData.copy()
del jj['GroupDayofWeek']
print('writing without GroupDayofWeek')
jj.to_hdf('testme.hdf','allData')
print('Everything OK! Now writing with GroupDayofWeek')
jj['GroupDayofWeek']=allData['GroupDayofWeek']
jj.to_hdf('testme.hdf','allData')
(Side note: like the others, I thought this originally had to do with too large a dataset, because the error implies that. It's easy to ignore a (common) warning when you've got a seemingly very different error. When there's a python object that can't be translated into a ctype, I think the code should probably be more informative when it fails... of course, that doesn't seem to be what's happening in my particular case, since str should be translatable.) Thanks in advance for taking the time to look at this!
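If the Categorical dtype itself is what the fixed-format store chokes on, one possible workaround (an assumption on my part, not verified in this thread) is to write the column as plain strings:

```python
# Losing the categorical encoding but keeping the values: convert the
# column to str before writing to the fixed-format store.
jj = allData.copy()
jj['GroupDayofWeek'] = jj['GroupDayofWeek'].astype(str)
jj.to_hdf('testme.hdf', 'allData')
```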
And on a fixed store this is not supported, therefore this is a complex object. Try using format='table'.
Jeff, thanks for the reply. But:

allData.to_hdf(outfile+'raw.hdf','allData',format='table')

Also, when you say "on a fixed store this is not supported, therefore this is a complex object", I don't understand: what isn't supported? Having a column with dtype object filled with str? Writing a dataframe where different columns have different dtypes?

To be clear: this isn't the only column of dtype object filled with str; it's just the only one that causes an error. Thanks again!
HDFstore is working much better with the latest release, but I am encountering a new error I wanted to report:
df = get_joined_data(data_prefix, data_rev_prefix, date_prefix, store_filename)
File "XXXXXX", line 739, in get_joined_data
write_dataframe("joined_{0}".format(date_prefix), df, store)
File "XXXXXXX", line 55, in write_dataframe
store[name] = df
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 218, in setitem
self.put(key, value)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 458, in put
self._write_to_group(key, value, table=table, append=append, **kwargs)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1837, in write
self.write_array('block%d_values' % i, blk.values)
File "/Library/Python/2.7/site-packages/pandas/io/pytables.py", line 1627, in write_array
vlarr.append(value)
File "/Library/Python/2.7/site-packages/tables-2.4.0-py2.7-macosx-10.8-intel.egg/tables/vlarray.py", line 480, in append
self._append(nparr, nobjects)
File "hdf5Extension.pyx", line 1499, in tables.hdf5Extension.VLArray._append (tables/hdf5Extension.c:13764)
OverflowError: value too large to convert to int
Not at all sure this is an actual pandas issue, but thought I would report it nonetheless.
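For frames this large, one mitigation worth trying (an assumption, not something confirmed in this thread) is appending in chunks to a table-format store, so no single write has to serialize one enormous block:

```python
import pandas as pd

# 'big.h5' and the chunk size are placeholders; df is the large frame.
store = pd.HDFStore('big.h5')
chunksize = 100000
for start in range(0, len(df), chunksize):
    store.append('df', df.iloc[start:start + chunksize])
store.close()
```

Note that appends require consistent dtypes across chunks, and string columns may need min_itemsize set on the first append.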