Retrieving large frames with sparse data from hdf5 - 'NoneType' object is not iterable error #2299

Vankisa · 2012-11-20T10:30:45Z

I have been using pandas within my scripts for some time now, especially to store large data sets in an easily accessible way. I stumbled upon this problem a couple of days ago and have not been able to solve it so far.

The problem is that after I store a huge data frame in an hdf5 file and later load it back, one or more columns (always object-type columns) are sometimes completely inaccessible, and accessing them raises the 'NoneType object is not iterable' error.

While the frame is in memory there are no problems, even with data sets moderately larger than the example below. It is worth mentioning that the frame contains either multiple datetime columns or multiple VMS timestamps (http://labs.hoffmanlabs.com/node/735), as well as string, char and integer columns. All non-object columns can and do have missing values.
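For readers unfamiliar with VMS timestamps: they count 100-nanosecond ticks since 17 November 1858. The sketch below shows the conversion in both directions using only the standard library. The names `VMS_EPOCH`, `to_vms_ticks` and `from_vms_ticks` are made up for illustration and appear nowhere in pandas or in the script below; unlike that script, which multiplies `total_seconds()` by 10000000 and therefore produces a float, this version stays in integer arithmetic.

```python
import datetime

# VMS time counts 100 ns ticks from this base date.
VMS_EPOCH = datetime.datetime(1858, 11, 17)
TICKS_PER_SECOND = 10000000  # one tick = 100 ns


def to_vms_ticks(moment):
    """Convert a naive datetime to integer VMS ticks (hypothetical helper)."""
    delta = moment - VMS_EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    # Integer arithmetic avoids the float rounding of total_seconds() * 1e7.
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def from_vms_ticks(ticks):
    """Convert VMS ticks back to a datetime, truncated to microseconds."""
    return VMS_EPOCH + datetime.timedelta(microseconds=ticks // 10)
```

A datetime only resolves microseconds, so `from_vms_ticks` drops the last decimal digit of the tick count.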

At first I thought I was saving 'NA' values in one of the object-type columns. Then I tried updating to the latest pandas version (0.9.1); I was using 0.9.0 when this problem first occurred. Neither turned out to be the solution.

I have been able to reproduce the error with the following code:

import pandas as pd
import numpy as np
import datetime

# Get VMS timestamps for today
time_now = datetime.datetime.today()
start_vms = datetime.datetime(1858, 11, 17)
t_delta = (time_now - start_vms)
vms_time = t_delta.total_seconds() * 10000000

# Generate Test Frame (dense)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
for i in range(2000000):
    vms_time1 += 15 * np.random.randn()
    vms_time2 += 25 * np.random.randn()
    vms_time_diff = vms_time2 - vms_time1
    string1 = 'XXXXXXXXXX'
    string2 = 'XXXXXXXXXX'
    string3 = 'XXXXX'
    string4 = 'XXXXX'
    char1 = 'A'
    char2 = 'B'
    char3 = 'C'
    char4 = 'D'
    number1 = np.random.randint(1,10)
    number2 = np.random.randint(1,100)
    number3 = np.random.randint(1,1000)
    test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))

df = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

# Generate Test Frame (sparse)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
count = 0
for i in range(2000000):
    if (count%23 == 0):
        vms_time1 += 15 * np.random.randn()
        string1 = 'XXXXXXXXXX'
        string2 = ' '
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = None
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, None, None, number2, char3, string3, None, number3, None, string4))
    else:
        vms_time1 += 15 * np.random.randn()
        vms_time2 += 25 * np.random.randn()
        vms_time_diff = vms_time2 - vms_time1
        string1 = 'XXXXXXXXXX'
        string2 = 'XXXXXXXXXX'
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = np.random.randint(1,10)
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
    count += 1

df1 = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

store_loc = "foo.h5"
h5_store = pd.HDFStore(store_loc)
# Note the key names: the dense frame (df) is stored as 'df1',
# the sparse frame (df1) as 'df2'.
h5_store['df1'] = df
h5_store['df2'] = df1
h5_store.close()

When I load from this store, 'df1' (the dense frame) behaves normally, but 'df2' (the sparse frame) produces the following error:
TypeError: 'NoneType' object is not iterable
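To narrow down which columns are affected after a reload, something along these lines may help. It is only a sketch: `find_unreadable_columns` is a hypothetical helper, not a pandas function, and it assumes the error surfaces when a column is accessed or iterated rather than when the frame itself is read from the store.

```python
def find_unreadable_columns(frame):
    """Return {column name: error text} for columns that fail to iterate.

    `frame` is anything with a `.columns` attribute and column access via
    `frame[name]`, such as a pandas DataFrame.
    """
    failures = {}
    for name in frame.columns:
        try:
            # Walk the whole column so late failures are caught as well.
            for _ in frame[name]:
                pass
        except Exception as exc:
            failures[name] = "%s: %s" % (type(exc).__name__, exc)
    return failures
```

With the store above, this would be called as `find_unreadable_columns(pd.HDFStore("foo.h5")['df2'])`; an empty dict means every column could be read.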

Additionally, I just tried to reproduce this error on pandas version 0.8.1, and it does not seem to be present there. So it is probably connected with the I/O changes introduced in 0.9.0?


ghost · 2012-11-25T22:29:33Z

There's a pending PR #2346, can you reproduce with that applied?

wesm · 2012-11-25T22:32:49Z

git bisect turned up the commit (80b5082) that caused this breakage. I think it's a bug in PyTables; I'm still digging.

wesm · 2012-11-26T01:41:30Z

Somehow this looks like a bug in PyTables. The above fix makes your test case work, we'll have to wait and see if it happens again somewhere.

wesm closed this as completed in 3bf6a46 Nov 26, 2012

z00b2008 mentioned this issue Sep 19, 2014

Error in storing large dataframe to HDFStore #2773

Closed
