Skip to content

Retrieving large frames with sparse data from hdf5 - 'NoneType' object is not iterable error #2299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Vankisa opened this issue Nov 20, 2012 · 3 comments
Milestone

Comments

@Vankisa
Copy link

Vankisa commented Nov 20, 2012

I have been using pandas within my scripts for some time now, especially to store large data sets in an easily accessible way. I have stumbled upon this problem a couple of days ago and have not been able to solve it so far.

The problem is that after I store a huge data frame into an hdf5 file, when I later load it back, it sometimes has one or more columns (only from the object type columns) completely inaccessible and returning the 'NoneType object is not iterable' error.

While I use the frame in memory there are no problems, even with moderately larger data sets than the example below. It is worth mentioning that the frame contains either multiple datetime columns or multiple VMS timestamps (http://labs.hoffmanlabs.com/node/735), as well as string and char and integer columns. All non-object columns can and do have missing values.

At first I thought I was saving 'NA' values in one of the 'object type' columns. Then I tried to update to latest pandas version (0.9.1). I was using 0.9.0 when this problem first occurred. Neither seem to be the solution.

I have been able to reproduce the error with the following code:

import pandas as pd
import numpy as np
import datetime

# Get VMS timestamps for today
time_now = datetime.datetime.today()
start_vms = datetime.datetime(1858, 11, 17)
t_delta = (time_now - start_vms)
vms_time = t_delta.total_seconds() * 10000000

# Generate Test Frame (dense)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
for i in range(2000000):
    vms_time1 += 15 * np.random.randn()
    vms_time2 += 25 * np.random.randn()
    vms_time_diff = vms_time2 - vms_time1
    string1 = 'XXXXXXXXXX'
    string2 = 'XXXXXXXXXX'
    string3 = 'XXXXX'
    string4 = 'XXXXX'
    char1 = 'A'
    char2 = 'B'
    char3 = 'C'
    char4 = 'D'
    number1 = np.random.randint(1,10)
    number2 = np.random.randint(1,100)
    number3 = np.random.randint(1,1000)
    test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))

df = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

# Generate Test Frame (sparse)
test_records = []
vms_time1 = vms_time
vms_time2 = vms_time
count = 0
for i in range(2000000):
    if (count%23 == 0):
        vms_time1 += 15 * np.random.randn()
        string1 = 'XXXXXXXXXX'
        string2 = ' '
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = None
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, None, None, number2, char3, string3, None, number3, None, string4))
    else:
        vms_time1 += 15 * np.random.randn()
        vms_time2 += 25 * np.random.randn()
        vms_time_diff = vms_time2 - vms_time1
        string1 = 'XXXXXXXXXX'
        string2 = 'XXXXXXXXXX'
        string3 = 'XXXXX'
        string4 = 'XXXXX'
        char1 = 'A'
        char2 = 'B'
        char3 = 'C'
        char4 = 'D'
        number1 = np.random.randint(1,10)
        number2 = np.random.randint(1,100)
        number3 = np.random.randint(1,1000)
        test_records.append((char1, string1, vms_time1, number1, char2, string2, vms_time2, number2, char3, string3, vms_time_diff, number3, char4, string4))
    count += 1

df1 = pd.DataFrame(test_records, columns = ["column_1", "column_2", "column_3", "column_4", "column_5", "column_6", "column_7", "column_8", "column_9", "column_10", "column_11", "column_12", "column_13", "column_14"])

store_loc = "foo.h5"
h5_store = pd.HDFStore(store_loc )
h5_store['df1'] = df
h5_store['df2'] = df1
h5_store.close()

When I try to load from this store now the 'df1' is behaving normally, but the 'df2' is producing the following error:

TypeError: 'NoneType' object is not iterable

Additionally I just tried to reproduce this error on pandas version 0.8.1. It does not seem to be present there. So it is probably connected with the I/O changes introduced in 0.9.0?

@ghost
Copy link

ghost commented Nov 25, 2012

There's a pending PR #2346, can you reproduce with that applied?

@wesm
Copy link
Member

wesm commented Nov 25, 2012

git bisect turned up the commit (80b5082) that caused this breakage-- I think it's a bug in PyTables, I'm still digging.

@wesm wesm closed this as completed in 3bf6a46 Nov 26, 2012
@wesm
Copy link
Member

wesm commented Nov 26, 2012

Somehow this looks like a bug in PyTables. The above fix makes your test case work, we'll have to wait and see if it happens again somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants