MultiIndexes and large CSV files #4516
Here's how to figure out the problem. With a big file like this it's best to read it in chunks and set the index at the end.
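A minimal sketch of that chunked read, assuming a hypothetical file `sales.csv`; salesperson, customer, and price come up later in the thread, the other column names are placeholders:

```python
import pandas as pd

# Hypothetical column names: salesperson, customer, and price are mentioned
# in the thread; item and date are placeholders for the remaining key columns.
key_cols = ['salesperson', 'customer', 'item', 'date']

# Read the big CSV in manageable chunks and concatenate; the index is set later.
reader = pd.read_csv('sales.csv', chunksize=100000)
df = pd.concat(reader, ignore_index=True)
```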
Read from the iterator; you have duplicate values, and this will show you where they are:
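Continuing the sketch above, one way to see the duplicated keys:

```python
# keep=False marks every member of each duplicated group, so you can see
# exactly which key combinations repeat and where.
dup_mask = df.duplicated(subset=key_cols, keep=False)
print(df[dup_mask].sort_values(key_cols))
```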
The above is for you to inspect, but here's how to create a non-duplicated frame:
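Still the same sketch, keeping the first occurrence of each key:

```python
# Keep only the first row for each key combination.
df = df.drop_duplicates(subset=key_cols, keep='first')
```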
Set the index at the very end
It actually is not sorted, but that doesn't matter (though you probably DO want to sort it):
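Setting and sorting the index at the end (same sketch; a sorted MultiIndex is what makes partial-key `.loc` lookups fast):

```python
# Build the MultiIndex only after reading and de-duplicating, then sort it.
df = df.set_index(key_cols).sort_index()
```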
Running the above script
Thanks, Jeff. I hate it when it's pilot error. On Thu, Aug 8, 2013 at 3:39 PM, jreback [email protected] wrote:
@DanSears this actually is a bug; it has to do with how a frame with a multi-index is reindexed when there are multiple dtypes in the data (e.g. in the example below there are floats and ints). This is not implemented, but not too hard. Thanks for the report. In your case, depending on how you are going to access the data, I actually wouldn't set an index at all. What are you going to do with this data, e.g. are you going to simply look up fields like you posted?
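For instance, without any index that kind of field lookup is just boolean selection on columns (a sketch; 'A00000' appears later in the thread, 'C00017' is a made-up key):

```python
# Plain boolean selection on a frame with no index set; column names and the
# customer key are assumptions for illustration.
hit = df[(df['salesperson'] == 'A00000') & (df['customer'] == 'C00017')]
print(hit)
```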
@DanSears If you'd like to give it a try with #4522, this fixes the issue.
Right now I'm just learning pandas, and MultiIndexing looks pretty useful. Thanks for looking into this. I'll definitely test anything you come up with. --Dan On Fri, Aug 9, 2013 at 6:12 AM, jreback [email protected] wrote:
@DanSears having dups in a multi-index is ok, but it really depends on what you are doing with it. Try things like this:
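For example (a sketch, assuming the frame and hypothetical level names from the earlier snippets; the key values are made up):

```python
# With duplicate keys, label lookups simply return every matching row.
print(df.loc['A00000'])                    # all rows for one salesperson
print(df.loc[('A00000', 'C00017')])        # one salesperson/customer pair
print(df.xs('C00017', level='customer'))   # cross-section on an inner level

# How many index entries are exact duplicates of an earlier one.
print(df.index.duplicated().sum())
```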
Sorry for the delay, but I've been testing your fix and here are my results. I updated my test script and data generation script, and I'm pretty sure the data is sorted correctly :). The test script works fine for smaller datasets, but if I increase the dataset size I get:
You can make smaller datasets by chopping up the CSV file with something like "head -1000000". Actually, it blows up between 999999 and 1000000 lines! Here's load-sales.py
and here's the test data generation script:
the problem is the price field is a float (so the frame has mixed dtypes)
Thanks. --Dan On Fri, Aug 16, 2013 at 5:22 AM, jreback [email protected] wrote:
I converted the price column to an integer (x100) and I'm still getting the same result. That is, for data files below 1 million lines it works. But above 1M I get:
change to load-sales.py:
change to sales-gen.sh:
the salesperson/customer fields are now integers? A00000, A are not there in your file at all?
Here's a couple more tips. Access the data like this (which is in effect a short-cut for the type of lookup you are doing):
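Something along these lines (a sketch; level names and key values are the hypothetical ones used earlier):

```python
# Passing a full key tuple plus a column name is the short-cut: it goes
# straight to the matching value(s). All key values here are made up.
print(df.loc[('A00000', 'C00017', 'I00003', '2013-08-01'), 'price'])
```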
Now just boolean search (to make sure that the key is in fact there); you can search with one or more keys:
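A sketch of that boolean search against the index levels (key values are made up):

```python
# Check whether one or more keys are actually present, then select with the
# resulting boolean mask.
mask = df.index.get_level_values('salesperson').isin(['A00000', 'A00001'])
print(mask.any())    # is at least one of the keys there?
print(df[mask])      # the matching rows
```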
Unique values in a column:
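For example (same hypothetical names):

```python
# Unique values, either from an index level or from an ordinary column.
print(df.index.get_level_values('customer').unique())
print(df['price'].unique())
```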
You're right, I had copied the wrong snippet. The following query works:
Thanks for your help with this! --Dan On Sat, Aug 17, 2013 at 7:16 AM, jreback [email protected] wrote:
I've run into a data-dependent bug with MultiIndex. Attached are a test script and a data file generation script:
My problem is that sales-data.py breaks with large CSV files but works with smaller CSV files. When it fails it displays the following error message:
I know that MultiIndexes must be pre-sorted. So sales-gen.sh makes sure that the first four columns in the auto-generated CSV files are ordered. BTW, I'm not sure why lexsort is getting called.
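For reference, checking whether a MultiIndex is actually sorted and sorting it looks like this (a sketch, assuming a frame df with the four key columns already set as the index):

```python
# If this is False, label-based slicing may need to lexsort the index
# internally, which is one reason lexsort can get called during lookups.
print(df.index.is_monotonic_increasing)

# Sorting the index once up front avoids that path entirely.
df = df.sort_index()
```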
sales-data.py
sales-gen.sh