MultiIndexes and large CSV files #4516
Here's how to figure out the problem. With a big file like this it's best to read it in chunks and set the index at the end.
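A minimal sketch of that chunked read, assuming a hypothetical file `sales.csv`; salesperson, customer, and price come up later in the thread, the other column names are placeholders:

```python
import pandas as pd

# Hypothetical column names: salesperson, customer, and price are mentioned
# in the thread; item and date are placeholders for the remaining key columns.
key_cols = ['salesperson', 'customer', 'item', 'date']

# Read the big CSV in manageable chunks and concatenate; the index is set later.
reader = pd.read_csv('sales.csv', chunksize=100000)
df = pd.concat(reader, ignore_index=True)
```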
Read from the iterator; you have duplicate values, and this will show you where they are:
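Continuing the sketch above, one way to see the duplicated keys:

```python
# keep=False marks every member of each duplicated group, so you can see
# exactly which key combinations repeat and where.
dup_mask = df.duplicated(subset=key_cols, keep=False)
print(df[dup_mask].sort_values(key_cols))
```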
The above is for you to inspect, but here's how to create a non-duplicated frame:
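Still the same sketch, keeping the first occurrence of each key:

```python
# Keep only the first row for each key combination.
df = df.drop_duplicates(subset=key_cols, keep='first')
```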
Set the index at the very end
It actually is not sorted, but that doesn't matter (though you probably DO want to sort it):
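Setting and sorting the index at the end (same sketch; a sorted MultiIndex is what makes partial-key `.loc` lookups fast):

```python
# Build the MultiIndex only after reading and de-duplicating, then sort it.
df = df.set_index(key_cols).sort_index()
```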
Running the above script
Thanks, Jeff. I hate it when it's pilot error. On Thu, Aug 8, 2013 at 3:39 PM, jreback [email protected] wrote:
@DanSears this actually is a bug; it has to do with how a frame with a multi-index is reindexed when there are multiple dtypes in the data (e.g. in the example below there are floats and ints). This is not implemented, but not too hard. Thanks for the report. In your case, depending on how you are going to access the data, I actually wouldn't set an index at all. What are you going to do with this data, e.g. are you going to simply look up fields like you posted?
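For instance, without any index that kind of field lookup is just boolean selection on columns (a sketch; 'A00000' appears later in the thread, 'C00017' is a made-up key):

```python
# Plain boolean selection on a frame with no index set; column names and the
# customer key are assumptions for illustration.
hit = df[(df['salesperson'] == 'A00000') & (df['customer'] == 'C00017')]
print(hit)
```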
@DanSears If you'd like to give it a try with #4522, this fixes the issue.
Right now I'm just learning pandas, and MultiIndexing looks pretty useful. Thanks for looking into this. I'll definitely test anything you come up with. --Dan On Fri, Aug 9, 2013 at 6:12 AM, jreback [email protected] wrote:
@DanSears having dups in a multi-index is ok, but it really depends on what you are doing with it. Try things like this:
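For example (a sketch, assuming the frame and hypothetical level names from the earlier snippets; the key values are made up):

```python
# With duplicate keys, label lookups simply return every matching row.
print(df.loc['A00000'])                    # all rows for one salesperson
print(df.loc[('A00000', 'C00017')])        # one salesperson/customer pair
print(df.xs('C00017', level='customer'))   # cross-section on an inner level

# How many index entries are exact duplicates of an earlier one.
print(df.index.duplicated().sum())
```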
Sorry for the delay, but I've been testing your fix and here are my results. I updated my test script and data generation script, and I'm pretty sure the data is sorted correctly :). The test script works fine for smaller datasets, but if I increase the dataset size I get:
You can make smaller datasets by chopping up the CSV file with something like "head -1000000". Actually, it blows up between 999999 and 1000000 lines! Here's load-sales.py
and here's the test data generation script:
the problem is the price field is a float (so the frame has mixed dtypes)
Thanks. --Dan On Fri, Aug 16, 2013 at 5:22 AM, jreback [email protected] wrote:
I converted the price column to an integer (x100) and I'm still getting the same result. That is, for data files below 1 million lines it works. But above 1M I get:
change to load-sales.py:
change to sales-gen.sh:
the salesperson/customer fields are now integers? A00000, A are not there in your file at all?
Here's a couple more tips. Access the data like this (which is in effect a short-cut for the type of lookup you are doing):
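Something along these lines (a sketch; level names and key values are the hypothetical ones used earlier):

```python
# Passing a full key tuple plus a column name is the short-cut: it goes
# straight to the matching value(s). All key values here are made up.
print(df.loc[('A00000', 'C00017', 'I00003', '2013-08-01'), 'price'])
```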
Now just boolean search (to make sure that the key is in fact there); you can search with one or more keys:
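A sketch of that boolean search against the index levels (key values are made up):

```python
# Check whether one or more keys are actually present, then select with the
# resulting boolean mask.
mask = df.index.get_level_values('salesperson').isin(['A00000', 'A00001'])
print(mask.any())    # is at least one of the keys there?
print(df[mask])      # the matching rows
```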
Unique values in a column:
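For example (same hypothetical names):

```python
# Unique values, either from an index level or from an ordinary column.
print(df.index.get_level_values('customer').unique())
print(df['price'].unique())
```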
You're right, I had copied the wrong snippet. The following query works:
Thanks for your help with this! --Dan On Sat, Aug 17, 2013 at 7:16 AM, jreback [email protected] wrote:
I've run into a data-dependent bug with MultiIndex. Attached are a test script and a data file generation script:
My problem is that sales-data.py breaks with large CSV files but works with smaller CSV files. When it fails it displays the following error message:
I know that MultiIndexes must be pre-sorted. So sales-gen.sh makes sure that the first four columns in the auto-generated CSV files are ordered. BTW, I'm not sure why lexsort is getting called.
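For reference, checking whether a MultiIndex is actually sorted and sorting it looks like this (a sketch, assuming a frame df with the four key columns already set as the index):

```python
# If this is False, label-based slicing may need to lexsort the index
# internally, which is one reason lexsort can get called during lookups.
print(df.index.is_monotonic_increasing)

# Sorting the index once up front avoids that path entirely.
df = df.sort_index()
```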
sales-data.py
sales-gen.sh