BUG: max rows in read_csv #14131
I have a 5e9-row csv file that I'm trying to load with pandas, which appears to load it only partially and produces a 2852516352-row data frame. Expected behaviour would be either a properly read csv or an error, which would save ~100 minutes of processing in my case.

my pd.show_versions():
Comments
what is the error?
There is no error at all. I get a 2852516352-row data frame where I should get a 5e9-row data frame.
how do you know it contains 5e9 rows?
how does it parse a small chunk? (pass nrows=10)
I assume you have 200 GB available (you generally need 2x memory to process, assuming 2 int64s per row); see the arithmetic sketched below.
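A worked version of that memory estimate (my gloss; the numbers are from the thread, the code is illustrative):

rows, cols, int64_bytes = 5_000_000_000, 2, 8
raw = rows * cols * int64_bytes        # 80 GB just to hold the two int64 columns
needed = 2 * raw                       # ~160 GB with the ~2x processing overhead
print(raw / 1e9, needed / 1e9)         # 80.0 160.0 (GB), hence the 200 GB figure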
I was able to load that file in other tools and they produce a 5e9-row dataset.

x = pd.read_csv("X5e9_2c.csv", nrows=10)
x
#           KEY          X2
# 0  2632858426 -4008534609
# 1  3176486913 -3394302982
# 2  3235589527 -4982919314
# 3  1017229071  3793469468
# 4  1039519144  2740983791
# 5  2504311956  -364044414
# 6    43002703 -1822527251
# 7  2984147242 -1297084189
# 8  2604109368 -2965381672
# 9   178979971 -4855058881
looks like an int32 overflow, if I had to guess
show .dtypes
cc @gfyoung
x.dtypes
# KEY    int64
# X2     int64
# dtype: object
so seems suspicious.
as a work-around, I suspect reading via chunks would work just fine, e.g. the sketch below.
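A minimal sketch of that chunked work-around (the file name and chunk size are my assumptions; note that a later comment in this thread reports naive chunking did not actually avoid the overflow):

import pandas as pd

pieces = []
for chunk in pd.read_csv("X5e9_2c.csv", chunksize=100_000_000):
    pieces.append(chunk)                      # each chunk stays well under the 32-bit row limit
df = pd.concat(pieces, ignore_index=True)
print(len(df))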
if you would like to try something: https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L1341. Almost everything there is a C int, which is 32-bit, so when this hits its max it rolls over, I think (and becomes negative). So a soln (and prob not a big deal) is to change those counters; it may be more appropriate to use a 64-bit type such as int64_t. If you want to give that a try and see, that would be great.
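For illustration, a minimal sketch of the suspected wraparound, using numpy's fixed-width integers to mimic a C int (an analogy, not the parser's actual code):

import numpy as np

rows = np.int32(2**31 - 1)            # INT32_MAX, the largest value a C int can hold
with np.errstate(over="ignore"):      # numpy warns on integer overflow; C just wraps silently
    rows = rows + np.int32(1)
print(rows)                           # -2147483648: the row counter has gone negative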
@jreback : I think your prognosis is correct. @jangorecki : does replacing the 32-bit int counters with a 64-bit type fix the read for you?
iirc an int on a 64-bit machine is still generally a 32-bit int
as an aside @jangorecki you seem to be getting only 1000 MB/min; see the back-of-the-envelope below
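My reconstruction of that throughput estimate (the row width is taken from the sample output above; the ~100 minutes is from the original report):

row_bytes = len("2632858426,-4008534609\n")   # ~23 bytes per row in the sample
file_mb = 5e9 * row_bytes / 1e6               # ~115,000 MB for the whole file
print(file_mb / 100)                          # ~1150 MB/min over ~100 minutes of reading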
@jreback : Perhaps...
@jreback
Just my two cents: you could use Dask DataFrames, which handle data in pandas DataFrame chunks and can parallelize operations and perform out-of-core computations (see the sketch below).
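A minimal sketch of the Dask route (the file name is carried over from earlier in the thread; the usage is illustrative, not from the original comment):

import dask.dataframe as dd

ddf = dd.read_csv("X5e9_2c.csv")   # lazy, partitioned frame; nothing is read yet
print(len(ddf))                    # row count computed partition by partition, out of core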
@frol this doesn't help if the user needs an in-memory pandas.DataFrame. Certain operations can indeed be performed out of core / in parallel with Dask, though.
I also recently hit this bug: I had a large (3e9-row) CSV that mysteriously ended up with a much smaller shape after read_csv.
You're the big winner :)
FWIW, the naive chunked approach doesn't seem to overcome the overflow: in a snippet like the one sketched below, the final chunk comes up short. FYI I'm using a gzipped csv in this analysis.
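A hedged reconstruction of that kind of snippet (the original code and its output were not preserved; the file name and chunk size here are assumptions):

import pandas as pd

total = 0
for i, chunk in enumerate(pd.read_csv("big_3e9_rows.csv.gz", chunksize=500_000_000)):
    print(i, chunk.shape)          # the reported symptom: the last chunk's row count comes up short
    total += len(chunk)
print("rows seen:", total)         # far fewer than the rows actually in the file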
xref #16798 for a related problem