ENH: groupby.max() should not cast int to int64 but keep original data type #42275
Comments
Thanks for reporting this @rd-andreas-lay! This happens because our groupby algorithms only support specific types, so we need to cast to one which is supported. Wouldn't be hard to support more types for
Hi @mzeitlin11! I'd like to contribute to this issue in pandas. Do you want to add support for int8? Can I work on it?
@arubiales that would be great!
Thanks @mzeitlin11, I will go for it! Any useful information would be appreciated, for example the pandas module or files where this lives, and other things to consider.
This is a pretty complicated issue, so there are a lot of things to consider :), but please reach out if you'd like any help:
Yes, I know that it will take time, but I have strong knowledge of C and Cython, so I think that with time I will manage it. Thank you for the info, I'm going to review it and get an overall idea of how everything is connected.
@mzeitlin11 @rd-andreas-lay Sorry, but I'm trying to reproduce the data type change with a minimal reproducible example and I can't, so I'm missing something here. I'm trying the following
Output:
@arubiales In my understanding the final data type is recast to the original data type later on; the conversion to float is just intermediate (but it can still cause memory allocation errors: in my example an increase from 10GB to 70GB). I'd have to run an example through the debugger though to see where the recasting to int8 happens. If you check your memory consumption while running the example on a larger dataframe, you should see an increase in memory during processing; the final result will be smaller again due to the recasting to int8. Basically an inverted V shape in memory usage.
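One way to check this without a debugger is to compare the dtype of the result with the peak memory allocated during the call. The snippet below is only a sketch under assumptions: the frame size, the column names (`c0`, `c1`, ..., `key`) and the number of groups are made up for illustration, and `tracemalloc` only sees allocations that NumPy reports to it, so the peak figure is indicative rather than exact.

```python
import tracemalloc

import numpy as np
import pandas as pd

# Small stand-in for the large 0/1 int8 matrix from the report.
# Shape, column names and group count are arbitrary choices.
rng = np.random.default_rng(0)
values = rng.integers(0, 2, size=(100_000, 20), dtype=np.int8)
df = pd.DataFrame(values, columns=[f"c{i}" for i in range(values.shape[1])])
df["key"] = rng.integers(0, 100, size=len(df))

# Track the peak memory allocated while the groupby runs.
tracemalloc.start()
result = df.groupby("key").max()
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The result dtype alone does not reveal an intermediate upcast;
# the peak allocation relative to the input size is the thing to watch.
print(result.dtypes.unique())
print(f"input data: {values.nbytes / 2**20:.1f} MiB")
print(f"peak during groupby: {peak_bytes / 2**20:.1f} MiB")
```

If the intermediate cast happens, the peak should be several times the input size even though `result` comes back as int8.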
Closed by #46745
Is your feature request related to a problem?
In pandas version 1.2.5, using groupby.max() on a large matrix of int8 0/1 values, pandas casts the dataframe to int64, resulting in
MemoryError: Unable to allocate 76.4 GiB for an array with shape (1915674, 5356) and data type int64
Traceback:
Describe the solution you'd like
Keep the original datatype, in this case int8.
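For reference, the size in the error message follows directly from the shape and the element width. The arithmetic below is a small sketch using only the shape quoted in the `MemoryError`; the helper name `array_gib` and the `ITEMSIZE` table are illustrative, not part of pandas.

```python
from math import prod

# Shape taken from the MemoryError message in this report.
shape = (1915674, 5356)

# Bytes per element for the two dtypes discussed here.
ITEMSIZE = {"int8": 1, "int64": 8}


def array_gib(shape, dtype):
    """Size in GiB of a dense array with the given shape and dtype."""
    return prod(shape) * ITEMSIZE[dtype] / 2**30


print(f"int64: {array_gib(shape, 'int64'):.1f} GiB")  # 76.4 GiB
print(f"int8:  {array_gib(shape, 'int8'):.1f} GiB")   # 9.6 GiB
```

So the same data fits in roughly 9.6 GiB as int8, and the cast to int64 multiplies that by eight.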