ENH: sum() should default to numeric_only=True #58234
Comments
You may want to read through #46072. Do you have any data that backs this up?
Wouldn't it have made more sense to treat timedelta as numeric, rather than concatenating every string in a dataframe when sum() is invoked?
Thanks for raising this issue!
Can you provide an example that causes your notebook to crash? When users call sum() on a DataFrame with columns "A" and "B", I think it's reasonable for them to expect the result to contain two rows indexed by "A" and "B". This would not be the case if we used numeric_only=True by default. Requiring users to pass numeric_only=True explicitly makes it very clear that columns may be dropped. In general, I am against the silent dropping of data, and I think that applies here.
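The inline code snippets in this comment did not survive the page capture; here is a hedged sketch of the behavior being described, with assumed data (only the column names "A" and "B" come from the comment):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# Current default: the result is a Series with one row per column,
# including the object-dtype column "B" (its strings are concatenated).
print(df.sum())

# With numeric_only=True, column "B" is silently dropped from the result.
print(df.sum(numeric_only=True))
```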
Concatenating strings is a very time-consuming process, and in VS Code, terminating these processes often causes problems with the kernel.
This is probably a good generalized idea, but aside from the bug thread that this issue helps resolve (which could be resolved more efficiently by treating timedeltas as numeric), what is the exact use case for concatenating all the strings in a dataframe? Who would ever want that? Imagine you're looking at a dataframe with a million rows. When would I ever need to concatenate all the strings in the 'City', 'Employee', or 'Department' field into a multimillion-character string with no separator? Whereas in the past .sum() would have quickly reported the total of the summable values in such a dataframe, it now attempts an unexpected concatenation that can take multiple days to execute. Why not just return a TypeError here?
I've used it to condense a DataFrame into a summary, e.g.
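The commenter's actual example wasn't captured; a hypothetical sketch of using sum() over string columns to condense a frame into a one-row summary:

```python
import pandas as pd

df = pd.DataFrame({"tags": ["a", "b", "c"], "count": [1, 2, 3]})

# sum() over every column collapses the frame into a single summary Series:
# string columns are concatenated, numeric columns are totalled.
summary = df.sum()
print(summary["tags"])   # "abc"
print(summary["count"])  # 6
```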
Currently by default, pandas stores strings as a NumPy object dtype. I do not think we should differ in behavior from NumPy here.
In addition, I believe there is no way for us to know that a column of dtype object contains all strings without inspecting every value. On the other hand, with a dedicated string dtype (e.g. StringDtype), the contents are known up front.
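To illustrate the point (a sketch, not from the original comment): an object-dtype column can hold anything, so pandas cannot know it is all strings without scanning it, whereas a dedicated string dtype declares this up front.

```python
import pandas as pd

# object dtype: the dtype alone says nothing about the contents
mixed = pd.Series([1, "x", 3.5], dtype=object)

# dedicated string dtype: contents are guaranteed to be strings (or NA)
strings = pd.Series(["x", "y"], dtype="string")

print(mixed.dtype)    # object
print(strings.dtype)  # string
```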
Yeah, agreed that a default that automatically drops columns would be surprising and not a great user experience. Thanks for the suggestion, but closing.
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
For exploratory analysis, having .sum() apply to all fields in a groupby (including text fields) is super annoying for the core community. Right now, the existing behavior concatenates every string field outside of the groupby keys. Not only is this an edge case (under what circumstances would you want to concatenate every single string value in a text field with no separator?), it is also super computationally expensive. Many IDEs for Jupyter notebooks crash when .sum() is used this way, which forces the user to restart their Jupyter notebook instance.
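A sketch of the behavior described above (the column names and data are illustrative, not from the issue):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "city": ["NY", "LA", "NY"],
    "sales": [10, 20, 30],
})

# Current behavior: the string column is concatenated per group.
out = df.groupby("date").sum()
print(out.loc["d1", "city"])  # "NYLA"

# The default this issue requests must currently be opted into explicitly:
out_numeric = df.groupby("date").sum(numeric_only=True)
print(list(out_numeric.columns))  # ["sales"]
```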
Is it more "pythonic"? Arguably. Sure, the + operator concatenates strings in core Python, but "sum" is a human word that means a total of numeric quantities. A sum is not two strings concatenated.
Perhaps this could be reverted to help the majority of analysts/data scientists who just want a convenient tool that works as humans expect, rather than an idealistic pure-Python implementation that exists just to satisfy a textbook.
Feature Description
Make numeric_only=True the default setting for .sum().
Alternative Solutions
Remove numeric_only as an option for .sum() and create a string_concat option, or require users to use a lambda function for string concatenation.
e.g.: df.groupby(['date']).string_concat(separator=',')
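Note that string_concat above is a proposed (hypothetical) method that does not exist in pandas; the lambda/explicit-aggregation route, however, already works today. A sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["d1", "d1", "d2"],
    "city": ["NY", "LA", "SF"],
    "sales": [10, 20, 30],
})

# Explicit aggregation: total the numeric column, join strings with a separator.
out = df.groupby("date").agg({"sales": "sum", "city": ",".join})
print(out.loc["d1", "city"])  # "NY,LA"
```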
Additional Context
No response