Skip to content

BUG / ENH: Implement groupby_helper funcs for int #16676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gfyoung opened this issue Jun 12, 2017 · 4 comments · Fixed by #54263
Closed

BUG / ENH: Implement groupby_helper funcs for int #16676

gfyoung opened this issue Jun 12, 2017 · 4 comments · Fixed by #54263
Labels
API - Consistency Internal Consistency of API/Behavior Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@gfyoung
Copy link
Member

gfyoung commented Jun 12, 2017

From #15091 (at 028188):

from pandas import DataFrame
df = DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])
df.pivot_table(index='index', columns='type', values='value')
type                   2
index                   
1      11111111111111112

When we pivot, we have to aggregate the values we group together by the index and columns parameters. When we specify mean as the aggregator, we eventually get around to calling _get_cython_function, which searches for an implemention of mean in groupby.pyx for integers. However, groupby_helper.pxi only defines them for floats, so the data is then cast to float for aggregating before being reconverted back to int in the final result, leading to the mysterious increment due to rounding.

Had there been an implementation of mean for int, then this wouldn't happen. However, implementing mean for int isn't straightforward because we can't guarantee returning int as the float implementations can't guarantee returning float without losing precision (which is the contract in the groupby_helper.pxi template).

@gfyoung gfyoung changed the title BUG/ENH: Implement groupby_helper funcs for int BUG / ENH: Implement groupby_helper funcs for int Jun 12, 2017
@jreback
Copy link
Contributor

jreback commented Jun 12, 2017

so this is actually straightforward. we add a groupby_mean that accepts int, but returns float.

@jreback jreback added this to the Next Major Release milestone Jun 12, 2017
@gfyoung
Copy link
Member Author

gfyoung commented Oct 17, 2017

I wish it was that simple, though even when implementing those functions, I still see rounding issues, as can be seen from an attempted patch off of 5bf7f9:

group-mean-patch.patch

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Copy link
Member

im not clear on what the ask is here. @gfyoung what would the suggested helper(s) be doing?

@rhshadrach
Copy link
Member

In the case the input is an integer (is_integer_dtype) and the output is not, we downcast. I think instead, users should be expecting mean to always produce floats. This downcasting is done for any op, and other ops (e.g. a UDF) that produces floats could produce unexpected results. Plus, this is value-dependent behavior.

Removing the downcasting leads to 26 failed test. I haven't looked into these yet to see if there are any surprises, or if they are all just because of the dtype change for mean.

Marking this as a good first issue for investigation. The downcasting to remove is done in the code below. Once removed, try fixing up the tests and see if there are any cases where not downcasting leads to unexpected / undesirable results.

# gh-21133
# we want to down cast if
# the original values are ints
# as we grouped with a NaN value
# and then dropped, coercing to floats
for v in values:
if (
v in data
and is_integer_dtype(data[v])
and v in agged
and not is_integer_dtype(agged[v])
):
if not isinstance(agged[v], ABCDataFrame) and isinstance(
data[v].dtype, np.dtype
):
# exclude DataFrame case bc maybe_downcast_to_dtype expects
# ArrayLike
# e.g. test_pivot_table_multiindex_columns_doctest_case
# agged.columns is a MultiIndex and 'v' is indexing only
# on its first level.
agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Groupby API - Consistency Internal Consistency of API/Behavior and removed Groupby labels Jul 28, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Bug Dtype Conversions Unexpected or buggy dtype conversions good first issue Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants