BUG / ENH: Implement groupby_helper funcs for int #16676

gfyoung · 2017-06-12T06:03:10Z

>>> from pandas import DataFrame
>>> df = DataFrame([[1, 2, 11111111111111111]], columns=['index', 'type', 'value'])
>>> df.pivot_table(index='index', columns='type', values='value')
type                   2
index
1      11111111111111112

When we pivot, we have to aggregate the values that are grouped together by the index and columns parameters. With mean as the aggregator (the default), we eventually call _get_cython_function, which searches groupby.pyx for an implementation of mean for integers. However, groupby_helper.pxi only defines mean for floats, so the data is cast to float for aggregation and then converted back to int in the final result. The round trip through float is what causes the mysterious increment: 11111111111111111 is not exactly representable as a float64 and rounds to 11111111111111112.

Had there been an implementation of mean for int, this wouldn't happen. However, implementing mean for int isn't straightforward: the mean of integers is not always an integer, so we can't guarantee returning int without losing precision, whereas the float implementations can always return float (returning the input dtype is the contract in the groupby_helper.pxi template).
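The rounding described above can be reproduced without pandas. The snippet below is only a sketch of the int -> float -> int round trip, assuming the aggregation path behaves like a plain cast; Python's float is an IEEE 754 double, the same format as float64.

```python
# Python floats are IEEE 754 doubles, the same format as NumPy's float64.
value = 11111111111111111

# Integers up to 2**53 are exactly representable as doubles; just above
# that threshold only even integers are, so odd values must be rounded.
assert value > 2**53

as_float = float(value)        # int -> float, as in the float-only aggregation path
round_tripped = int(as_float)  # float -> int, as in the final cast back

print(round_tripped)           # 11111111111111112
print(round_tripped - value)   # 1
```

The value sits exactly halfway between two representable doubles, and round-half-to-even picks the upper neighbour, which is the number shown in the pivot output.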

The text was updated successfully, but these errors were encountered:

jreback · 2017-06-12T10:57:37Z

so this is actually straightforward. we add a groupby_mean that accepts int, but returns float.
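As a rough illustration of that suggestion, here is a pure-Python sketch of a group mean that keeps integer sums and only produces a float at the final division. The name group_mean_int and its signature are hypothetical; this is not the Cython template in groupby_helper.pxi.

```python
from collections import defaultdict


def group_mean_int(values, labels):
    """Per-label mean of integer inputs: integer accumulation, float result."""
    sums = defaultdict(int)    # exact integer sums, no intermediate float cast
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[label] += value
        counts[label] += 1
    # True division converts to float only once, at the very end.
    return {label: sums[label] / counts[label] for label in sums}


means = group_mean_int([11111111111111111, 1, 3], ["a", "b", "b"])
print(means["b"])       # 2.0
print(int(means["a"]))  # 11111111111111112
```

Note that the float result still cannot hold every large integer exactly, so casting it back to int afterwards shows the same off-by-one for group "a".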

gfyoung · 2017-10-17T07:21:36Z

I wish it were that simple. Even with those functions implemented, I still see rounding issues, as can be seen from an attempted patch off of 5bf7f9:

group-mean-patch.patch

jbrockmendel · 2023-02-07T22:11:11Z

I'm not clear on what the ask is here. @gfyoung what would the suggested helper(s) be doing?

rhshadrach · 2023-07-15T17:08:06Z

When the input is an integer dtype (is_integer_dtype) and the output is not, we downcast. I think users should instead expect mean to always produce floats. This downcasting is done for any op, so other ops that produce floats (e.g. a UDF) could give unexpected results. Plus, this is value-dependent behavior.

Removing the downcasting leads to 26 failed tests. I haven't looked into these yet to see if there are any surprises, or if they are all just because of the dtype change for mean.
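To make "value-dependent" concrete, here is a small sketch. The helper downcast_if_lossless is a hypothetical stand-in written for illustration; it is not pandas' maybe_downcast_to_dtype and only assumes that the downcast happens when every float survives a round trip through int.

```python
def downcast_if_lossless(values):
    """Return ints only if every float round-trips exactly; otherwise keep floats."""
    as_ints = [int(v) for v in values]
    if all(float(i) == v for i, v in zip(as_ints, values)):
        return as_ints
    return values


# Same operation (a mean), different result types depending on the data:
print(downcast_if_lossless([2.0, 4.0]))  # [2, 4]
print(downcast_if_lossless([2.0, 4.5]))  # [2.0, 4.5]

# A float that already absorbed a rounding error still "round-trips",
# so the downcast presents the rounded value as if it were an exact int.
print(downcast_if_lossless([11111111111111112.0]))  # [11111111111111112]
```

Under this assumption the output dtype depends on the values rather than on the input dtype and the operation, which is the inconsistency described above.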

Marking this as a good first issue for investigation. The downcasting to remove is done in the code below. Once removed, try fixing up the tests and see if there are any cases where not downcasting leads to unexpected / undesirable results.

pandas/pandas/core/reshape/pivot.py

Lines 175 to 195 in db27c36

    # gh-21133
    # we want to down cast if
    # the original values are ints
    # as we grouped with a NaN value
    # and then dropped, coercing to floats
    for v in values:
        if (
            v in data
            and is_integer_dtype(data[v])
            and v in agged
            and not is_integer_dtype(agged[v])
        ):
            if not isinstance(agged[v], ABCDataFrame) and isinstance(
                data[v].dtype, np.dtype
            ):
                # exclude DataFrame case bc maybe_downcast_to_dtype expects
                #  ArrayLike
                # e.g. test_pivot_table_multiindex_columns_doctest_case
                #  agged.columns is a MultiIndex and 'v' is indexing only
                #  on its first level.
                agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)

gfyoung changed the title ~~BUG/ENH: Implement groupby_helper funcs for int~~ BUG / ENH: Implement groupby_helper funcs for int Jun 12, 2017

jreback added Bug Difficulty Intermediate Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Jun 12, 2017

jreback added this to the Next Major Release milestone Jun 12, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach added the good first issue label Jul 15, 2023

austinauyeung mentioned this issue Jul 26, 2023

BUG: pivot_table mean of integer input casted back to int #54263

Merged


rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Groupby API - Consistency Internal Consistency of API/Behavior and removed Groupby labels Jul 28, 2023

rhshadrach added this to the 2.1 milestone Jul 28, 2023

rhshadrach closed this as completed in #54263 Aug 2, 2023
