Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003

brandonkane · 2013-06-23T13:09:33Z

See http://stackoverflow.com/questions/17236852/pandas-crosstab-double-counting-when-using-two-aggregate-functions for discussion.

To reproduce:
Create a test dataframe:

df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6, 'B': ['A', 'B', 'C'] * 8, 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, 'D': np.random.randn(24), 'E': np.random.randn(24)})

Crosstab gives the expected results:

crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True)
C        bar  foo  All
A     B               
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24

However, if you try to get mean and count in the same crosstab, specifying values for the mean, the crosstab will double count the elements when calculating the margin total:

crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])

         size                mean                    
C         bar  foo  All       bar       foo       All
A     B                                              
one   A     2    2    4  0.245998  0.076366  0.161182
      B     2    2    4 -0.739757  0.137780 -0.300988
      C     2    2    4 -1.555759 -1.446554 -1.501157
three A     2  NaN    2  1.216109       NaN  1.216109
      B   NaN    2    2       NaN  0.255482  0.255482
      C     2  NaN    2  0.732448       NaN  0.732448
two   A   NaN    2    2       NaN -0.273747 -0.273747
      B     2  NaN    2 -0.001649       NaN -0.001649
      C   NaN    2    2       NaN  0.685422  0.685422
All        24   24   24 -0.017102 -0.094208 -0.055655

TomAugspurger · 2013-06-23T14:19:32Z

Just a note. The problem isn't that two functions are being passed. It's something to do with np.size being aggregated in a non-obvious way.

In [32]: expected = crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, values=df[
                             'D'], aggfunc=len)

In [33]: expected
Out[33]: 
         C        bar  foo  All
         A     B               
         one   A    2    2    4
               B    2    2    4
               C    2    2    4
         three A    2  NaN    2
               B  NaN    2    2
               C    2  NaN    2
         two   A  NaN    2    2
               B    2  NaN    2
               C  NaN    2    2
         All       12   12   24

In [34]: odd = crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, values=df['D'],
                        aggfunc=np.size)

In [37]: odd
Out[37]: 
         C        bar  foo  All
         A     B               
         one   A    2    2    4
               B    2    2    4
               C    2    2    4
         three A    2  NaN    2
               B  NaN    2    2
               C    2  NaN    2
         two   A  NaN    2    2
               B    2  NaN    2
               C  NaN    2    2
         All       24   24   24

TomAugspurger · 2013-06-23T14:54:34Z

OK, I'm leaning toward this being the correct behavior. The relevant line is

row_margin = data[cols + values].groupby(cols).agg(aggfunc)

We're working with the following groups:

bar       C  __dummy__
3   bar   1.523030
4   bar  -0.234153
5   bar  -0.234137
9   bar   0.542560
10  bar  -0.463418
11  bar  -0.465730
15  bar  -0.562288
16  bar  -1.012831
17  bar   0.314247
21  bar  -0.225776
22  bar   0.067528
23  bar  -1.424748
foo       C  __dummy__
0   foo   0.496714
1   foo  -0.138264
2   foo   0.647689
6   foo   1.579213
7   foo   0.767435
8   foo  -0.469474
12  foo   0.241962
13  foo  -1.913280
14  foo  -1.724918
18  foo  -0.908024
19  foo  -1.412304
20  foo   1.465649

With aggfunc = len, it aggregates to 12 and 12 for foo and bar, since each DataFrame is 12 items long.

With aggfunc = np.size, it aggregates to 24 and 24 since each DataFrame is 12x2.

So the behavior is not necessarily intuitive, but it is correct.

TomAugspurger · 2013-06-23T15:28:29Z

I guess it's also worth pointing out why they both get the same values for the All total. That bit of code is operating on just the column passed as values, which makes it a Series. So len and np.size will both give the same answer in this case.

brandonkane · 2013-06-23T20:34:43Z

Tom, makes sense, thanks for investigating. I guess my intuition would have been that in the case of the 12x2 dataframe, it should have excluded one of the columns for each margin. The margins should be an aggregation of the data above it, and in this case they aren't. I would be curious to see what others would expect in this scenario.

I'm fine just using len as the aggfunc instead of size, so this isn't a high priority thing.

dragoljub · 2014-07-29T19:10:24Z

I just ran into this issue myself using the df.pivot_table(..., margins=True). I always felt it was somehow wrong to use np.size to count the number of rows in the groupings and now I have a reason to avoid it. In my case I had both the row and column sums doubled. Interestingly doing the pivot table on the above DataFrame does not appear to cause the problem?

df.pivot_table(values='D', rows=['A', 'B'], cols='C', aggfunc=np.size, fill_value=0, margins=True)
Out[14]: 
C        bar  foo  All
A     B               
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24

df.pivot_table(values='D', rows=['A', 'B'], cols='C', aggfunc=len, fill_value=0, margins=True)
Out[15]: 
C        bar  foo  All
A     B               
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24

jreback · 2016-03-09T21:47:46Z

@nickeubank can you confirm this is still an issue?

nickeubank · 2016-03-09T22:03:55Z

@jreback Looks fixed to me --

pd.crosstab(index=[df['A'],df['B']], columns=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])
Out[243]: 
        size              mean                    
C        bar foo All       bar       foo       All
A     B                                           
one   A    2   2   4  1.152823  0.455740  0.804282
      B    2   2   4 -0.033598  0.037608  0.002005
      C    2   2   4  0.081817 -0.449494 -0.183838
three A    2 NaN   2 -0.146240       NaN -0.146240
      B  NaN   2   2       NaN -0.269695 -0.269695
      C    2 NaN   2  0.063828       NaN  0.063828
two   A  NaN   2   2       NaN -0.496059 -0.496059
      B    2 NaN   2 -0.394447       NaN -0.394447
      C  NaN   2   2       NaN  0.863030  0.863030
All       12  12  24  0.120697  0.023522  0.072109

See 12, 12, 24 along bottom row?

jreback · 2016-03-09T22:12:21Z

hmm, ok. so if you want to do a PR for a confirming test would be great.

if you want to see if you can figure out where this was originally fixed we can reference.

missing assert

* origin/master: (22 commits) BUG: astype falsely converts inf to integer (GH14265) (pandas-dev#14343) BUG: Apply min_itemsize to index even when not appending DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys (pandas-dev#14788) BLD: missing - on secure BLD: new access token on pandas-dev TST: Test DatetimeIndex weekend offset (pandas-dev#14853) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ...

* commit 'v0.19.0-174-g81a2f79': (156 commits) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ENH: Introduce UnsortedIndexError GH11897 (pandas-dev#14762) ENH: Add the ability to have a separate title for each subplot when plotting (pandas-dev#14753) DOC: Fix grammar and formatting typos (pandas-dev#14803) BLD: try new build credentials for pandas-docs TST: Test pivot with categorical data MAINT: Cleanup pandas/src/parser (pandas-dev#14740) ...

release 0.19.1 was from release branch * releases: (156 commits) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ENH: Introduce UnsortedIndexError GH11897 (pandas-dev#14762) ENH: Add the ability to have a separate title for each subplot when plotting (pandas-dev#14753) DOC: Fix grammar and formatting typos (pandas-dev#14803) BLD: try new build credentials for pandas-docs TST: Test pivot with categorical data MAINT: Cleanup pandas/src/parser (pandas-dev#14740) ...

…das-dev#14755) missing assert

jreback mentioned this issue Mar 9, 2016

ENH: Add normalize to crosstab #12569

Closed

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Novice labels Mar 9, 2016

jreback modified the milestones: 0.18.1, Someday Mar 9, 2016

nickeubank mentioned this issue Mar 9, 2016

BUG: Crosstab margins ignoring dropna #12577

Closed

jreback added the Testing pandas testing functions or related to the test suite label Mar 9, 2016

jreback modified the milestones: 0.18.1, Next Major Release Apr 25, 2016

mroeschke added a commit to mroeschke/pandas that referenced this issue Nov 27, 2016

TST: Correct results with np.size and crosstab (pandas-dev#4003)

ebac1fd

mroeschke mentioned this issue Nov 27, 2016

TST: Correct results with np.size and crosstab (#4003) #14755

Merged

3 tasks

mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 10, 2016

TST: Correct results with np.size and crosstab (pandas-dev#4003)

7ae1072

mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 10, 2016

TST: Correct results with np.size and crosstab (pandas-dev#4003)

e703a8d

missing assert

jreback modified the milestones: 0.20.0, Next Major Release, 0.19.2 Dec 10, 2016

jreback closed this as completed in #14755 Dec 10, 2016

jreback pushed a commit that referenced this issue Dec 10, 2016

TST: Correct results with np.size and crosstab (#4003) (#14755)

b6de920

missing assert

ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016

TST: Correct results with np.size and crosstab (pandas-dev#4003) (pan…

9bfc678

…das-dev#14755) missing assert

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003

Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003

brandonkane commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

brandonkane commented Jun 23, 2013

dragoljub commented Jul 29, 2014

jreback commented Mar 9, 2016

nickeubank commented Mar 9, 2016

jreback commented Mar 9, 2016

Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003

Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003

Comments

brandonkane commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

TomAugspurger commented Jun 23, 2013

brandonkane commented Jun 23, 2013

dragoljub commented Jul 29, 2014

jreback commented Mar 9, 2016

nickeubank commented Mar 9, 2016

jreback commented Mar 9, 2016