-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Pandas crosstab margins double counting if values specifies a different field than rows/cols #4003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Just a note. The problem isn't that two functions are being passed. It's something to do with np.size being aggregated in a non-obvious way. In [32]: expected = crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, values=df[
'D'], aggfunc=len)
In [33]: expected
Out[33]:
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 NaN 2
B NaN 2 2
C 2 NaN 2
two A NaN 2 2
B 2 NaN 2
C NaN 2 2
All 12 12 24
In [34]: odd = crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, values=df['D'],
aggfunc=np.size)
In [37]: odd
Out[37]:
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 NaN 2
B NaN 2 2
C 2 NaN 2
two A NaN 2 2
B 2 NaN 2
C NaN 2 2
All 24 24 24 |
OK, I'm leaning toward this being the correct behavior. The relevant line is row_margin = data[cols + values].groupby(cols).agg(aggfunc) We're working with the following groups: bar C __dummy__
3 bar 1.523030
4 bar -0.234153
5 bar -0.234137
9 bar 0.542560
10 bar -0.463418
11 bar -0.465730
15 bar -0.562288
16 bar -1.012831
17 bar 0.314247
21 bar -0.225776
22 bar 0.067528
23 bar -1.424748
foo C __dummy__
0 foo 0.496714
1 foo -0.138264
2 foo 0.647689
6 foo 1.579213
7 foo 0.767435
8 foo -0.469474
12 foo 0.241962
13 foo -1.913280
14 foo -1.724918
18 foo -0.908024
19 foo -1.412304
20 foo 1.465649 With With So the behavior is not necessarily intuitive, but it is correct. |
I guess it's also worth pointing out why they both get the same values for the |
Tom, makes sense, thanks for investigating. I guess my intuition would have been that in the case of the 12x2 dataframe, it should have excluded one of the columns for each margin. The margins should be an aggregation of the data above it, and in this case they aren't. I would be curious to see what others would expect in this scenario. I'm fine just using len as the aggfunc instead of size, so this isn't a high priority thing. |
I just ran into this issue myself using the df.pivot_table(..., margins=True). I always felt it was somehow wrong to use np.size to count the number of rows in the groupings and now I have a reason to avoid it. In my case I had both the row and column sums doubled. Interestingly doing the pivot table on the above DataFrame does not appear to cause the problem? df.pivot_table(values='D', rows=['A', 'B'], cols='C', aggfunc=np.size, fill_value=0, margins=True)
Out[14]:
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 0 2
B 0 2 2
C 2 0 2
two A 0 2 2
B 2 0 2
C 0 2 2
All 12 12 24
df.pivot_table(values='D', rows=['A', 'B'], cols='C', aggfunc=len, fill_value=0, margins=True)
Out[15]:
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 0 2
B 0 2 2
C 2 0 2
two A 0 2 2
B 2 0 2
C 0 2 2
All 12 12 24 |
@nickeubank can you confirm this is still an issue? |
@jreback Looks fixed to me --
See 12, 12, 24 along bottom row? |
hmm, ok. so if you want to do a PR for a confirming test would be great. if you want to see if you can figure out where this was originally fixed we can reference. |
* origin/master: (22 commits) BUG: astype falsely converts inf to integer (GH14265) (pandas-dev#14343) BUG: Apply min_itemsize to index even when not appending DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys (pandas-dev#14788) BLD: missing - on secure BLD: new access token on pandas-dev TST: Test DatetimeIndex weekend offset (pandas-dev#14853) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ...
* commit 'v0.19.0-174-g81a2f79': (156 commits) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ENH: Introduce UnsortedIndexError GH11897 (pandas-dev#14762) ENH: Add the ability to have a separate title for each subplot when plotting (pandas-dev#14753) DOC: Fix grammar and formatting typos (pandas-dev#14803) BLD: try new build credentials for pandas-docs TST: Test pivot with categorical data MAINT: Cleanup pandas/src/parser (pandas-dev#14740) ...
release 0.19.1 was from release branch * releases: (156 commits) BLD: escape GH_TOKEN in build_docs TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755) Frame benchmarking sum instead of mean (pandas-dev#14824) CLN: lint of test_base.py BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844) BUG: GH11847 Unstack with mixed dtypes coerces everything to object TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851) BLD: try new gh token for pandas-docs CLN/PERF: clean-up of the benchmarks (pandas-dev#14099) ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799) DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801) TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752) BLD: use org name in build-docs.sh BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832) ENH: Introduce UnsortedIndexError GH11897 (pandas-dev#14762) ENH: Add the ability to have a separate title for each subplot when plotting (pandas-dev#14753) DOC: Fix grammar and formatting typos (pandas-dev#14803) BLD: try new build credentials for pandas-docs TST: Test pivot with categorical data MAINT: Cleanup pandas/src/parser (pandas-dev#14740) ...
See http://stackoverflow.com/questions/17236852/pandas-crosstab-double-counting-when-using-two-aggregate-functions for discussion.
To reproduce:
Create a test dataframe:
Crosstab gives the expected results:
However, if you try to get mean and count in the same crosstab, specifying values for the mean, the crosstab will double count the elements when calculating the margin total:
The text was updated successfully, but these errors were encountered: