
crosstab's dependency on common index produces undesirable error #20496

Open
jsh9 opened this issue Mar 27, 2018 · 5 comments
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jsh9

jsh9 commented Mar 27, 2018

Code Sample, a copy-pastable example if possible

I found that when two pandas Series of the same length but with different indices are passed to crosstab(), the cross-tabulation result is incorrect.

import numpy as np
import pandas as pd

# a Series of length 15, with index = 0,2,4,6,8,...
x = pd.Series([1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3], index=range(0,30,2))

# a Series of length 15, with index = 0,1,2,3,4,...
y = pd.Series([0,0,0,0,1, 1,1,1,1,0, 1,1,1,1,0], index=range(0,15,1))

# convert either one to numpy array, eliminating index
print(pd.crosstab(np.array(x), y, margins=True))

# pass them to crosstab() as is, keeping their (different) indices
print(pd.crosstab(x, y, margins=True))

The output:

col_0  0  1  All
row_0           
1      4  1    5
2      1  4    5
3      1  4    5
All    6  9   15

col_0  0  1  All
row_0           
1      2  3    5
2      1  2    3
All    3  5    8

Problem description

The second output is problematic (look at the margins and the total count, 8): when crosstab() aggregates x and y as Series, it only looks at the elements whose index labels appear in both, which can omit some (or even all) values.

On the other hand, if either x or y is passed in as a numpy array (i.e., without any index), then the numpy array "adopts" the index of the other array, resulting in a correct result.
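A minimal sketch of where the total of 8 comes from, assuming crosstab() keeps only the index labels shared by both Series (the variable names `common`, `aligned`, `manual`, and `positional` are illustrative, not from the report). Dropping the index with `reset_index(drop=True)` is one way to pair the values by position instead:

```python
import pandas as pd

x = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3], index=range(0, 30, 2))
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0], index=range(0, 15, 1))

# Labels present in both indices: 0, 2, 4, ..., 14
common = x.index.intersection(y.index)

# Passing the Series as-is pairs values by shared label
aligned = pd.crosstab(x, y)

# Restricting both Series to the shared labels by hand gives the same counts
manual = pd.crosstab(x.loc[common], y.loc[common])

# Discarding the indices pairs values by position instead
positional = pd.crosstab(x.reset_index(drop=True), y.reset_index(drop=True))

print(len(common))                       # 8
print(int(aligned.to_numpy().sum()))     # 8
print(int(positional.to_numpy().sum()))  # 15
```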

I am not sure whether such a behavior is by design or not. If so, maybe a warning can be raised telling users to expect strange cross tabulation results?

Output of pd.show_versions()

python: 3.6.3.final.0
pandas: 0.22.0
numpy: 1.13.3

@TomAugspurger
Contributor

I think it's by design, if I'm reading the docs at http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.crosstab.html correctly:

> In the event that there aren’t overlapping indexes an empty DataFrame will be returned.

That could be rephrased slightly to mention alignment, which is the term we usually use.

> maybe a warning can be raised telling users to expect strange cross tabulation results?

I don't think that's desired here. Alignment is the default behavior in pandas, and I think it's expected by most people who've used pandas for a while. Warnings too easily become annoying, or get missed in non-interactive settings.
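As a small illustration of alignment elsewhere in pandas (the Series `a` and `b` are made up for this sketch), arithmetic between two Series also matches values by index label rather than by position. Note that arithmetic keeps the union of the labels and fills the unmatched ones with NaN, whereas the crosstab() result in this issue reflects only the shared labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[1, 2, 3])

# Values are paired by label: 1 -> 2 + 10, 2 -> 3 + 20.
# Labels 0 and 3 exist in only one Series, so they become NaN.
total = a + b

print(total)
```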

TomAugspurger added the Docs and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Mar 27, 2018
@jsh9
Author

jsh9 commented Mar 27, 2018

Thank you for the comments.

In this case, may I suggest some updates to the documentation? It would also be helpful to mention the behavior when Python lists and/or numpy arrays are provided.
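A hedged sketch of the list/array case, using made-up data (`labels`, `flags`, and `table` are illustrative names). Assuming the behavior described earlier in this thread, an input without an index is paired by position with the Series, whatever the Series' index labels are:

```python
import pandas as pd

# A plain Python list carries no index
labels = ["a", "a", "b", "b"]

# A Series whose index labels share nothing with the positions 0..3
flags = pd.Series([0, 1, 1, 1], index=[100, 200, 300, 400])

# The list is paired with the Series element by element
table = pd.crosstab(labels, flags)

print(table)
```

All four observations are counted here, because there is no second index to align against.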

Also, the explanations of the values and aggfunc parameters are quite obscure, and there are no examples below them to demonstrate their use. May I also suggest improving this?

@TomAugspurger
Contributor

TomAugspurger commented Mar 27, 2018 via email

@jsh9
Author

jsh9 commented Mar 27, 2018

I can help with explaining the behaviors when passing numpy arrays or Python lists, but I really don't understand how to use values and aggfunc.
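For reference, a minimal sketch of how values and aggfunc can be used together, on a made-up sales table (the column names and numbers are hypothetical). Without them, crosstab() counts how often each row/column combination occurs; with them, it aggregates the given values within each combination instead:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "product": ["a", "b", "a", "a", "b"],
    "sales": [10, 20, 30, 40, 50],
})

# Default: number of rows per (region, product) pair
counts = pd.crosstab(df["region"], df["product"])

# With values and aggfunc: sum of sales per (region, product) pair
totals = pd.crosstab(df["region"], df["product"], values=df["sales"], aggfunc="sum")

print(counts)
print(totals)
```

Here the two "south"/"a" rows contribute a count of 2 in `counts` and a sum of 70 in `totals`.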

@TomAugspurger
Contributor

TomAugspurger commented Mar 27, 2018 via email
