Skip to content

BUG: Respect Filtered Categorical in Crosstab #13073

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented May 3, 2016

Closes #12298 by stripping the Categorical elements of their original categories so that only their included values are considered during the counting / aggregation.

@jreback
Copy link
Contributor

jreback commented May 3, 2016

this should put NaN where the categories are not represented, then dropna can control this.

@gfyoung
Copy link
Member Author

gfyoung commented May 3, 2016

Why would we want dropna to control this? What advantage would you have by counting unrepresented categories as NaN?

Also, not sure how best to efficiently implement that since you would have to iterate to check which categories are not represented and then replace them all with NaN? Gymnastics much?

@jreback
Copy link
Contributor

jreback commented May 3, 2016

In [19]: result = pd.crosstab(df_filtered.col0, df_filtered.col1)

In [20]: result.iloc[2] = np.nan

In [21]: result
Out[21]: 
col1    1    2    3
col0               
a     2.0  0.0  0.0
b     1.0  1.0  0.0
c     NaN  NaN  NaN

@jreback
Copy link
Contributor

jreback commented May 3, 2016

of course the default is dropna to maybe this should simply drop all of these (but if dropna=False) then should leave.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Categorical Categorical Data Type labels May 3, 2016
@gfyoung
Copy link
Member Author

gfyoung commented May 4, 2016

  1. Don't see an explanation as to why keeping NaN is useful.
  2. Your code won't work. When you call df.pivot_table, there will be no NaN's to fill at that point. The data you will see is {rowname: Categorical,...}

@gfyoung gfyoung force-pushed the crosstab-categorical branch from 5b9952b to efd24a1 Compare May 5, 2016 15:53
@codecov-io
Copy link

codecov-io commented May 5, 2016

Current coverage is 84.14%

Merging #13073 into master will increase coverage by +<.01%

@@             master     #13073   diff @@
==========================================
  Files           138        137     -1   
  Lines         50380      50231   -149   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          42380      42265   -115   
+ Misses         8000       7966    -34   
  Partials          0          0          
  1. 4 files (not in diff) in pandas/tseries were modified. more
    • Misses -4
    • Hits -26
  2. 3 files (not in diff) in pandas/io were modified. more
    • Misses +54
  3. 6 files (not in diff) in pandas/core were modified. more
    • Misses -16
    • Hits -69
  4. 3 files (not in diff) in pandas were modified. more
    • Misses -2
    • Hits -24
  5. File pandas/io/s3.py (not in diff) was deleted. more

Powered by Codecov. Last updated by 4de83d2...5b9952b

@jreback
Copy link
Contributor

jreback commented May 5, 2016

@gfyoung I know you keep updating this, but:

  • I think you need to deal with dropna=False and
  • simply converting to object (and listifying) is not a satisfying soln

so need to map out what the output should actually be. haven't really though this thru, but
I think categorical actually means something, you DO want to propogate it. (or at least we do in some situations). So need to figure out if this is one of those.

like others to weigh in here.

@jorisvandenbossche @JanSchulz

@gfyoung
Copy link
Member Author

gfyoung commented May 5, 2016

@jreback : for some reason, there was a merge conflict on my branch locally that wasn't showing up a GitHub, hence the update. In any case, I think your point of view hinges essentially on whether or not preserving Categorical information is needed. As a contributor, I generally would side with the user (in this case, whomever filed the issue) unless there is a strong argument against, which IMHO I have not seen either here or in the issue.

@gfyoung gfyoung force-pushed the crosstab-categorical branch 11 times, most recently from 1cd038d to 84e5e33 Compare May 11, 2016 23:14
@gfyoung
Copy link
Member Author

gfyoung commented May 12, 2016

@jorisvandenbossche @JanSchulz : any thoughts on what @jreback has said?

@gfyoung gfyoung force-pushed the crosstab-categorical branch from 84e5e33 to 31db27e Compare May 12, 2016 21:22
def _make_list(arr):
# see gh-12298
if com.is_categorical(arr):
arr = Series(list(arr), name=arr.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a more efficient way to do that (than convert to list in between), by using the dtype of the categories.

@jankatins
Copy link
Contributor

jankatins commented May 12, 2016

I think categorical actually means something, you DO want to propogate it. (or at least we do in some situations).

I'm firmly in that camp: if I have categoricals and some of them are not used, I want counts for that category show up as "0". So from my perspective #12298 works as intended and I would reject this premise (#12298 (comment)) and the row should not be dropped:

if we accept the premise that we don't want to always represent categorical results (IOW we treat them like regular results), then a filtered row should be dropped.

The only exception I can think of is if there is any way to distinguish between a empty categorical because of the filter and simply a not used category in the original data. But I don't think this is possible.

@gfyoung
Copy link
Member Author

gfyoung commented May 12, 2016

@JanSchulz : I wouldn't have major issues leaving the behaviour as is, for I was in the same camp as you are when I saw this issue as well. We could then just close this PR as well as the issue if there is enough consensus that this is indeed "correct" behaviour.

@jreback ? @jorisvandenbossche ?

@jorisvandenbossche
Copy link
Member

I follow @JanSchulz in this one

@gfyoung
Copy link
Member Author

gfyoung commented May 12, 2016

Alright, seems like consensus is to close this PR as well as #12298, which is perfectly fine with me. @jreback (or someone else with write access) can do the honours?

@jreback
Copy link
Contributor

jreback commented May 13, 2016

@gfyoung I tell you what. How about a nice example and explanation in the docs where cross tab is?

We'll close this one and the issue.

@jreback jreback closed this May 13, 2016
@jreback jreback added this to the No action milestone May 13, 2016
@gfyoung
Copy link
Member Author

gfyoung commented May 13, 2016

@jreback : Sounds good. Will do.

@gfyoung gfyoung deleted the crosstab-categorical branch May 13, 2016 00:46
jreback pushed a commit that referenced this pull request May 14, 2016
Follow-on to #13073 by explaining the `Categorical` behaviour in the
documentation.

Author: gfyoung <[email protected]>

Closes #13177 from gfyoung/crosstab-categorical-explain and squashes the following commits:

11ebb94 [gfyoung] DOC: Clarify Categorical Crosstab Behaviour
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants