API/BUG: inconsistent plotting of new CategoricalIndex #10254

Open
jorisvandenbossche opened this issue Jun 3, 2015 · 26 comments
Labels
Bug Categorical Categorical Data Type Visualization plotting

Comments

@jorisvandenbossche
Member

To record the issue we discussed yesterday, to be solved for 0.16.2

Plotting of categorical:

  • y-axis: the values themselves are used
  • x-axis: they are not regarded as values, but as unique items that are simply represented by range(len(cat)) as x data
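The two treatments described above can be sketched without any plotting at all. This is only an illustration of the two kinds of data involved, not the actual pandas plotting code; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 4, 2])

# "y-axis" treatment: the values themselves are used as data.
y_data = np.asarray(cat)

# "x-axis" treatment: the values are ignored and each element simply
# gets the next integer position, i.e. range(len(cat)).
x_data = np.arange(len(cat))

print(y_data.tolist())  # [1, 2, 4, 2]
print(x_data.tolist())  # [0, 1, 2, 3]
```

Note that in the second treatment the two elements with value 2 end up at different positions (1 and 3), which is the "values with the same category should be regarded as equal" problem.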

Overview: http://nbviewer.ipython.org/gist/jorisvandenbossche/992d9d34dbfcfd8bc326

Disclaimer: didn't yet look into the code to see why this is like this.

Way forward:

  • for now: handle CategoricalIndex the same as a column with Categorical values (so use the values themselves)
    • this also means that a CategoricalIndex with string categories will now raise (in 0.16.1 it did not, but plotted all values just in order, also discarding the fact that values with the same category should be regarded as equal)
  • later, we can try to implement fancier / more intelligent categorical plotting (there are already #9069 "Feature request: Categorical plotting" and #8712 "ENH: support Categorical hist plotting" about this)
@jorisvandenbossche
Member Author

@TomAugspurger Can you look at this? (I won't have time until the second half of next week.) Or maybe @sinhrks?

@TomAugspurger
Contributor

If I have a chance I will this weekend.


@TomAugspurger
Contributor

@jorisvandenbossche what do you think about the following two rules when plotting Categoricals (Series or Index):

  1. position is always determined by code
  2. label is always determined by category

The biggest drawback I see is that even if a Categorical is ordered, the codes may not be ordered. We can get around that by sorting the categorical before plotting, but that could easily be surprising.
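A minimal sketch of the two rules for a bar plot, assuming a Series of numeric values indexed by a CategoricalIndex with unique entries. `bar_by_code` is a hypothetical helper written for this illustration, not a pandas function, and it calls matplotlib directly rather than going through `Series.plot`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def bar_by_code(series, ax=None):
    """Hypothetical helper: bar positions from codes, tick labels from categories."""
    index = pd.CategoricalIndex(series.index)
    if ax is None:
        _, ax = plt.subplots()
    # Rule 1: position is determined by the integer code.
    ax.bar(index.codes, series.to_numpy())
    # Rule 2: label is determined by the category.
    ax.set_xticks(list(range(len(index.categories))))
    ax.set_xticklabels([str(c) for c in index.categories])
    return ax


# Integer categories 1, 2, 4 are drawn at evenly spaced positions 0, 1, 2.
counts = pd.Series([5, 3, 7], index=pd.CategoricalIndex([1, 2, 4]))
ax = bar_by_code(counts)
```

With these rules the gap between categories 2 and 4 is drawn as one unit, the same as between 1 and 2.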

@jorisvandenbossche
Member Author

Another drawback is that when you have integer categories, this can look odd (e.g. the example I use in the notebook: categories [1,2,4] but codes [0,1,2]).
Although I am not really sure that is a drawback. In any case, for a categorical series, that would be a change to how it works at the moment. But in practice, I don't know how often you would end up with integer categories where the intervals between the categories are not equal.
For the originally reported issue #10140, this will still mess up the scaling if you try to adapt your figure based on the values you see.

@TomAugspurger
Contributor

I think that even for integer categories the spacing should consistently be one "unit", even if the categories aren't.

FWIW (if I'm doing this correctly) R / ggplot does it this way:

library(dplyr)
library(ggplot2)
mutate(mtcars, cyl = ifelse(cyl == 8, 11, cyl)) %>%
  ggplot(aes(factor(cyl))) + geom_bar()

The last one is further away from the others, but since it's a categorical / factor the magnitude of the difference is ignored.
[Image: screenshot of the resulting ggplot bar chart]

Agreed that if we use those rules it will be an API change (so wait till 0.17).

@shoyer
Member

shoyer commented Jun 9, 2015

> 1. position is always determined by code
> 2. label is always determined by category

Yes, I think these are definitely the right rules. This is also consistent with stripplot from seaborn (dev):
http://stanford.edu/~mwaskom/software/seaborn-dev/generated/seaborn.stripplot.html#seaborn.stripplot

> The biggest drawback I see is that even if a Categorical is ordered, the codes may not be ordered. We can get around that by sorting the categorical before plotting, but that could easily be surprising.

Actually, the way that Categorical works, codes are always in the same order as the categories -- note that pd.Categorical.order works simply by calling np.sort(self._codes). So I don't think we need to sort the categorical first... but maybe I'm misunderstanding something.
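A small sketch of that point, using string categories whose logical order differs from their alphabetical order. The example data is made up, and `Categorical.sort_values` is used here as the present-day spelling of sorting a Categorical.

```python
import numpy as np
import pandas as pd

# Explicit category order: lo < mid < hi (not alphabetical).
cat = pd.Categorical(
    ["lo", "hi", "mid"], categories=["lo", "mid", "hi"], ordered=True
)

# Codes are indices into `categories`, so they follow the category order.
codes = cat.codes.tolist()

# Sorting the codes and mapping them back to categories sorts the values.
values_via_codes = list(cat.categories.take(np.sort(cat.codes)))
values_via_sort = list(cat.sort_values())

print(codes)             # [0, 2, 1]
print(values_via_codes)  # ['lo', 'mid', 'hi']
```

Because the codes already encode the category order, using them as positions needs no extra sorting step.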

@TomAugspurger
Contributor

In [1]: s = pd.Series(pd.Categorical([0, 2, 1], ordered=True))

In [2]: s.cat.codes
Out[2]:
0    0
1    2
2    1
dtype: int8

Seems you're correct. I'm not sure what I was seeing yesterday then. That's good. I think this means we push until 0.17 for this?

@TomAugspurger TomAugspurger modified the milestones: 0.17.0, 0.16.2 Jun 9, 2015
@jorisvandenbossche
Member Author

I also agree with the rules, the discussion point is more how to put them into place:

  1. When opting for "position is always determined by code" (and as a consequence: not by the value of numerical categories), this will not fix the problems @therriault experienced in #10140 ("Grouping by pd.cut() results creates a categorical index"). And this will be a backwards-incompatible change.
  2. And if we choose to do this API change, the question is what we do for 0.16.2: a) already do the change, b) leave it broken (for CategoricalIndex, not for Categorical/Series), or c) fix the CategoricalIndex to be the same as Categorical/Series, and change it again in 0.17

@TomAugspurger
Contributor

I see... I'd vote for b. Since it is an API change, we don't want to break things in a minor version. c doesn't feel good to me since we're putting plotting w/ categoricalIndex into a state that will only exist for a few months before it's really fixed.

@jorisvandenbossche
Member Author

Yes, that is a fair point, but right now it is also completely broken (the way it is plotted now really doesn't make sense). And the release will only be the latest one for a few months, but a lot of people may keep using this release for a few years.

@jorisvandenbossche
Member Author

Other option would be: d) only change it for CategoricalIndex (broken in 0.16.1 anyway) and leave Categorical/Series as is and only change this in 0.17

@TomAugspurger
Contributor

I was just going to suggest that :) It's a bit weird because of the inconsistency between CategoricalIndex v. CategoricalSeries, but I suppose it's better for now.

@TomAugspurger TomAugspurger modified the milestones: 0.16.2, 0.17.0 Jun 9, 2015
@TomAugspurger
Contributor

Ok, I'll at least get a PR together for that, hopefully by Wednesday. To summarize:

  • As of 0.17 all Categorical plotting will have position determined by codes, labels by categories
  • CategoricalIndex plotting errors entirely right now, so we'll put that behavior in for 0.16.2
  • CategoricalSeries plotting "works" right now for numerical values, w/ position determined by categories instead of codes. We don't want to break API for people relying on that so it stays that way for now (maybe warn?)

@jorisvandenbossche
Member Author

One disadvantage of your rules above is that using the pandas plotting machinery and using matplotlib directly will result in different output:

import matplotlib.pyplot as plt
import pandas as pd
cat = pd.Series([1,10,100], dtype='category')
plt.plot(cat)
cat.plot()

So these two otherwise rather similar calls will end up differently.
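The difference can be shown without drawing anything. This sketch assumes that matplotlib converts the Series to a plain array (so it sees the values), while the proposed pandas rule would use the codes as positions; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

cat = pd.Series([1, 10, 100], dtype="category")

# What a plain array conversion of the Series yields: the values.
seen_as_values = np.asarray(cat).tolist()

# What the "position is determined by code" rule would use instead.
seen_as_codes = cat.cat.codes.tolist()

print(seen_as_values)  # [1, 10, 100]
print(seen_as_codes)   # [0, 1, 2]
```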

@sinhrks
Member

sinhrks commented Jun 10, 2015

Looks like a good rule for line, area and bar. Is the understanding below correct when the rule is applied to the other plot types?

  • hist, box and pie: No difference with normal plot. Use category values as plotting values. Positions are defined unrelated to categories.
  • scatter and hist: No difference with normal plot. Use category values as plotting values (position).

@TomAugspurger
Contributor

  • For hist, Categoricals really only make sense as a grouper. You (probably) shouldn't be binning them.
  • For box, again probably just as a grouper.
  • For pie I have no idea. I suppose it'd be the same, just make sure to use the categories as labels and not codes.
  • scatter I think we stick to the same rules: position by code, label by categories. The plot will be very similar to http://stanford.edu/%7Emwaskom/software/seaborn-dev/generated/seaborn.stripplot.html#seaborn.stripplot but w/o the coloring / jittering
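A rough sketch of the two uses mentioned in the list above: the Categorical as a grouper (hist, box) and as a source of x positions (scatter). The column names and data are made up, and this only prepares the data that a plot would consume rather than calling any pandas plotting method.

```python
import pandas as pd

df = pd.DataFrame({
    "size": pd.Categorical(
        ["s", "l", "m", "s", "l"], categories=["s", "m", "l"]
    ),
    "weight": [1.0, 9.0, 4.0, 2.0, 11.0],
})

# hist / box: the Categorical acts as a grouper, one group per category.
groups = {
    str(key): values.tolist()
    for key, values in df.groupby("size", observed=False)["weight"]
}

# scatter: position by code, label by category.
x_positions = df["size"].cat.codes.tolist()
x_labels = list(df["size"].cat.categories)

print(groups)       # {'s': [1.0, 2.0], 'm': [4.0], 'l': [9.0, 11.0]}
print(x_positions)  # [0, 2, 1, 0, 2]
```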

I'm still working on this. I have the positioning using codes for most of the plots. Having trouble getting the labels to update properly for the tests.

@TomAugspurger
Contributor

Ok... this is messier than my original rules. Right now, for line plots we don't do any relabeling of x / y-ticks (aside from time series). Between that precedent, the fact that categories don't really make sense for line plots (what's "between" the categories?), and matplotlib's global state, which means people may plot multiple things to the same axes, I don't think we should support LinePlots with CategoricalIndex / CategoricalSeries. Same goes for Area. I'm not saying we should raise an error. We just don't have a good way of separating the value plotted from the label that's attached to it.

BarPlot absolutely makes sense. I think I have that "working" for a CategoricalIndex. Things are still a bit strange since we don't actually use the .codes.

[Image: screenshot of the bar plot for a CategoricalIndex]

This is consistent with how regular Indexes work with barplots when there are dupes.

@TomAugspurger
Contributor

I'm just going around in circles at this point... I think our plotting is fine. The only case I see for plotting a categorical is bar, which already works for CategoricalIndex (regardless of category type).

Maybe I'm missing something.

@shoyer
Member

shoyer commented Jun 12, 2015

@TomAugspurger I think line plots make sense for categorical variables if you use style='o' -- then they're basically scatter plots (which also make sense for categoricals). So that's at least two plot types.

@jreback
Contributor

jreback commented Jun 12, 2015

is the conclusion that we don't do anything for 0.16.2 but make an API change for 0.17.0?

@jreback
Contributor

jreback commented Jun 12, 2015

@jorisvandenbossche @TomAugspurger moving this to 0.17.0 unless you have something imminent.

@jreback jreback modified the milestones: 0.16.2, 0.17.0 Jun 12, 2015
@TomAugspurger
Contributor

Nothing imminent.


@jreback
Contributor

jreback commented Jul 9, 2016

matplotlib/matplotlib#6689

very exciting by mpl!

this will be in 2.0?
@tacaswell

@jreback
Contributor

jreback commented Jul 9, 2016

@jorisvandenbossche @TomAugspurger
@sinhrks

we have a bunch of categorical plotting issues
should create a master issue

@tacaswell
Contributor

This is aimed at 2.1

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0, Next Major Release May 2, 2017
@TomAugspurger
Contributor

Well, this is kinda "fixed" since we drop non-numeric data before plotting, and AFAICT, categorical is always considered non-numeric, regardless of the underlying type. This can be improved, but isn't a blocker.
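The "categorical is always considered non-numeric" part can be checked with public pandas API. This is only a sketch of the dtype check, using made-up column names; it does not reproduce the internal code path the plotting machinery takes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "x": pd.Categorical([1, 2, 4]),  # integer categories, still a categorical dtype
    "y": [0.5, 1.5, 2.5],
})

# A categorical column is not reported as numeric, even with integer categories.
categorical_is_numeric = is_numeric_dtype(df["x"])

# Selecting only numeric columns therefore drops the categorical one.
numeric_only = df.select_dtypes(include="number")

print(categorical_is_numeric)      # False
print(list(numeric_only.columns))  # ['y']
```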

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

8 participants