API/BUG: inconsistent plotting of new CategoricalIndex #10254

jorisvandenbossche · 2015-06-03T08:06:49Z

To record the issue we discussed yesterday, to be solved for 0.16.2

Plotting of categorical:

y-axis: the values themselves are used
x-axis: the entries are not treated as values; each entry is treated as a distinct item and plotted at its position in range(len(cat))
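
A minimal sketch of the x-axis behaviour described above, using a made-up index (the data and variable names are illustrative assumptions, not taken from the linked notebook). It contrasts one slot per row with the integer codes the categorical itself stores:

```python
import pandas as pd

# Made-up index in which the category "a" appears twice.
idx = pd.CategoricalIndex(["a", "b", "a", "c"])

# One slot per row: repeated categories end up at different positions.
row_positions = list(range(len(idx)))

# The codes stored by the categorical: equal categories share a code.
codes = idx.codes.tolist()

print(row_positions)  # [0, 1, 2, 3]
print(codes)          # [0, 1, 0, 2]
```

With row positions, the two "a" entries land at 0 and 2, so the plot loses the fact that they belong to the same category.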

Overview: http://nbviewer.ipython.org/gist/jorisvandenbossche/992d9d34dbfcfd8bc326

Disclaimer: didn't yet look into the code to see why this is like this.

Way forward:

for now: handle CategoricalIndex the same as a column with Categorical values (so use the values themselves)
- this also means that a CategoricalIndex with string categories will now raise (in 0.16.1 it did not; it plotted all values in order, discarding the fact that values with the same category should be regarded as equal)
later, we can try to implement fancier / more intelligent categorical plotting (there are already two issues about this: Feature request: Categorical plotting #9069 and ENH: support Categorical hist plotting #8712)


jorisvandenbossche · 2015-06-03T22:12:55Z

@TomAugspurger Can you look at this (I won't have time until second half of next week) Or maybe @sinhrks ?

TomAugspurger · 2015-06-04T00:13:27Z

If I have a chance I will this weekend.


TomAugspurger · 2015-06-08T12:54:55Z

@jorisvandenbossche what do you think about the following two rules when plotting Categoricals (Series or Index):

position is always determined by code
label is always determined by category

The biggest drawback I see is that even if a Categorical is ordered, the codes may not be ordered. We can get around that by sorting the categorical before plotting, but that could easily be surprising.
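
The two proposed rules can be sketched as follows. `positions_and_labels` is a hypothetical helper written for illustration only; it is not a pandas function and not the implementation discussed in this thread.

```python
import pandas as pd


def positions_and_labels(cat):
    """Hypothetical helper: split a Categorical into plot positions and tick labels."""
    positions = cat.codes.tolist()                      # rule 1: position comes from the code
    tick_positions = list(range(len(cat.categories)))   # one tick per category
    tick_labels = [str(c) for c in cat.categories]      # rule 2: label comes from the category
    return positions, tick_positions, tick_labels


# Made-up ordered categorical whose category order is not alphabetical.
cat = pd.Categorical(
    ["low", "high", "mid", "low"],
    categories=["low", "mid", "high"],
    ordered=True,
)
positions, tick_positions, tick_labels = positions_and_labels(cat)

print(positions)       # [0, 2, 1, 0]
print(tick_positions)  # [0, 1, 2]
print(tick_labels)     # ['low', 'mid', 'high']
```

The tick positions and labels could then be handed to a plotting backend's tick-setting calls; that step is left out here to keep the sketch free of plotting dependencies.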

jorisvandenbossche · 2015-06-08T13:05:35Z

Another drawback is that when you have integer categories, this can look odd (e.g. the example I use in the notebook, with categories [1,2,4] but codes [0,1,2]).
Although I am not really sure that is a drawback. In any case, for a categorical series, that would be a change from how it works at the moment. But in practice, I don't know how often you would end up with integer categories where the intervals between the categories are not equal.
For the originally reported issue #10140, this will still mess up the scaling if you try to adapt your figure based on the values you see.

TomAugspurger · 2015-06-08T13:47:22Z

I think that even for integer categories the spacing should consistently be one "unit", even if the categories aren't.

FWIW (if I'm doing this correctly) R / ggplot does it this way:

library(dplyr)
library(ggplot2)

# Recode 8 cylinders as 11, then plot counts per factor level
mutate(mtcars, cyl = ifelse(cyl == 8, 11, cyl)) %>%
  ggplot(aes(factor(cyl))) + geom_bar()

The last category (11) is numerically further away from the others, but since it's a categorical / factor the magnitude of the difference is ignored.
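
A rough pandas analogue of the R snippet, as a sketch only. The cylinder values below are made up to stand in for mtcars (which is not bundled with pandas), and the variable names are illustrative. It computes what a bar plot would need under the "one unit per category" idea:

```python
import numpy as np
import pandas as pd

# Made-up cylinder values, with 8 already recoded to 11 as in the R example.
cyl = pd.Series([4, 6, 11, 4, 11, 6, 4]).astype("category")

n_categories = len(cyl.cat.categories)
bar_positions = np.arange(n_categories)        # evenly spaced: 0, 1, 2
bar_labels = cyl.cat.categories.tolist()       # labels keep the real values
bar_heights = np.bincount(cyl.cat.codes.to_numpy(), minlength=n_categories)

print(bar_labels)                       # [4, 6, 11]
print(bar_heights.tolist())             # [3, 2, 2]
print(np.diff(bar_positions).tolist())  # [1, 1]
```

The gap between 6 and 11 is five in value terms, but the bar positions differ by exactly one.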

Agreed that if we use those rules it will be an API change (so wait till 0.17).

shoyer · 2015-06-09T05:28:01Z

> position is always determined by code
>
> label is always determined by category

Yes, I think these are definitely the right rules. This is also consistent with stripplot from seaborn (dev):
http://stanford.edu/~mwaskom/software/seaborn-dev/generated/seaborn.stripplot.html#seaborn.stripplot

> The biggest drawback I see is that even if a Categorical is ordered, the codes may not be ordered. We can get around that by sorting the categorical before plotting, but that could easily be surprising.

Actually, the way that Categorical works, codes are always in the same order as the categories -- note that pd.Categorical.order works simply by calling np.sort(self._codes). So I don't think we need to sort the categorical first... but maybe I'm misunderstanding something.
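
A small check of that claim, sketched with made-up data. It uses `Categorical.sort_values`, available in later pandas versions, rather than the `order` method named above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ordered categorical; the category order is deliberately not alphabetical.
cat = pd.Categorical(
    ["mid", "high", "low", "mid"],
    categories=["low", "mid", "high"],
    ordered=True,
)

# Sorting positions by integer code alone...
order = np.argsort(cat.codes, kind="stable")
sorted_by_codes = list(cat[order])

# ...gives the same sequence as sorting the categorical itself.
sorted_by_category = list(cat.sort_values())

print(cat.codes.tolist())   # [1, 2, 0, 1]
print(sorted_by_codes)      # ['low', 'mid', 'mid', 'high']
print(sorted_by_category)   # ['low', 'mid', 'mid', 'high']
```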

TomAugspurger · 2015-06-09T12:32:55Z

In [1]: s = pd.Series(pd.Categorical([0, 2, 1], ordered=True))

In [2]: s.cat.codes
Out[2]:
0    0
1    2
2    1
dtype: int8

Seems you're correct. I'm not sure what I was seeing yesterday then. That's good. I think this means we push until 0.17 for this?

jorisvandenbossche · 2015-06-09T12:34:14Z

I also agree with the rules, the discussion point is more how to put them into place:

When opting for "position is always determined by code" (and, as a consequence, not by the value of numerical categories), this will not fix the problems @therriault experienced in Grouping by pd.cut() results creates a categorical index #10140. It will also be a backwards-incompatible change.
And if we choose to make this API change, the question is what we do for 0.16.2: a) make the change already, b) leave it broken (for CategoricalIndex, not for Categorical/Series), or c) fix CategoricalIndex to behave the same as Categorical/Series, and change it again in 0.17

TomAugspurger · 2015-06-09T12:38:33Z

I see... I'd vote for b. Since it is an API change, we don't want to break things in a minor version. c doesn't feel good to me since we're putting plotting w/ categoricalIndex into a state that will only exist for a few months before it's really fixed.

jorisvandenbossche · 2015-06-09T12:40:34Z

Yes, that's a fair point, but right now it is completely broken (the way it is plotted now really doesn't make sense). And the release will only be the latest one for a few months, but a lot of people may keep using it for a few years.

jorisvandenbossche · 2015-06-09T12:41:46Z

Other option would be: d) only change it for CategoricalIndex (broken in 0.16.1 anyway) and leave Categorical/Series as is and only change this in 0.17

TomAugspurger · 2015-06-09T12:42:47Z

I was just going to suggest that :) It's a bit weird because of the inconsistency between CategoricalIndex v. CategoricalSeries, but I suppose it's better for now.

TomAugspurger · 2015-06-09T12:46:19Z

Ok, I'll at least get a PR together for that, hopefully by Wednesday. To summarize

As of 0.17 all Categorical plotting will have position determined by codes, labels by categories
CategoricalIndex plotting errors entirely right now, so we'll put that behavior in for 0.16.2
CategoricalSeries plotting "works" right now for numerical values, w/ position determined by categories instead of codes. We don't want to break API for people relying on that so it stays that way for now (maybe warn?)

jorisvandenbossche · 2015-06-09T12:46:47Z

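One disadvantage of your rules above is that the pandas plotting machinery and matplotlib will then produce different output:

import matplotlib.pyplot as plt
import pandas as pd

cat = pd.Series([1,10,100], dtype='category')
plt.plot(cat)   # matplotlib called directly
cat.plot()      # pandas plotting machinery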

So these two otherwise rather similar calls will end up differently.
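
To make the difference concrete without drawing anything, the sketch below shows the two candidate y-values for the same series. It assumes a direct matplotlib call sees the category values (via array conversion) while the proposed pandas rule would use the codes; what each library actually does may vary by version.

```python
import numpy as np
import pandas as pd

cat = pd.Series([1, 10, 100], dtype="category")

# Converting to a plain array yields the category values.
values = np.asarray(cat)

# The proposed rule ("position is determined by code") would use these instead.
codes = cat.cat.codes.to_numpy()

print(values.tolist())  # [1, 10, 100]
print(codes.tolist())   # [0, 1, 2]
```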

sinhrks · 2015-06-10T12:34:31Z

The rule looks good for line, area and bar. Is the understanding below correct when the rule is applied to the other plot kinds?

hist, box and pie: No difference from a normal plot. Use category values as plotting values. Positions are defined independently of the categories.
scatter and hist: No difference with normal plot. Use category values as plotting values (position).

TomAugspurger · 2015-06-10T12:58:37Z

For hist, Categoricals really only make sense as a grouper. You (probably) shouldn't be binning them.
For box, again probably just as a grouper.
For pie I have no idea. I suppose it'd be the same, just make sure to use the categories as labels and not codes.
scatter I think we stick to the same rules: position by code, label by categories. The plot will be very similar to http://stanford.edu/%7Emwaskom/software/seaborn-dev/generated/seaborn.stripplot.html#seaborn.stripplot but w/o the coloring / jittering
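
The "categorical as a grouper" idea for hist and box can be sketched like this. The frame and column names are made up for illustration; the resulting per-group lists are what one would pass to a histogram or box plot call, one per category.

```python
import pandas as pd

# Made-up data: a categorical grouping column and a numeric value column.
df = pd.DataFrame(
    {
        "group": pd.Categorical(["a", "b", "a", "c", "b"], categories=["a", "b", "c"]),
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Use the categorical only to split the numeric data; it is never binned itself.
grouped = {
    str(name): sub["value"].tolist()
    for name, sub in df.groupby("group", observed=False)
}

print(grouped)  # {'a': [1.0, 3.0], 'b': [2.0, 5.0], 'c': [4.0]}
```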

I'm still working on this. I have the positioning using codes for most of the plots. Having trouble getting the labels to update properly for the tests.

TomAugspurger · 2015-06-11T21:18:12Z

Ok... this is messier than my original rules. Right now, for line plots we don't do any relabeling of x / y-ticks (aside from time series). Given that precedent, the fact that categories don't really make sense for line plots (what's "between" the categories?), and matplotlib's global state, which lets people plot multiple things to the same axes, I don't think we should support LinePlots with CategoricalIndex / CategoricalSeries. Same goes for Area. I'm not saying we should raise an error. We just don't have a good way of separating the value plotted from the label that's attached to it.

BarPlot absolutely makes sense. I think I have that "working" for a CategoricalIndex. Things are still a bit strange since we don't actually use the .codes.

This is consistent with how regular Indexes work with barplots when there are dupes.

TomAugspurger · 2015-06-12T02:14:42Z

I'm just going around in circles at this point... I think our plotting is fine. The only case I see for plotting a categorical is bar, which already works for CategoricalIndex (regardless of category type).

Maybe I'm missing something.

shoyer · 2015-06-12T02:46:42Z

@TomAugspurger I think line plots make sense for categorical variables if you use style='o' -- then they're basically scatter plots (which also make sense for categoricals). So that's at least two plot types.

jreback · 2015-06-12T03:39:33Z

is the conclusion that we don't do anything for 0.16.2 but make an API change for 0.17.0?

jreback · 2015-06-12T14:02:53Z

@jorisvandenbossche @TomAugspurger moving this to 0.17.0 unless you have something imminent.

TomAugspurger · 2015-06-12T14:37:36Z

Nothing imminent.


jreback · 2016-07-09T20:23:00Z

matplotlib/matplotlib#6689

very exciting by mpl!

this will be in 2.0?
@tacaswell

jreback · 2016-07-09T20:24:15Z

@jorisvandenbossche @TomAugspurger
@sinhrks

we have a bunch of categorical plotting issues
should create a master issue

tacaswell · 2016-07-10T02:00:27Z

This is aimed at 2.1

TomAugspurger · 2019-12-30T14:09:59Z

Well, this is kinda "fixed" since we drop non-numeric data before plotting, and AFAICT, categorical is always considered non-numeric, regardless of the underlying type. This can be improved, but isn't a blocker.
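
A sketch of why a categorical column drops out under a numeric-only filter, even when its categories are integers. It uses public pandas API only and is an assumption about the spirit of the check, not a copy of the internal plotting code path.

```python
import pandas as pd

# Made-up frame: one float column and one categorical column with integer categories.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0],
        "c": pd.Categorical([1, 10, 100]),
    }
)

# Keeping only numeric columns discards the categorical one.
numeric_only = df.select_dtypes(include="number")

# The dtype check agrees: category dtype is not numeric, whatever it contains.
is_numeric = pd.api.types.is_numeric_dtype(df["c"])

print(numeric_only.columns.tolist())  # ['x']
print(is_numeric)                     # False
```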

jorisvandenbossche added Bug API Design Categorical Categorical Data Type labels Jun 3, 2015

jorisvandenbossche added this to the 0.16.2 milestone Jun 3, 2015

jorisvandenbossche added the Visualization plotting label Jun 3, 2015

jorisvandenbossche mentioned this issue Jun 3, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10140) #10252

Closed

TomAugspurger modified the milestones: 0.17.0, 0.16.2 Jun 9, 2015

TomAugspurger modified the milestones: 0.16.2, 0.17.0 Jun 9, 2015

jreback modified the milestones: 0.16.2, 0.17.0 Jun 12, 2015

jreback mentioned this issue Jun 20, 2015

Grouping by pd.cut() results creates a categorical index #10140

Closed

jreback modified the milestones: Next Major Release, 0.17.0 Aug 19, 2015

jreback added the Prio-medium label Aug 19, 2015

themrmax mentioned this issue Nov 18, 2015

Series.hist() fails for String Series #5876

Closed

jorisvandenbossche modified the milestones: 1.0, Next Major Release May 2, 2017

jbrockmendel removed the Prio-medium label Oct 21, 2019

TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019

mroeschke removed the API Design label Apr 18, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
