ENH: categorical scatter plot #34293

Merged
merged 23 commits into from
Jan 27, 2021

Conversation

MarcoGorelli
Member

@MarcoGorelli MarcoGorelli commented May 21, 2020

Here's what this would look like:

image

Was just hoping to get initial feedback on whether this would be welcome

Some things I still need to do:

  • add docstring / annotations to _plot_colorbar
  • test_fast.sh passes locally but there are some CI failures, need to reproduce / figure out why (those tests pass now, failure was probably unrelated)
  • add an example of this functionality to the docs
  • if ints are passed, treat same way as categorical

@WillAyd
Member

WillAyd commented May 21, 2020

Is passing a string argument to c non-ambiguous?

@WillAyd WillAyd added the Visualization plotting label May 21, 2020
@MarcoGorelli
Member Author

Is passing a string argument to c non-ambiguous?

It's already supported, see this example from the docs:
image

However, currently, it's required that the column be numeric. This PR would add support for the categorical case.

Member

@WillAyd WillAyd left a comment

Looks generally good. A few comments @TomAugspurger

@@ -965,7 +967,10 @@ def _make_plot(self):
elif color is not None:
c_values = color
elif c_is_column:
c_values = self.data[c].values
if not is_categorical_dtype(self.data[c]):
Member

nit: instead of not is_categorical_dtype, can you leave this as is_categorical_dtype and switch the branches?

df["species"] = pd.Categorical(
["setosa", "setosa", "virginica", "virginica", "versicolor"]
)
_check_plot_works(df.plot.scatter, x=0, y=1, c="species")
Member

Is there a way within the test to check that the legend is drawn and the appropriate colors are applied?

@charlesdong1991
Member

charlesdong1991 commented May 21, 2020

emm, I think c is used to assign color values to the points; the usage here looks more like showing labels for scatter points to me, and I am not sure whether this will confuse users.

what do you think?

@WillAyd
Member

WillAyd commented May 21, 2020

So this covers unordered well - do we need to make any considerations for ordered categoricals as well?

@MarcoGorelli
Member Author

emm, I think c is used to assign color values to the points; the usage here looks more like showing labels for scatter points to me, and I am not sure whether this will confuse users.

what do you think?

This functionality already exists though, see the last example from here where they pass

ax2 = df.plot.scatter(x='length',
                      y='width',
                      c='species',
                      colormap='viridis')

The only difference is that currently, you can only pass c='species' if species is numeric. This PR (once I sort out the errors) would allow you to do c='species' if species is categorical, and then you'd get a discrete legend instead of a continuous colormap.

So this covers unordered well - do we need to make any considerations for ordered categoricals as well?

This should work with both, but in the issue linked above (#12380) they mention using discrete colorbars for ordered categoricals. That's a possibility, though that would make this PR bigger - perhaps it's best to have one PR deal with categoricals, and then another one to split between ordered and unordered?


Motivation for working on this comes from how in ggplot2 you can do

ggplot(data=Iris,aes(x=SepalWidthCm, y=SepalLengthCm,color=Species)) + geom_point()

and it works even if Species isn't numeric

@charlesdong1991
Member

charlesdong1991 commented May 21, 2020

This functionality already exists though

yeah, I know the functionality does exist, but based on the definition:

A column name or position whose values will be used to color the marker points according to a colormap.

Numeric values are used because they select colors from the colormap. With categorical values, though, I think it will only use the first N colors from the colormap (N being the number of categories). So this doc should be changed and well documented to avoid confusion, I think.

And @WillAyd has a good point on sorted categorical values: in this case the colors selected from the colormap are not the first N but are based on the order, so it would be nice to test those cases.
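To make the point about order concrete, here is a minimal sketch (not code from this PR) showing that the integer codes of a categorical follow its category order. The explicit category order used below is only an illustration.

```python
import pandas as pd

values = ["setosa", "setosa", "virginica", "virginica", "versicolor"]

# Explicit, ordered categories. This particular order is a made-up example.
ordered = pd.Categorical(
    values,
    categories=["versicolor", "virginica", "setosa"],
    ordered=True,
)

# Each code is the position of the value in `ordered.categories`,
# so the codes depend on the declared order, not on the data order.
ordered_codes = ordered.codes.tolist()
print(list(ordered.categories), ordered_codes)
```

If colors are picked from a colormap by code, changing the declared order therefore changes which color each category receives.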

@MarcoGorelli
Member Author

@charlesdong1991 ah I see what you mean now, thanks for clarifying. Yes, that's a good point, will think about this

@charlesdong1991
Member

charlesdong1991 commented May 21, 2020

you are welcome @MarcoGorelli

To keep the PR small (and get it merged faster ^^) while supporting categorical values for c, I would suggest focusing on sorted categorical values first, because it's very natural to think of the order as a way to select different colors from the colormap.

sorry, nvm, this PR has supported unordered case, so above does not make sense, sorry again for the noise

@MarcoGorelli
Member Author

Here's what I'm currently working on: if we start with

import matplotlib as mpl
import numpy as np

import pandas as pd

CMAP = "viridis"
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)

then we get the following:

  • unordered categorical
    image

  • ordered categorical
    image

  • numeric (no change to current behaviour)
    image

Will try to get a commit in with tests by the end of the day, but if the above screenshots indicate I'm on the wrong path please do let me know

@charlesdong1991
Member

charlesdong1991 commented May 22, 2020

I think, the second plot looks pretty good, and should be what it looks like if supporting categorical values.

I have mixed feelings about the first one. From a user's perspective it will be confusing, because unsorted categorical values behave differently from every other input to the c argument, and the colorbar is gone (currently, users expect to see a colorbar if they explicitly assign values).

So IMHO, categorical features, sorted or unsorted, should both look like the second plot. The only difference between the two would be which colors in the colormap the categories map to, based on the order.

@jorisvandenbossche
Member

I think it would in general be very nice to support "categorical" data (whether it is actually categorical dtype, or with non-numerical dtype) in a scatter plot.

But some comments:

  • IMO, we should not use a colorbar, as this is typically for a numerical scale (whether continuous or discrete colors). So I think we should rather use a legend than a colorbar (like the first plot in ENH: categorical scatter plot #34293 (comment)). But @charlesdong1991 do I understand you correctly that you would prefer the colorbar in all cases?
  • We should also not use a numerical colormap (like viridis), but rather the discrete color cycle from matplotlib in this case, I think?
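The legend option described in the two points above could look roughly like the following. This is a sketch in plain matplotlib, not the code in this PR: it draws one scatter per category so that each one picks up the next color from matplotlib's default color cycle and gets its own legend entry.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

fig, ax = plt.subplots()
# One scatter call per category: each call takes the next color from the
# default property cycle and contributes one legend entry.
for name, group in df.groupby("species", observed=True):
    ax.scatter(group["length"], group["width"], label=name)
ax.set_xlabel("length")
ax.set_ylabel("width")
legend = ax.legend(title="species")

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```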

@charlesdong1991
Member

charlesdong1991 commented May 22, 2020

@jorisvandenbossche I agree that it's also an option to have a legend rather than a colorbar for categorical values.

But the main concern is consistency in the meaning and expectations of c. Right now, users assign values (an array, a column name, or another allowed input), and c looks up colors from the colormap or uses specific colors, with a colorbar shown. It would be odd for users to see a legend only for categorical features.

So, if we use a legend, this change will need to be well documented for the c argument to reduce confusion, and it will probably affect colormap too. We should maybe also consider whether to assign it to c at all: since it plots a legend, assigning it to label sounds more aligned with the other plotting APIs.

@charlesdong1991
Member

emm, sorry, i think maybe legend looks a bit better in case of categorical values. i just tried out to use specific discrete colors to assign to c using the example here, and actually the plot does not look nice. Not sure if it is better to have legends or to have discrete colors in color bar here.

Screen Shot 2020-05-22 at 4 48 26 PM

The color bar is there but does not make sense, so probably we should make changes for such cases too (another PR though).

@MarcoGorelli
Member Author

i think maybe legend looks a bit better in case of categorical values

Is that for both ordered and unordered cases?

The color bar is there but does not make sense, so probably we should make changes for such cases too (another PR though).

OK, have opened #34316.

@charlesdong1991
Member

charlesdong1991 commented May 22, 2020

In your last example, the column passed to c is a column with color names, and we use those to color the points

yeah, right now, c accepts quite a wide range of cases, and colors referred to by name are one of them. It allows a color name, a color code, a list of them, or a column name/position whose numeric values point to the locations of colors in the colormap.

distinguish the case where the passed "categorical" values are 1) color names or

I think color names and color codes are natively and widely supported in matplotlib, so I'm not sure we even need to put effort into distinguishing them. In other words, whether the column is already categorical or a general object type, the values should be plotted correctly. Therefore, I think, even without this PR, the code below should also work (I haven't tried it, so I might be wrong):

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)

df['specifics'] = pd.Categorical(['r', 'b', 'b', 'r', 'g'])
df.plot.scatter(x=0, y=1, c='specifics')

  1. discrete non-numeric values that we want to use default colors for

To me, as a user, c is about colors, and that's what we want to use to color the points. That's why I had some initial concerns about what we should accept here.

I think the feature @MarcoGorelli wants to implement here is for categorical values that are not color names matplotlib can recognise, e.g. arbitrary category names. And we need to figure out how they could map to colors. To me, supporting sorted categorical values is more intuitive, because the order can be viewed as discrete numbers which we could use to find colors in the colormap (that's how numeric values are used to find colors; the difference is that here the values are discrete, so it's a special case of numeric values). Thus, I would prefer to see the discrete names plotted in the colorbar (as in his second plot), which presents the link between the arbitrary category names and the colormap. For unsorted categorical values, we could just treat the order as the order of appearance, as we do for unsorted objects.

So the above is the main reason why I would prefer discrete values in the colormap for categorical cases over legends (which might also be ambiguous to users), although they might not look as nice as legends in the plot.

Sorry to put so many words, and will be happy to discuss if you have further questions. @jorisvandenbossche
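The "order of appearance" idea in the comment above can be compared with what a default categorical does. This is only an illustrative sketch; pd.factorize stands in here for "order of appearance" and is not something the comment itself proposes.

```python
import pandas as pd

values = pd.Series(["setosa", "setosa", "virginica", "virginica", "versicolor"])

# Order of appearance: the first distinct value gets code 0, the next one 1, ...
appearance_codes, appearance_labels = pd.factorize(values)

# Default categorical: categories are inferred and sorted, so codes follow
# the sorted order rather than the order of appearance.
cat = pd.Categorical(values)

print(appearance_codes.tolist(), list(appearance_labels))
print(cat.codes.tolist(), list(cat.categories))
```

The two encodings disagree for "virginica" and "versicolor", so the choice between them changes which colormap position each category would map to.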

@MarcoGorelli
Member Author

I think the feature @MarcoGorelli wants to implement here is for categorical values that are not color names matplotlib can recognise, e.g. arbitrary category names. And we need to figure out how they could map to colors.

Yes, exactly - I was thinking that the mapping could be done using the codes attribute.

Anyway, I'll wait for further direction about whether a legend or a discrete colorbar would be preferred before adding further commits - thanks for the discussion/input so far!
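As a rough illustration of mapping through the codes attribute, the sketch below turns category codes into colormap colors. The even spacing over [0, 1] is an assumption made for the example, not necessarily what the PR does.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

species = pd.Series(
    pd.Categorical(["setosa", "setosa", "virginica", "virginica", "versicolor"])
)
codes = species.cat.codes.to_numpy()
n_categories = len(species.cat.categories)

cmap = plt.get_cmap("viridis")
# Spread the codes evenly over [0, 1] so each category gets a distinct color
# (assumed spacing, chosen for illustration).
positions = codes / max(n_categories - 1, 1)
rgba = cmap(positions)  # one RGBA row per point

print(np.round(rgba, 3))
```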

@MarcoGorelli
Member Author

Hi all - any update on whether pandas would like to go ahead with this, and whether a discrete colorbar or a legend would be preferable?

Contributor

@jreback jreback left a comment

looks reasonable to me, can you add a whatsnew note in other enhancements for 1.3

also may pay to add an example in visualization.rst

@@ -440,7 +441,9 @@ def _compute_plot_data(self):
if is_empty:
raise TypeError("no numeric data to plot")

self.data = numeric_data.apply(self._convert_to_ndarray)
self.data = numeric_data.apply(
Contributor

can you just change _convert_to_ndarray to handle this case? (and mention in the doc-string)

@jreback
Contributor

jreback commented Jan 3, 2021

cc @charlesdong1991 @ivanovmg @TomAugspurger if any comments.

Member

@ivanovmg ivanovmg left a comment

Looks good to me.
Couple of comments from my side.

Comment on lines 978 to -977
# The workaround below is no longer necessary.
return
Member

Is the comment in the line above still relevant?

# The workaround below is no longer necessary.

Member Author

I think so, yes, it's just saying that

        points = ax.get_position().get_points()
        cbar_points = cbar.ax.get_position().get_points()

        cbar.ax.set_position(
            [
                cbar_points[0, 0],
                points[0, 1],
                cbar_points[1, 0] - cbar_points[0, 0],
                points[1, 1] - points[0, 1],
            ]
        )
        # To see the discrepancy in axis heights uncomment
        # the following two lines:
        # print(points[1, 1] - points[0, 1])
        # print(cbar_points[1, 1] - cbar_points[0, 1])

        return cbar

is no longer necessary if mpl is of a modern enough version

Comment on lines 1033 to 1037
         elif c_is_column:
-            c_values = self.data[c].values
+            if color_by_categorical:
+                c_values = self.data[c].cat.codes
+            else:
+                c_values = self.data[c].values
Member

Maybe slightly modify the logic here?

...
elif color_by_categorical:
  c_values = self.data[c].cat.codes
elif c_is_column:
  c_values = self.data[c].values
else:
  c_values = c

This allows one to eliminate the inner if-else statement.
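The flattened branch order suggested above can be tried outside pandas with a small stand-alone function. pick_c_values is a hypothetical helper written for this sketch, and the way it computes c_is_column and color_by_categorical is a simplification, not the logic in the pandas source.

```python
import pandas as pd


def pick_c_values(data, c, color=None):
    """Hypothetical helper mirroring the suggested if/elif chain."""
    # Simplified stand-ins for the flags used in the snippet above.
    c_is_column = isinstance(c, str) and c in data.columns
    color_by_categorical = c_is_column and isinstance(
        data[c].dtype, pd.CategoricalDtype
    )

    if color is not None:
        return color
    elif color_by_categorical:
        return data[c].cat.codes
    elif c_is_column:
        return data[c].values
    else:
        return c


df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

print(pick_c_values(df, "species").tolist())  # categorical column: codes
print(pick_c_values(df, "length").tolist())   # numeric column: raw values
print(pick_c_values(df, "red"))               # not a column: passed through
```

Checking the categorical case before the general column case is what removes the nested if/else.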

Member Author

yes, good catch, thank you!

Comment on lines +699 to +704
@pytest.mark.parametrize("ordered", [True, False])
@pytest.mark.parametrize(
"categories",
(["setosa", "versicolor", "virginica"], ["versicolor", "virginica", "setosa"]),
)
def test_scatterplot_color_by_categorical(self, ordered, categories):
Member

Is there a way to add assertions on the colorbar ticklabels for the ordered case?

Member Author

I couldn't find a simple way to do this using the public API from mpl, but can look into it further (or do you have a suggestion?)

Member

I look at my comment once again and I cannot figure out what kind of assertions I asked you to consider.
You already do check the ticklabels at the colorbar against the expectations.
Sorry for the noise.
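For reference, a check of the colorbar tick labels along the lines discussed here might look like the sketch below. It assumes a pandas version that includes this feature (1.3 or later) and that the colorbar is reachable through the scatter collection's colorbar attribute. It is not the test from the PR.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

categories = ["versicolor", "virginica", "setosa"]
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"],
    categories=categories,
    ordered=True,
)

ax = df.plot.scatter(x="length", y="width", c="species")

# The scatter's collection keeps a reference to the colorbar drawn for it.
(collection,) = ax.collections
colorbar = collection.colorbar
tick_labels = [label.get_text() for label in colorbar.ax.get_ymajorticklabels()]
print(tick_labels)
```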

@MarcoGorelli
Member Author

MarcoGorelli commented Jan 17, 2021

Thanks for reviews, have updated.

Looking through the thread, I think there's still no consensus on what to plot in the unordered categorical case.

Legend (as in the first plot from #34293 (comment)), or colorbar (so, no difference from the ordered categorical case)? The current set of commits has the latter, i.e.
image


IMO a discrete colorbar in both the ordered and unordered cases minimises the risk of confusion, and I think it looks alright
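A discrete colorbar of this kind can be sketched in plain matplotlib as follows. This is an illustration under assumed choices (resampling viridis to one color per category, half-integer boundaries), not the implementation merged in this PR.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import BoundaryNorm, ListedColormap

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

codes = df["species"].cat.codes.to_numpy()
categories = list(df["species"].cat.categories)
n = len(categories)

# One color per category, sampled evenly from the continuous colormap.
cmap = ListedColormap(plt.get_cmap("viridis")(np.linspace(0, 1, n)))
# Boundaries at -0.5, 0.5, ... so that each integer code sits mid-bin.
norm = BoundaryNorm(np.arange(n + 1) - 0.5, cmap.N)

fig, ax = plt.subplots()
points = ax.scatter(df["length"], df["width"], c=codes, cmap=cmap, norm=norm)
cbar = fig.colorbar(points, ax=ax, ticks=np.arange(n))
cbar.ax.set_yticklabels(categories)  # label each bin with its category name

bin_index = norm(codes).tolist()
print(bin_index)
```

The same code path serves ordered and unordered categoricals, since only the category order (and hence the codes) differs between them.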

@ivanovmg
Member

+1 on the colorbar option for both ordered and unordered.
I guess it will be more clear to the users.

@MarcoGorelli
Member Author

Cool, thanks

also may pay to add an example in visualization.rst

Agreed, though it's the kind of thing people like to contribute, so if it's OK I'd rather open it as a good first issue (once this PR's in) and mentor someone through it

@jreback jreback added this to the 1.3 milestone Jan 25, 2021
Contributor

@jreback jreback left a comment

can you update the visualization.rst with an example of using this

@@ -52,6 +52,7 @@ Other enhancements
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
- :meth:`DataFrame.apply` can now accept non-callable DataFrame properties as strings, e.g. ``df.apply("size")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
- :meth:`Series.apply` can now accept list-like or dictionary-like arguments that aren't lists or dictionaries, e.g. ``ser.apply(np.array(["sum", "mean"]))``, which was already the case for :meth:`DataFrame.apply` (:issue:`39140`)
- :meth:`DataFrame.plot.scatter` can now accept a categorical column as the argument to ``c`` (:issue:`31357`)
Contributor

let's mention the other issue here (and just close it too)

@MarcoGorelli
Member Author

Sure, added - here's what the new part in visualization.rst looks like:

image

Contributor

@jreback jreback left a comment

one tiny comment. ping on green

@@ -579,6 +582,19 @@ each point:
df.plot.scatter(x="a", y="b", c="c", s=50);


.. ipython:: python
Contributor

can you add a versionadded tag 1.3 here

@jreback jreback merged commit d0cfa03 into pandas-dev:master Jan 27, 2021
@jreback
Contributor

jreback commented Jan 27, 2021

thanks @MarcoGorelli

Development

Successfully merging this pull request may close these issues.

  • Using matplotlib scatter legends?
  • Scatter plot should have a discrete colorbar when 'c' is integer
7 participants