ENH: categorical scatter plot #34293

MarcoGorelli · 2020-05-21T15:16:01Z

closes Scatter plot should have a discrete colorbar when 'c' is integer #12380 (this idea was mentioned as a comment here). Also closes Using matplotlib scatter legends? #31357 (I hadn't found that issue when I first opened this)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Here's what this would look like:

Was just hoping to get initial feedback on whether this would be welcome

Some things I still need to do:

add docstring / annotations to _plot_colorbar
test_fast.sh passes locally but there are some CI failures, need to reproduce / figure out why (those tests pass now, failure was probably unrelated)
add an example of this functionality to the docs
if ints are passed, treat same way as categorical

WillAyd · 2020-05-21T16:03:56Z

Is passing a string argument to c non-ambiguous?

MarcoGorelli · 2020-05-21T16:07:40Z

Is passing a string argument to c non-ambiguous?

It's already supported, see this example from the docs:

However, currently, it's required that the column be numeric. This PR would add support for the categorical case.

WillAyd

Looks generally good. A few comments @TomAugspurger

WillAyd · 2020-05-21T17:31:48Z

pandas/plotting/_matplotlib/core.py

@@ -965,7 +967,10 @@ def _make_plot(self):
        elif color is not None:
            c_values = color
        elif c_is_column:
-            c_values = self.data[c].values
+            if not is_categorical_dtype(self.data[c]):


nit but instead of not is_categorical_dtype can you leave this as is_categorical_dtype and switch the branches?

WillAyd · 2020-05-21T17:35:38Z

pandas/tests/plotting/test_frame.py

+        df["species"] = pd.Categorical(
+            ["setosa", "setosa", "virginica", "virginica", "versicolor"]
+        )
+        _check_plot_works(df.plot.scatter, x=0, y=1, c="species")


Is there a way within the test to check the legend is drawn and the appropriate colors are applied?

charlesdong1991 · 2020-05-21T17:36:32Z

emm, I think c is used to assign values for color points, the usage here looks more like showing labels for scatter points to me? and I am not sure if this will bring confusions to users.

what do you think?

WillAyd · 2020-05-21T17:37:00Z

So this covers unordered well - do we need to make any considerations for ordered categoricals as well?

MarcoGorelli · 2020-05-21T18:13:55Z

emm, I think c is used to assign values for color points, the usage here looks more like showing labels for scatter points to me? and I am not sure if this will bring confusions to users.

what do you think?

This functionality already exists though, see the last example from here where they pass

ax2 = df.plot.scatter(x='length',
                      y='width',
                      c='species',
                      colormap='viridis')

The only difference is that currently, you can only pass c='species' if species is numeric. This PR (once I sort out the errors) would allow you to do c='species' if species is categorical, and then you'd get a discrete legend instead of a continuous colormap.

So this covers unordered well - do we need to make any considerations for ordered categoricals as well?

This should work with both, but in the issue linked above (#12380) they mention using discrete colorbars for ordered categoricals. That's a possibility, though that would make this PR bigger - perhaps it's best to have one PR deal with categoricals, and then another one to split between ordered and unordered?

Motivation for working on this comes from how in ggplot2 you can do

ggplot(data=Iris,aes(x=SepalWidthCm, y=SepalLengthCm,color=Species)) + geom_point()

and it works even if Species isn't numeric

charlesdong1991 · 2020-05-21T18:26:10Z

This functionality already exists though

yeah, I know the functionality does exist, but based on the definition:

A column name or position whose values will be used to color the marker points according to a colormap.

Using numeric values is because they are used to select colors from colormap. but if uses categorical values, i think it will only use the first N colors from colormap (N is the number of categories). So this doc should be changed and well documented to avoid confusion I think.

And @WillAyd has a good point on sorted categorical values, and in this case, the colors selected from colormap are not the first N, but based on the order, so would be nice to test for those cases.

MarcoGorelli · 2020-05-21T18:28:30Z

@charlesdong1991 ah I see what you mean now, thanks for clarifying. Yes, that's a good point, will think about this

charlesdong1991 · 2020-05-21T19:03:30Z

you are welcome @MarcoGorelli

~~I think if to keep PR small (and get merged faster ^^) and to support categorical values for c~~, I would suggest focusing on supporting sorted categorical values first, because it's very natural thinking that the order can be considered as a way to select different colors from colormap.

sorry, nvm, this PR has supported unordered case, so above does not make sense, sorry again for the noise

MarcoGorelli · 2020-05-22T10:25:30Z

Here's what I'm currently working on: if we start with

import matplotlib as mpl
import numpy as np

import pandas as pd

CMAP = "viridis"
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)

then we get the following:

unordered categorical
ordered categorical
numeric (no change to current behaviour)

Will try to get a commit in with tests by the end of the day, but if the above screenshots indicate I'm on the wrong path please do let me know

charlesdong1991 · 2020-05-22T11:36:59Z

I think, the second plot looks pretty good, and should be what it looks like if supporting categorical values.

And I have mixed feelings about the first one: from a user perspective, this will confuse users because the behaviour of unsorted categorical values is different from the others for the c argument, and also, the color bar is gone (currently, they should expect to see a colorbar if they explicitly assign values).

So IMHO, for categorical features, sorted and unsorted, they should both have the second plot look. And the difference between those two, will be what colors the categories will represent in colormap based on the order.

jorisvandenbossche · 2020-05-22T14:19:10Z

I think it would in general be very nice to support "categorical" data (whether it is actually categorical dtype, or with non-numerical dtype) in a scatter plot.

But some comments:

IMO, we should not use a colorbar, as this is typically for a numerical scale (whether continuous or discrete colors). So I think we should rather use a legend than a colorbar (like the first plot in ENH: categorical scatter plot #34293 (comment)). But @charlesdong1991 do I understand you correctly that you would prefer the colorbar in all cases?
We should also not use a numerical colormap (like viridis), but rather the discrete color cycle from matplotlib in this case, I think?
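
To make the legend option concrete, here is a minimal sketch using matplotlib directly rather than pandas plotting. It is not code from this PR: the per-group loop, the variable names, and the `Agg` backend are assumptions for illustration, and the data values are the ones used in the examples in this thread.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

fig, ax = plt.subplots()
# One scatter call per category: each call takes the next color from the
# default property cycle and contributes one legend entry.
for name, group in df.groupby("species", observed=True):
    ax.scatter(group["length"], group["width"], label=name)
ax.legend(title="species")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

With this approach no colormap is involved at all, which is the trade-off being debated: the legend reads naturally for unordered categories, but `c` no longer means "pick a color from the colormap".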

charlesdong1991 · 2020-05-22T14:40:41Z

@jorisvandenbossche I agree that it's also an option to have legend than colorbar for categorical values.

But the main concern is consistency of the meaning and expectation of c, right now, they assign values (either array or column name or other allowed input), it will look for colors from colormap or specific colors and has colorbar to show out. But would be weird to users if the categorical features, they see legend.

So, if using legend, this change will need to be well documented for c argument to reduce confusion, and will also probably impact colormap. And maybe should also think if we should assign it to c or not, because it is plotting legends, if so, probably assigning to label sounds more aligned with other plots api.

charlesdong1991 · 2020-05-22T14:54:07Z

emm, sorry, i think maybe legend looks a bit better in case of categorical values. i just tried out to use specific discrete colors to assign to c using the example here, and actually the plot does not look nice. Not sure if it is better to have legends or to have discrete colors in color bar here.

The color bar is there but does not make sense, so probably we should make changes for such cases too (another PR though).

MarcoGorelli · 2020-05-22T15:13:08Z

i think maybe legend looks a bit better in case of categorical values

Is that for both ordered and unordered cases?

The color bar is there but does not make sense, so probably we should make changes for such cases too (another PR though).

OK, have opened #34316.

charlesdong1991 · 2020-05-22T18:59:17Z

In your last example, the column passed to c is a column with color names, and we use those to color the points

yeah, right now, c accepts quite wide range of cases, and color referred by names is one of the allowed cases. It's allowing color name, or color code, or a list of them, or column name/loc whose numeric values which points to the loc of colors in the colormap.

distinguish the case where the passed "categorical" values are 1) color names or

I think color names and color codes are natively and widely supported in matplotlib, so not sure if we even need to put effort into distinguishing them; in other words, no matter if it is already categorical type or general object type, they should be plotted correctly. Therefore, I think, even without this PR, the code below should also work (haven't tried it out though, might be wrong):

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)

df['specifics'] = pd.Categorical(['r', 'b', 'b', 'r', 'g'])
df.plot.scatter(x=0, y=1, c='specifics')

discrete non-numeric values that we want to use default colors for

I, as a user, the c is about colors, and that's what we want to use to color the points, that's why I had some initial concerns about what we should accept here.

I think the feature @MarcoGorelli wants to implement here is when categorical values are not color names which matplotlib cannot recognise, e.g. random category names. And we need to figure out how they could map to colors. Therefore, to me, supporting sorted categorical values are more intuitive, because order can be viewed as discrete numbers, and we could use them to find color in colormap (that's how they use numeric values to find colors, but difference is now in this case, the numeric values are discrete, so its kind of a special case of numeric values). Thus, I would prefer to see discrete names in color bar to be plotted (as in his second plot), which could presents the link between the random category name and the colormap. And for unsorted categorical values, we could just treat them as the order is the order of appearance, like how we do to unsorted objects.

So above is the main reason why I would prefer to have discrete values in colormap for categorical cases than having them as legends (which also might bring in ambiguity to users) although they might not look as nice as legends in the plot

Sorry to put so many words, and will be happy to discuss if you have further questions. @jorisvandenbossche

MarcoGorelli · 2020-05-24T10:03:50Z

I think the feature @MarcoGorelli wants to implement here is when categorical values are not color names which matplotlib cannot recognise, e.g. random category names. And we need to figure out how they could map to colors.

Yes, exactly - I was thinking that the mapping could be done using the codes attribute.
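
As a small illustration of the mapping mentioned above, the sketch below shows how the `codes` of a `pd.Categorical` depend on the category order. The variable names are made up for this example; only `pd.Categorical` and its `codes` attribute come from pandas.

```python
import pandas as pd

values = ["setosa", "setosa", "virginica", "virginica", "versicolor"]

# A code is the position of the value in `categories`, so the category
# order decides which integer (and hence which color) each value gets.
alphabetical = pd.Categorical(
    values, categories=["setosa", "versicolor", "virginica"], ordered=True
)
reordered = pd.Categorical(
    values, categories=["versicolor", "virginica", "setosa"], ordered=True
)

codes_alphabetical = alphabetical.codes.tolist()
codes_reordered = reordered.codes.tolist()
print(codes_alphabetical)  # [0, 0, 2, 2, 1]
print(codes_reordered)     # [2, 2, 1, 1, 0]
```

The same values therefore land on different colors once the categories are reordered, which is why the ordered case is worth testing separately.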

Anyway, I'll wait for further direction about whether a legend or a discrete colorbar would be preferred before adding further commits - thanks for the discussion/input so far!

MarcoGorelli · 2020-07-05T07:42:36Z

Hi all - any update on whether pandas would like to go ahead with this, and whether a discrete colorbar or a legend would be preferable?

jreback

looks reasonable to me, can you add a whatsnew note in other enhancements for 1.3

also may pay to add an example in visualization.rst

jreback · 2021-01-03T18:15:15Z

pandas/plotting/_matplotlib/core.py

@@ -440,7 +441,9 @@ def _compute_plot_data(self):
        if is_empty:
            raise TypeError("no numeric data to plot")

-        self.data = numeric_data.apply(self._convert_to_ndarray)
+        self.data = numeric_data.apply(


can you just change _convert_to_ndarray to handle this case? (and mention in the doc-string)

jreback · 2021-01-03T18:22:10Z

cc @charlesdong1991 @ivanovmg @TomAugspurger if any comments.

ivanovmg

Looks good to me.
Couple of comments from my side.

ivanovmg · 2021-01-11T11:29:51Z

pandas/plotting/_matplotlib/core.py

            # The workaround below is no longer necessary.
-            return


Is the comment in the line above still relevant?

# The workaround below is no longer necessary.

I think so, yes, it's just saying that

points = ax.get_position().get_points()
cbar_points = cbar.ax.get_position().get_points()
cbar.ax.set_position(
    [
        cbar_points[0, 0],
        points[0, 1],
        cbar_points[1, 0] - cbar_points[0, 0],
        points[1, 1] - points[0, 1],
    ]
)
# To see the discrepancy in axis heights uncomment
# the following two lines:
# print(points[1, 1] - points[0, 1])
# print(cbar_points[1, 1] - cbar_points[0, 1])

return cbar

is no longer necessary if mpl is of a modern enough version

ivanovmg · 2021-01-11T11:38:32Z

pandas/plotting/_matplotlib/core.py

        elif c_is_column:
-            c_values = self.data[c].values
+            if color_by_categorical:
+                c_values = self.data[c].cat.codes
+            else:
+                c_values = self.data[c].values


Maybe slightly modify the logic here?

...
elif color_by_categorical:
    c_values = self.data[c].cat.codes
elif c_is_column:
    c_values = self.data[c].values
else:
    c_values = c

This allows one to eliminate the inner if-else statement.

yes, good catch, thank you!
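
For readers following along, the flattened chain can be tried outside pandas with a stand-alone sketch. `resolve_c_values` and its `default_color` parameter are hypothetical names invented here, and the column check is simplified to string column names; this is not the pandas implementation, only the same branch order as in the suggestion above.

```python
import pandas as pd


def resolve_c_values(data, c=None, color=None, default_color="C0"):
    """Hypothetical stand-in for the branch chain discussed above."""
    # Simplification: only string column names count as "c is a column".
    c_is_column = isinstance(c, str) and c in data.columns
    color_by_categorical = c_is_column and isinstance(
        data[c].dtype, pd.CategoricalDtype
    )
    if c is not None and color is not None:
        raise TypeError("Specify exactly one of `c` and `color`")
    elif c is None and color is None:
        return default_color
    elif color is not None:
        return color
    elif color_by_categorical:
        # Categorical column: use the integer codes as the color values.
        return data[c].cat.codes
    elif c_is_column:
        return data[c].values
    else:
        return c


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "kind": pd.Categorical(["a", "b", "a"])})
print(resolve_c_values(df, c="kind").tolist())  # [0, 1, 0]
print(resolve_c_values(df, c="x").tolist())     # [1.0, 2.0, 3.0]
print(resolve_c_values(df, color="red"))        # red
```

Checking `color_by_categorical` before `c_is_column` is what removes the nested if-else: a categorical column is also a column, so the more specific branch has to come first.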

ivanovmg · 2021-01-11T11:41:40Z

pandas/tests/plotting/frame/test_frame.py

+    @pytest.mark.parametrize("ordered", [True, False])
+    @pytest.mark.parametrize(
+        "categories",
+        (["setosa", "versicolor", "virginica"], ["versicolor", "virginica", "setosa"]),
+    )
+    def test_scatterplot_color_by_categorical(self, ordered, categories):


Is there a way to add assertions on the colorbar ticklabels for the ordered case?

I couldn't find a simple way to do this using the public API from mpl, but can look into it further (or do you have a suggestion?)

I look at my comment once again and I cannot figure out what kind of assertions I asked you to consider.
You already do check the ticklabels at the colorbar against the expectations.
Sorry for the noise.
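
For reference, a rough sketch of how the colorbar tick labels can be read back for such a check. It assumes a pandas version that includes this PR (1.3 or later) and a non-interactive matplotlib backend; the variable names are illustrative and this is not the test added in the PR.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

categories = ["setosa", "versicolor", "virginica"]
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"],
    categories=categories,
    ordered=True,
)

ax = df.plot.scatter(x="length", y="width", c="species")

# The scatter adds a single collection to the axes; matplotlib stores the
# attached colorbar on that collection's `colorbar` attribute.
(collection,) = ax.collections
colorbar = collection.colorbar
ticklabels = [label.get_text() for label in colorbar.ax.get_ymajorticklabels()]
print(ticklabels)
```

The labels are expected to follow the order of `categories`, so parametrizing over a reordered category list exercises the ordered case.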

MarcoGorelli · 2021-01-17T11:27:04Z

Thanks for reviews, have updated.

Looking through the thread, I think there's still no consensus on what to plot in the unordered categorical case.

Legend (as in the first plot from #34293 (comment)), or colorbar (so, no difference to the ordered categorical case)? The current set of commits has the latter, i.e.

IMO a discrete colorbar in both the ordered and unordered cases minimises the risk of confusion, and I think it looks alright
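
A minimal sketch of one way to build such a discrete colorbar with matplotlib alone, assuming integer codes per point (for example from `.cat.codes`). It uses `BoundaryNorm` to bin the codes and is meant as an illustration of the idea, not as the exact code in this PR; the names and the `Agg` backend are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors

categories = ["setosa", "versicolor", "virginica"]
codes = np.array([0, 0, 2, 2, 1])  # one integer code per point
x = np.array([5.1, 4.9, 7.0, 6.4, 5.9])
y = np.array([3.5, 3.0, 3.2, 3.2, 3.0])

n_cats = len(categories)
cmap = plt.get_cmap("viridis")
# Bin edges 0, 1, ..., n_cats: code k falls in the bin [k, k + 1).
bounds = np.linspace(0, n_cats, n_cats + 1)
norm = colors.BoundaryNorm(bounds, cmap.N)

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=codes, cmap=cmap, norm=norm)
cbar = fig.colorbar(points, ax=ax)
# Put one tick at the centre of each bin and label it with the category name.
tick_positions = np.linspace(0.5, n_cats - 0.5, n_cats)
cbar.set_ticks(tick_positions)
cbar.ax.set_yticklabels(categories)
```

Because the colors are still drawn from the colormap, `c` and `colormap` keep their existing meaning, which is the consistency argument made earlier in the thread.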

ivanovmg · 2021-01-18T10:25:27Z

+1 on the colorbar option for both ordered and unordered.
I guess it will be more clear to the users.

MarcoGorelli · 2021-01-19T08:27:29Z

Cool, thanks

also may pay to add an example in visualization.rst

Agreed, though it's the kind of thing people like to contribute, so if it's OK I'd rather open it as a good first issue (once this PR's in) and mentor someone through it

jreback

can you update the visualization.rst with an example of using this

jreback · 2021-01-25T16:45:15Z

doc/source/whatsnew/v1.3.0.rst

@@ -52,6 +52,7 @@ Other enhancements
 - :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
 - :meth:`DataFrame.apply` can now accept non-callable DataFrame properties as strings, e.g. ``df.apply("size")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
 - :meth:`Series.apply` can now accept list-like or dictionary-like arguments that aren't lists or dictionaries, e.g. ``ser.apply(np.array(["sum", "mean"]))``, which was already the case for :meth:`DataFrame.apply` (:issue:`39140`)
+- :meth:`DataFrame.plot.scatter` can now accept a categorical column as the argument to ``c`` (:issue:`31357`)


let's mention the other issue here (and just close it too)

MarcoGorelli · 2021-01-26T21:13:54Z

Sure, added - here's what the new part in visualization.rst looks like:

jreback

one tiny comment. ping on green

jreback · 2021-01-27T14:05:41Z

doc/source/user_guide/visualization.rst

@@ -579,6 +582,19 @@ each point:
   df.plot.scatter(x="a", y="b", c="c", s=50);


+.. ipython:: python


can you add a versionadded tag 1.3 here

jreback · 2021-01-27T15:47:52Z

thanks @MarcoGorelli

MarcoGorelli added 8 commits May 21, 2020 15:49

add legend with colors if coloring by categorical

d8d8c3c

add legend with colors if coloring by categorical

a167313

add legend with colors if coloring by categorical

d10c5c1

add legend with colors if coloring by categorical

7294aa2

add legend with colors if coloring by categorical

50cd05f

add legend with colors if coloring by categorical

4579142

add test

6d3fe9e

revert empty line

3846804

WillAyd added the Visualization plotting label May 21, 2020

WillAyd reviewed May 21, 2020

MarcoGorelli added 3 commits May 22, 2020 11:43

discrete colorbar in case of ordered categorical

ef4b03d

cleanup

cf218f7

cleanup

7aae164

MarcoGorelli added 3 commits May 22, 2020 13:49

plot colorbar in both cases

0ab903d

update test

b93ddf1

update test

5fbc117

dsaxton added the Needs Review label Sep 16, 2020

dsaxton mentioned this pull request Sep 16, 2020

CI: Add stale PR action #36336

Merged

MarcoGorelli added 2 commits January 3, 2021 15:57

Merge remote-tracking branch 'upstream/master' into categorical-scatter

572ecfc

🎨

b2a8b28

jreback requested changes Jan 3, 2021

ivanovmg reviewed Jan 11, 2021

MarcoGorelli added 3 commits January 17, 2021 09:12

Merge remote-tracking branch 'upstream/master' into categorical-scatter

efaaae6

simplify logic

b0a8cfa

whatsnew entry

4c15a83

Merge remote-tracking branch 'upstream/master' into categorical-scatter

b65b103

Merge remote-tracking branch 'upstream/master' into categorical-scatter

6560bb0

jreback added this to the 1.3 milestone Jan 25, 2021

jreback added Enhancement and removed Needs Review labels Jan 25, 2021

jreback requested changes Jan 25, 2021

add example to visualisation

cce3461

jreback approved these changes Jan 27, 2021

add versionadded tag

6e01091

jreback merged commit d0cfa03 into pandas-dev:master Jan 27, 2021

MarcoGorelli deleted the categorical-scatter branch January 27, 2021 16:04

Andy-Grigg mentioned this pull request Feb 3, 2022

BUG: df.plot.scatter() with norm keyword fails with 'multiple values for keyword argument' TypeError #45809

Closed

3 tasks
