ENH: categorical scatter plot #34293
Is passing a string argument to |
It's already supported, see this example from the docs: However, currently, it's required that the column be numeric. This PR would add support for the categorical case. |
Looks generally good. A few comments @TomAugspurger
pandas/plotting/_matplotlib/core.py
Outdated
@@ -965,7 +967,10 @@ def _make_plot(self): | |||
elif color is not None: | |||
c_values = color | |||
elif c_is_column: | |||
c_values = self.data[c].values | |||
if not is_categorical_dtype(self.data[c]): |
nit, but instead of not is_categorical_dtype, can you leave this as is_categorical_dtype and switch the branches?
pandas/tests/plotting/test_frame.py
Outdated
df["species"] = pd.Categorical( | ||
["setosa", "setosa", "virginica", "virginica", "versicolor"] | ||
) | ||
_check_plot_works(df.plot.scatter, x=0, y=1, c="species") |
Is there a way within the test to check that the legend is drawn and the appropriate colors are applied?
emm, I think what do you think? |
So this covers unordered well - do we need to make any considerations for ordered categoricals as well? |
This functionality already exists though, see the last example from here where they pass
ax2 = df.plot.scatter(x='length',
                      y='width',
                      c='species',
                      colormap='viridis')
The only difference is that currently, you can only pass
This should work with both, but in the issue linked above (#12380) they mention using discrete colorbars for ordered categoricals. That's a possibility, though that would make this PR bigger - perhaps it's best to have one PR deal with categoricals, and then another one to split between ordered and unordered?
Motivation for working on this comes from how in
ggplot(data=Iris,aes(x=SepalWidthCm, y=SepalLengthCm,color=Species)) + geom_point()
and it works even if |
yeah, I know the functionality does exist, but based on the definition:
Numeric values are used because they select colors from the colormap. With categorical values, I think it will only use the first N colors from the colormap (N being the number of categories), so this doc should be changed and well documented to avoid confusion. And @WillAyd has a good point about sorted categorical values: in that case the colors selected from the colormap are not the first N but depend on the order, so it would be nice to test those cases. |
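To make the ordering point concrete, here is a small sketch in plain pandas (not the PR's code). It shows that the integer codes behind a categorical follow the declared category order rather than the order of appearance, so if colors are looked up by code, changing the category order changes which color each species gets. The species values are borrowed from the test in this PR; the variable names are only illustrative.

```python
import pandas as pd

values = ["setosa", "setosa", "virginica", "virginica", "versicolor"]

# Same data, two different declared category orders.
alphabetical = pd.Categorical(
    values, categories=["setosa", "versicolor", "virginica"], ordered=True
)
reordered = pd.Categorical(
    values, categories=["versicolor", "virginica", "setosa"], ordered=True
)

# The codes index into the categories, so they differ between the two.
codes_alphabetical = alphabetical.codes.tolist()
codes_reordered = reordered.codes.tolist()
print(codes_alphabetical)
print(codes_reordered)
```

A test that only uses one category order would not notice if the colors were accidentally assigned by order of appearance, which is why parametrizing over category orders seems worthwhile.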
@charlesdong1991 ah I see what you mean now, thanks for clarifying. Yes, that's a good point, will think about this |
you are welcome @MarcoGorelli
sorry, nvm, this PR already supports the unordered case, so the above does not make sense, sorry again for the noise |
Here's what I'm currently working on: if we start with
import matplotlib as mpl
import numpy as np
import pandas as pd
CMAP = "viridis"
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
then we get the following: Will try to get a commit in with tests by the end of the day, but if the above screenshots indicate I'm on the wrong path please do let me know |
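For readers who cannot see the screenshots, here is a rough sketch of how a discrete colorbar for a categorical column can be drawn with plain matplotlib. It is an illustration under assumptions, not necessarily what the PR's commits do: the species column is borrowed from the test in this PR, and the use of BoundaryNorm with ticks at the bin centres is one possible approach.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import BoundaryNorm

CMAP = "viridis"
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
# Assumed column, taken from the test data in this PR.
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

categories = df["species"].cat.categories
n_cats = len(categories)

# One colorbar bin per category: boundaries at 0, 1, ..., n_cats.
cmap = plt.get_cmap(CMAP)
bounds = np.linspace(0, n_cats, n_cats + 1)
norm = BoundaryNorm(bounds, cmap.N)

fig, ax = plt.subplots()
points = ax.scatter(
    df["length"], df["width"], c=df["species"].cat.codes, cmap=cmap, norm=norm
)

# Put one tick in the middle of each bin and label it with the category name.
cbar = fig.colorbar(points, ax=ax)
cbar.set_ticks(np.linspace(0.5, n_cats - 0.5, n_cats))
cbar.ax.set_yticklabels(categories)

fig.canvas.draw()
tick_labels = [t.get_text() for t in cbar.ax.get_yticklabels()]
print(tick_labels)
```

The colorbar then shows category names instead of a continuous numeric scale, which is the look being discussed for the categorical case.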
I think the second plot looks pretty good, and is what it should look like when supporting categorical values. I have mixed feelings about the first one: from a user perspective, it will cause confusion because the behaviour of unsorted categorical values is different from the others for So IMHO, categorical features, sorted and unsorted, should both have the second plot's look. The difference between the two will be which colors in the colormap the categories represent, based on the order. |
I think it would in general be very nice to support "categorical" data (whether it is actually categorical dtype, or with non-numerical dtype) in a scatter plot. But some comments:
|
@jorisvandenbossche I agree that it's also an option to have legend than colorbar for categorical values. But the main concern is consistency of the meaning and expectation of So, if using legend, this change will need to be well documented for |
emm, sorry, I think maybe a legend looks a bit better for categorical values. I just tried assigning specific discrete colors to The color bar is there but does not make sense, so we should probably change such cases too (in another PR, though). |
Is that for both ordered and unordered cases?
OK, have opened #34316. |
yeah, right now,
I think color names and color codes are natively and widely supported in matplotlib, so I'm not sure we even need to put effort into distinguishing them; in other words, whether the column is already categorical dtype or general object dtype, they should be plotted correctly. Therefore, I think, even without this PR, the code below should also work (haven't tried it though, might be wrong):
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df['specifics'] = pd.Categorical(['r', 'b', 'b', 'r', 'g'])
df.plot.scatter(x=0, y=1, c='specifics')
I, as a user, the I think the feature @MarcoGorelli wants to implement here is for when the categorical values are not color names, so matplotlib cannot recognise them, e.g. random category names. We then need to figure out how they map to colors. To me, supporting sorted categorical values is more intuitive, because the order can be viewed as discrete numbers, which we can use to look up colors in the colormap (that's how numeric values are used to find colors; the difference is that here the numeric values are discrete, so it's a special case of numeric values). Thus, I would prefer to see the discrete names plotted in the color bar (as in his second plot), which presents the link between the random category name and the colormap. For unsorted categorical values, we could just treat the order as the order of appearance, like we do for unsorted objects. So the above is the main reason why I would prefer discrete values in the colorbar for categorical cases over legends (which might also be ambiguous to users), although they might not look as nice as legends in the plot. Sorry for so many words, and happy to discuss if you have further questions. @jorisvandenbossche |
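As a small illustration of the "order of appearance" idea for plain object columns, pd.factorize assigns integer codes in the order values are first seen, whereas converting to categorical dtype sorts the categories by default. This is only a sketch of the two orderings with made-up data, not a statement of what the PR implements.

```python
import pandas as pd

species = pd.Series(["virginica", "setosa", "virginica", "versicolor", "setosa"])

# Codes in order of first appearance.
appearance_codes, uniques = pd.factorize(species)
appearance_codes = appearance_codes.tolist()
appearance_order = list(uniques)

# Codes after conversion to categorical dtype: categories are sorted by default.
sorted_codes = species.astype("category").cat.codes.tolist()

print(appearance_order, appearance_codes)
print(sorted_codes)
```

Whichever ordering is chosen decides which colormap color each category ends up with, so it would need to be documented either way.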
Yes, exactly - I was thinking that the mapping could be done using the Anyway, I'll wait for further direction about whether a legend or a discrete colorbar would be preferred before adding further commits - thanks for the discussion/input so far! |
Hi all - any update on whether pandas would like to go ahead with this, and whether a discrete colorbar or a legend would be preferable? |
looks reasonable to me, can you add a whatsnew note in other enhancements for 1.3
also may pay to add an example in visualization.rst
pandas/plotting/_matplotlib/core.py
Outdated
@@ -440,7 +441,9 @@ def _compute_plot_data(self): | |||
if is_empty: | |||
raise TypeError("no numeric data to plot") | |||
|
|||
self.data = numeric_data.apply(self._convert_to_ndarray) | |||
self.data = numeric_data.apply( |
can you just change _convert_to_ndarray to handle this case? (and mention in the doc-string)
cc @charlesdong1991 @ivanovmg @TomAugspurger if any comments. |
Looks good to me.
Couple of comments from my side.
# The workaround below is no longer necessary. | ||
return |
Is the comment in the line above still relevant?
# The workaround below is no longer necessary.
I think so, yes, it's just saying that
points = ax.get_position().get_points()
cbar_points = cbar.ax.get_position().get_points()
cbar.ax.set_position(
    [
        cbar_points[0, 0],
        points[0, 1],
        cbar_points[1, 0] - cbar_points[0, 0],
        points[1, 1] - points[0, 1],
    ]
)
# To see the discrepancy in axis heights uncomment
# the following two lines:
# print(points[1, 1] - points[0, 1])
# print(cbar_points[1, 1] - cbar_points[0, 1])
return cbar
is no longer necessary if mpl is of a modern enough version
pandas/plotting/_matplotlib/core.py
Outdated
elif c_is_column: | ||
c_values = self.data[c].values | ||
if color_by_categorical: | ||
c_values = self.data[c].cat.codes | ||
else: | ||
c_values = self.data[c].values |
Maybe slightly modify the logic here?
...
elif color_by_categorical:
    c_values = self.data[c].cat.codes
elif c_is_column:
    c_values = self.data[c].values
else:
    c_values = c
This allows one to eliminate the inner if-else statement.
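The flattened branching can be tried out in isolation with a standalone sketch like the one below. The helper resolve_c_values is hypothetical (it does not exist in pandas), and the way c_is_column and color_by_categorical are computed here is an assumption made so the example is self-contained.

```python
import pandas as pd


def resolve_c_values(data, c, color=None):
    """Hypothetical helper mirroring the flattened if/elif chain."""
    # Assumed definitions of the two flags used in the review snippet.
    c_is_column = isinstance(c, str) and c in data.columns
    color_by_categorical = c_is_column and isinstance(
        data[c].dtype, pd.CategoricalDtype
    )

    if color is not None:
        return color
    elif color_by_categorical:
        # Categorical column: use the integer codes.
        return data[c].cat.codes
    elif c_is_column:
        # Any other column: use its raw values.
        return data[c].values
    else:
        # Not a column name: pass the value through unchanged.
        return c


df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"]
)

print(resolve_c_values(df, "species").tolist())
print(resolve_c_values(df, "width").tolist())
print(resolve_c_values(df, "red"))
```

The flattening is only equivalent to the nested version if color_by_categorical can never be true when c_is_column is false, which holds for the definitions assumed above.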
yes, good catch, thank you!
@pytest.mark.parametrize("ordered", [True, False]) | ||
@pytest.mark.parametrize( | ||
"categories", | ||
(["setosa", "versicolor", "virginica"], ["versicolor", "virginica", "setosa"]), | ||
) | ||
def test_scatterplot_color_by_categorical(self, ordered, categories): |
Is there a way to add assertions on the colorbar ticklabels for the ordered case?
I couldn't find a simple way to do this using the public API from mpl, but can look into it further (or do you have a suggestion?)
I look at my comment once again and I cannot figure out what kind of assertions I asked you to consider.
You already do check the ticklabels at the colorbar against the expectations.
Sorry for the noise.
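For reference, a check on the colorbar tick labels can be sketched roughly as follows. This assumes a pandas version that includes this feature (1.3 or later, per the whatsnew entry) with matplotlib installed, and it assumes the colorbar ticks sit at the centre of each category bin; it is not a copy of the test in the PR.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

categories = ["versicolor", "virginica", "setosa"]
df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)
df["species"] = pd.Categorical(
    ["setosa", "setosa", "virginica", "virginica", "versicolor"],
    categories=categories,
    ordered=True,
)

ax = df.plot.scatter(x="length", y="width", c="species", colormap="viridis")

# The scatter collection keeps a reference to the colorbar drawn for it.
colorbar = ax.collections[0].colorbar
ax.figure.canvas.draw()

result_ticks = colorbar.get_ticks()
result_labels = [t.get_text() for t in colorbar.ax.get_yticklabels()]
print(result_ticks, result_labels)
```

The labels are expected to follow the declared category order rather than alphabetical order, which is what parametrizing over categories exercises.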
Thanks for the reviews, have updated. Looking through the thread, I think there's still no consensus on what to plot in the unordered categorical case. Legend (as in the first plot from #34293 (comment)), or colorbar (so, no difference to the ordered categorical case)? The current set of commits has the latter, i.e. IMO a discrete colorbar in both the ordered and unordered cases minimises the risk of confusion, and I think it looks alright |
+1 on the colorbar option for both ordered and unordered. |
Cool, thanks
Agreed, though it's the kind of thing people like to contribute, so if it's OK I'd rather open it as a good first issue (once this PR's in) and mentor someone through it |
can you update the visualization.rst with an example of using this
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -52,6 +52,7 @@ Other enhancements | |||
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`) | |||
- :meth:`DataFrame.apply` can now accept non-callable DataFrame properties as strings, e.g. ``df.apply("size")``, which was already the case for :meth:`Series.apply` (:issue:`39116`) | |||
- :meth:`Series.apply` can now accept list-like or dictionary-like arguments that aren't lists or dictionaries, e.g. ``ser.apply(np.array(["sum", "mean"]))``, which was already the case for :meth:`DataFrame.apply` (:issue:`39140`) | |||
- :meth:`DataFrame.plot.scatter` can now accept a categorical column as the argument to ``c`` (:issue:`31357`) |
let's mention the other issue here (and just close it too)
one tiny comment. ping on green
@@ -579,6 +582,19 @@ each point: | |||
df.plot.scatter(x="a", y="b", c="c", s=50); | |||
|
|||
|
|||
.. ipython:: python |
can you add a versionadded tag 1.3 here
thanks @MarcoGorelli |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Here's what this would look like:
Was just hoping to get initial feedback on whether this would be welcome
Some things I still need to do:
_plot_colorbar
test_fast.sh passes locally but there are some CI failures, need to reproduce / figure out why (those tests pass now, failure was probably unrelated)