VIS/ENH Hexbin plot #5478

Merged
merged 1 commit into from
Feb 14, 2014

Conversation

TomAugspurger
Contributor

This is just 10 minutes of copy-paste cargo-culting to gauge interest, I haven't tested anything yet.

In [1]: df = pd.DataFrame(np.random.randn(1000, 2))

In [2]: df.plot(kind='hexbin', x=0, y=1)
Out[2]: <matplotlib.axes.AxesSubplot at 0x10eb76e10>

[image: hexbin plot]

It's not terribly difficult to do this on your own, so my feelings wouldn't be hurt at all if people are -1 on this :)

EDIT: oops I branched from my to_frame branch. I'll clean up the commits.

@jtratner
Contributor

jtratner commented Nov 9, 2013

looks interesting

@TomAugspurger
Contributor Author

Colorbar on by default?

[image: hexbin plot]

Also, apparently the default matplotlib colormap, jet, is widely despised. I'm looking around for a better default.

@TomAugspurger
Contributor Author

Thoughts on a default color palette? Seaborn seems to use cubehelix (screenshot)

[image: hexbin plot]

This may be helpful. @olgabot seems to use different cmaps for different data values. That may be a bit too much for pandas.

EDIT: oh I also think this should spend some time incubating on master. Once the RC branch is forked, I'll add some release notes for .14.

@TomAugspurger
Contributor Author

Oh, also, I hate the matplotlib argument names for the value (C) and the reduction function (reduce_C_function).

A brief overview:

By default (C=None), each bin shows a count of the points that fall in it. Essentially each (x, y) point in a bin has a value of 1, and the reduction function is sum (i.e. a count, since every value is 1).

If you specify C (another column of df for pandas), then each point is a 3-tuple (x, y, c): (x, y) still determines which bin the point ends up in, and c is the value passed to the reduction function reduce_C_function.

What's the policy on respecting other libraries' function arguments? Other pandas plotting functions seem to use colormap instead of matplotlib's cmap.

If we're willing to change, I'd suggest renaming C to value and reduce_C_function to reduce or reduce_func.

Here's an example to play with if that helps:

import numpy as np
from pandas import DataFrame

df = DataFrame({"A": np.random.uniform(size=20),
                "B": np.random.uniform(size=20),
                "C": np.arange(20) + np.random.uniform(size=20)})

ax = df.plot(kind='hexbin', x='A', y='B', gridsize=10)  # histogram by default
# color each bin by the largest value of column C among the points in it
ax = df.plot(kind='hexbin', x='A', y='B', C='C', reduce_C_function=np.max)

So in the second case the color of a bin will be determined by the maximum value of C (from column C in df) in that bin.
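The count-versus-reduction behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not matplotlib's implementation: it uses square bins on the unit square instead of hexagons, and `bin_and_reduce`, `cs`, and `reduce_func` are hypothetical names chosen for this sketch.

```python
import random
from collections import defaultdict


def bin_and_reduce(xs, ys, cs=None, reduce_func=sum, gridsize=4):
    """Group points into square bins and reduce the values in each bin.

    Simplification: square bins on the unit square stand in for hexagons.
    All names here are illustrative, not part of pandas or matplotlib.
    """
    if cs is None:
        # No value column: every point contributes 1, so sum() is a count.
        cs = [1] * len(xs)
    bins = defaultdict(list)
    for x, y, c in zip(xs, ys, cs):
        # (x, y) alone decides which bin the point lands in.
        # min() keeps a coordinate of exactly 1.0 inside the last bin.
        key = (min(int(x * gridsize), gridsize - 1),
               min(int(y * gridsize), gridsize - 1))
        bins[key].append(c)
    # c only matters here, when each bin's values are reduced to one number.
    return {key: reduce_func(values) for key, values in bins.items()}


random.seed(0)
xs = [random.random() for _ in range(20)]
ys = [random.random() for _ in range(20)]
cs = [i + random.random() for i in range(20)]

counts = bin_and_reduce(xs, ys)                       # analogous to C=None
maxima = bin_and_reduce(xs, ys, cs, reduce_func=max)  # analogous to C='C' with np.max
```

Both calls fill the same set of bins; only the number attached to each bin differs, which is why the choice of colormap depends on the reduction used.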

@olgabot

olgabot commented Nov 24, 2013

While cubehelix solves the saturation problem, it doesn't solve the issue that the rainbow colormap is harmful and does not help with interpretation of the heatmap. The linked paper gives great examples of how the changing colors introduce patterns in the data that don't actually exist and can lead to over-interpretation of color changes.

I don't think using different colormaps for different data ranges is that complicated. In prettyplotlib it's a simple check of the data's max and min: use a diverging colormap like blue-red if there are both negative and positive values, or a sequential colormap if there are only positive or only negative values. These colormaps are built into matplotlib.
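A rough sketch of that min/max check might look like the following. This is not prettyplotlib's actual code: `choose_colormap` is a hypothetical helper, and the specific names returned ("RdBu_r" for diverging, "Blues" for sequential) are assumptions picked from matplotlib's built-in colormaps.

```python
def choose_colormap(values):
    """Pick a colormap name from the sign pattern of the data.

    Hypothetical helper: the returned names are illustrative choices
    from matplotlib's built-in colormaps, not prettyplotlib's defaults.
    """
    lo, hi = min(values), max(values)
    if lo < 0 < hi:
        # Data straddles zero: a diverging (blue-red) map.
        return "RdBu_r"
    # Only non-negative or only non-positive values: a sequential map.
    return "Blues"


mixed = choose_colormap([-2.0, 0.5, 3.0])
positive_only = choose_colormap([0, 1, 5])
negative_only = choose_colormap([-3, -1])
```

The returned string could then be passed on as the colormap argument of a plotting call.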

I'm still working on my matplotlib install. I have some code for a clustered heatmap as well (like heatmap.2 in R) that I would use, but it depends on scipy, so I'm guessing I'd have to implement hierarchical clustering myself to reduce pandas's dependencies. Plus as a data science/bioinformatics purist I'd want it to have optimal leaf ordering, too. This is the default in MATLAB, but not in R or scipy.

As for renaming, I agree that the matplotlib arguments are cryptic but I personally prefer consistency over introducing new standards: http://xkcd.com/927/

@mwaskom
Contributor

mwaskom commented Nov 25, 2013

I set the matplotlib default colormap to cubehelix in seaborn because I wanted to banish jet and that seemed like a reasonably non-crappy alternative that is at least somewhat adequate for most data. I don't think the hue shifts are as big a problem in cubehelix as in rainbow maps and jet because they're pretty gradual and accompanied by a shift in lightness/saturation.

But in general I agree that there's no good one-size-fits-all solution to colormaps, and it's better to adapt to the data. For the corrplot function the default colormap is coolwarm, which I tend to prefer a bit to RdBu as the extreme values aren't quite as dark.

@TomAugspurger
Contributor Author

Thanks for the feedback! I'll think on this, but adapting to the data would be complicated by the business with C and reduce_C_function. That's all handled on the matplotlib side, so I won't know the range of the values when I hand everything off to matplotlib's hexbin.

The default behavior is to just do counts, so things will always be positive in that case. It might make sense just to pick the default with that case in mind, and assume that people doing fancier things can choose an appropriate color palette.

@TomAugspurger
Contributor Author

@olgabot scipy is a soft dependency for pandas, so that should be OK. I think the Gaussian KDE plot also depends on scipy.

@mwaskom
Contributor

mwaskom commented Nov 25, 2013

Ah, yeah, if the data are counts then it probably makes sense to use a sequential map; any of the colorbrewer ones should work.

@TomAugspurger
Contributor Author

I think this is ready to go. Thanks for the suggestions on the colormaps. I'm going with BuGn as the default since the default aggregation is by counts, which will always be nonnegative. If you're using a different aggregation then you should choose the appropriate colormap.

One thing I'm a tad concerned about is the docstring on df.plot. I threw in a Notes section with things specific to hexbin. I wonder if we should put explanations in a separate function like scipy.integrate.quad_explain() does. Or I could remove that and just point people to the html docs.

Let me know when you're ready and I'll rebase and squash.

@jreback
Contributor

jreback commented Dec 3, 2013

@TomAugspurger you want to throw in 0.13?

pls change the release notes, rebase, and squash...

@jreback
Contributor

jreback commented Dec 3, 2013

if you think the API might change, then just label it experimental (if you want)

@jreback
Contributor

jreback commented Dec 3, 2013

does this have an associated issue?

@TomAugspurger
Contributor Author

No issue associated. I just put the PR number in the release notes. I'm pretty confident the API won't need adjusting, but I guess labeling it experimental is the safe thing to do.

Ready when you and Travis say it's good!

@jreback
Contributor

jreback commented Dec 3, 2013

perfect

@cpcloud @jtratner @y-p any comments?

@ghost

ghost commented Dec 4, 2013

This looks nice.

@TomAugspurger, do you know if hexbin plots are available via yhat/ggplot?
If not, it'd be great to help them grow the library; they're building it on top of pandas already.

@jreback
Contributor

jreback commented Dec 5, 2013

looks ready to merge...any objections? @TomAugspurger @y-p @jtratner ?

@ghost

ghost commented Dec 5, 2013

Merging new features between RC and final?

@jreback
Contributor

jreback commented Dec 5, 2013

seems non-invasive to me

@TomAugspurger
Contributor Author

I'm not in a rush to get this in if that's a problem.

@jreback
Contributor

jreback commented Dec 7, 2013

@TomAugspurger push this to 0.14, @y-p?

@TomAugspurger
Contributor Author

Fine by me. Once .13 is out I'll make the doc changes and ping you when I get that done (could be a little while).

@ghost

ghost commented Dec 7, 2013

Yup, and to stay on track we should be more aggressive about tagging 0.13 final when it's time and moving on to 0.14; release channels can be async to that. Just a heads-up to pandas-dev when the bug queue is empty and green across platforms.

@ghost

ghost commented Dec 16, 2013

@TomAugspurger, so is this plot available out of the box with seaborn which is built on top of pandas?

@olgabot

olgabot commented Dec 16, 2013

Not yet:
http://stanford.edu/~mwaskom/software/seaborn/plotting_distributions.html


Olga Botvinnik
PhD Program in Bioinformatics and Systems Biology
Gene Yeo Laboratory | Sanford Consortium for Regenerative Medicine
University of California, San Diego
olgabotvinnik.com
blog.olgabotvinnik.com
github.com/olgabot

@ghost

ghost commented Jan 7, 2014

I resisted other PRs for more sophisticated plots but this is closer to home, should be ok.

@TomAugspurger
Contributor Author

I've been a tad worried about inconsistency in what vis PRs are accepted as well, so I haven't wanted to push to get this in. That said I think this should go in since I see hexbin plots as a drop-in replacement for scatter plots when you have a bunch of points.

Plus Wes tweeted about it, and we can't make him a liar...

@jreback
Contributor

jreback commented Jan 7, 2014

hahah.!

ok....

pls rebase and put in release notes (0.13.1)..thxs

@@ -93,6 +93,8 @@ Experimental Features
- Added PySide support for the qtpandas DataFrameModel and DataFrameWidget.
- Added :mod:`pandas.io.gbq` for reading from (and writing to) Google
BigQuery into a DataFrame. (:issue:`4140`)
- Hexagonal bin plots from ``DataFrame.plot`` with ``kind='hexbin'``
Contributor

moved to 0.13.1 section

Contributor

Can you move this to 0.14.0 section?

@@ -623,6 +623,7 @@ Enhancements
output datetime objects should be formatted. Datetimes encountered in the
index, columns, and values will all have this formatting applied. (:issue:`4313`)
- ``DataFrame.plot`` will scatter plot x versus y by passing ``kind='scatter'`` (:issue:`2215`)
- Hexagonal bin plots from ``DataFrame.plot`` with ``kind='hexbin'`` (:issue:`5478`)
Contributor

move to v0.13.1.txt

Contributor

maybe use the docs example here (it's kind of cool) (and possibly put a link to the doc section as well)

@ghost

ghost commented Jan 7, 2014

I still think 0.14.0... don't make me out to be a softie.

@TomAugspurger
Contributor Author

I threw everything in .14, so this should be good to sit until we start on that.

I'm having trouble building the docs right now (on master too, not just this branch). I'll try to track down what's wrong with my environment and then I'll let you know that everything builds correctly.

@TomAugspurger
Contributor Author

ping @jreback if you're ready to merge.

@jreback
Contributor

jreback commented Feb 14, 2014

@TomAugspurger trivial release notes change..then good 2 go

@TomAugspurger
Contributor Author

@jreback Fixed that. I removed the experimental tag too.

jreback added a commit that referenced this pull request Feb 14, 2014
jreback merged commit cab2a93 into pandas-dev:master Feb 14, 2014
@jreback
Contributor

jreback commented Feb 14, 2014

gr8 thanks @TomAugspurger !

TomAugspurger deleted the hexbin-plot branch November 3, 2016 12:37