VIS: added ability to plot DataFrames and Series with errorbars #5638

Merged 1 commit on Mar 18, 2014

r-b-g-b
Contributor

@r-b-g-b r-b-g-b commented Dec 4, 2013

close #3796

Addresses some of the concerns in issue #3796. New code allows the DataFrame and Series line plot and bar plot functions to include errorbars using the xerr and yerr keyword arguments to DataFrame/Series.plot(). It supports specifying x and y errorbars as (1) a separate list, NumPy array, or Series, or (2) a DataFrame with the same column names as the plotting DataFrame. For example, method 2 looks like this:

import pandas as pd

df = pd.DataFrame({'x':[1, 2, 3], 'y':[3, 2, 1]})
df_xerr = pd.DataFrame({'x':[0.6, 0.2, 0.3], 'y':[0.4, 0.5, 0.6]})
df_yerr = pd.DataFrame({'x':[0.5, 0.4, 0.6], 'y':[0.3, 0.7, 0.4]})

df.plot(xerr=df_xerr, yerr=df_yerr)
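
For comparison, method 1 (errors passed as a separate list, array, or Series) might look like the sketch below. It assumes a pandas version that includes this change and a working matplotlib install; the Agg backend and the variable names are picked only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

s = pd.Series([3, 2, 1], name="y")

# Method 1: errors given as a plain list with one entry per data point.
ax = s.plot(yerr=[0.3, 0.7, 0.4])
```

The returned Axes can then be saved or adjusted as with any other pandas plot.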

This is my first contribution. I tried to follow the contribution guidelines as best I could, but let me know if anything needs work!

@jreback
Contributor

jreback commented Dec 4, 2013

need a test that tests passing invalid error bars (u raise in the code - just need to exercise that)
also if invalid data is passed for series would it raise as well?

the presence of errorbar keywords.
'''
if (('yerr' in self.kwds) and (self.kwds['yerr'] is not None)) or \
(('xerr' in self.kwds) and (self.kwds['xerr'] is not None)):
Contributor

Is this equivalent to if self.kwds.get('yerr') or self.kwds.get('xerr')?

Contributor Author

Hmm, if I understand correctly, you're saying you could accomplish this more cleanly with:

yerr = self.kwds.get('yerr')
xerr = self.kwds.get('xerr')
if yerr is None and xerr is None:
    plotf = self.plt.Axes.plot
    plotf_name = 'plot'
else:
    plotf = self.plt.Axes.errorbar
    plotf_name = 'errorbar'
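
One reason to prefer the explicit `is None` comparison over a bare truthiness test: a falsy value such as `0` would be treated as "no error bars", and array-like objects (NumPy arrays with more than one element, DataFrames) raise `ValueError` when used in a boolean context. A minimal sketch of the difference, using hypothetical helper functions and a plain dict in place of the real `self.kwds`:

```python
def pick_plot_name(kwds):
    """Return 'errorbar' if either error keyword was supplied, else 'plot'."""
    yerr = kwds.get('yerr')
    xerr = kwds.get('xerr')
    if yerr is None and xerr is None:
        return 'plot'
    return 'errorbar'


def pick_plot_name_truthy(kwds):
    """Variant using truthiness; shown only to contrast the behavior."""
    if kwds.get('yerr') or kwds.get('xerr'):
        return 'errorbar'
    return 'plot'


print(pick_plot_name({'yerr': 0}))         # errorbar
print(pick_plot_name_truthy({'yerr': 0}))  # plot
```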

@r-b-g-b
Contributor Author

r-b-g-b commented Dec 4, 2013

@jreback I added some tests to make sure the right exceptions were being raised when invalid error arguments were passed.

I tested two cases where xerr/yerr arguments were:

  1. different lengths from the plotted series (ValueError)
  2. the wrong data type (TypeError)

Were these the kinds of cases you had in mind?
Both of these errors are raised in the underlying matplotlib code-- do you think it's necessary to catch them in the pandas plotting code? Also, are there any other tests you'd like to see?
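
If pandas were to validate the arguments itself rather than rely on matplotlib, the two checks might look roughly like this. `check_err` is a hypothetical helper written for illustration, not code from this PR, and it only handles flat sequences of numbers.

```python
from numbers import Real


def check_err(values, err):
    """Raise early if `err` cannot serve as symmetric error bars for `values`."""
    err = list(err)
    # Wrong data type: every entry has to be a real number.
    if not all(isinstance(e, Real) for e in err):
        raise TypeError("error values must be numeric")
    # Wrong length: one error value is needed per plotted point.
    if len(err) != len(values):
        raise ValueError(
            "expected %d error values, got %d" % (len(values), len(err))
        )
    return err
```

Catching these cases early would mainly buy a clearer error message; the exception types match the two cases tested above.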

@r-b-g-b
Contributor Author

r-b-g-b commented Dec 6, 2013

Just wanted to check in and see if there was anything you all thought needed work, since I'll have some time over the weekend to spend on this. Thanks!

@TomAugspurger
Contributor

I'll look at this more closely tomorrow.

Is this supposed to work?

In [26]: df
Out[26]: 
     x   y  error
0    0  12    0.4
1    1  11    0.4
2    2  10    0.4
3    3   9    0.4
4    4   8    0.4
5    5   7    0.4
6    6   6    0.4
7    7   5    0.4
8    8   4    0.4
9    9   3    0.4
10  10   2    0.4
11  11   1    0.4

[12 rows x 3 columns]

In [27]: df.plot(yerr='error')
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-27-476acf361e57> in <module>()
----> 1 df.plot(yerr='error')

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas/pandas/tools/plotting.py in plot_frame(frame, x, y, subplots, sharex, sharey, use_index, figsize, grid, legend, rot, ax, style, title, xlim, ylim, logx, logy, xticks, yticks, kind, sort_columns, fontsize, secondary_y, **kwds)
   1820                              secondary_y=secondary_y, **kwds)
   1821 
-> 1822     plot_obj.generate()
   1823     plot_obj.draw()
   1824     if subplots:

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas/pandas/tools/plotting.py in generate(self)
    876         self._compute_plot_data()
    877         self._setup_subplots()
--> 878         self._make_plot()
    879         self._post_plot_logic()
    880         self._adorn_subplots()

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas/pandas/tools/plotting.py in _make_plot(self)
   1369                         kwds['yerr'] = yerr[label]
   1370                 elif yerr is not None:
-> 1371                     kwds['yerr'] = yerr[i]
   1372 
   1373                 if isinstance(xerr, DataFrame):

IndexError: list index out of range

It plots the error bars for one of the Series. You may want to raise a TypeError here if you think that's ambiguous, or just apply the same error bars to each. I also think that yerr and xerr should accept a dict of column names, where the key is the particular series and the value is the column of errors to apply to that series.

@TomAugspurger
Contributor

In this case:

In [66]: df
Out[66]: 
     x   y  error
0    0  12    0.4
1    1  11    0.4
2    2  10    0.4
3    3   9    0.4
4    4   8    0.4
5    5   7    0.4
6    6   6    0.4
7    7   5    0.4
8    8   4    0.4
9    9   3    0.4
10  10   2    0.4
11  11   1    0.4

[12 rows x 3 columns]

In [67]: df_err
Out[67]: 
      x  y
0   0.2  2
1   0.2  2
2   0.2  2
3   0.2  2
4   0.2  2
5   0.2  2
6   0.2  2
7   0.2  2
8   0.2  2
9   0.2  2
10  0.2  2
11  0.2  2

[12 rows x 2 columns]

How do you decide to use just df_err['y'] for the bars? (I think that's what you're doing; I don't think the xs are being covered.)

Also, I'm thinking about a good way to accept asymmetric error bars. A sequence of tuples or two arrays of the same length (you might already handle this). More feedback tomorrow hopefully!
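
For the sequence-of-tuples form, matplotlib's `errorbar` takes asymmetric errors as two rows, lower offsets first and upper offsets second. A sketch of the conversion, with `pairs_to_rows` as a hypothetical helper name:

```python
def pairs_to_rows(pairs):
    """Turn [(lower, upper), ...] into [[lower, ...], [upper, ...]]."""
    lower, upper = zip(*pairs)
    return [list(lower), list(upper)]


rows = pairs_to_rows([(0.1, 0.3), (0.2, 0.2), (0.4, 0.1)])
print(rows)  # [[0.1, 0.2, 0.4], [0.3, 0.2, 0.1]]
```

The two-arrays form needs no conversion, since it is already in that layout.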

@r-b-g-b
Contributor Author

r-b-g-b commented Dec 9, 2013

Yes, I think broadcasting one error column to all of the data columns should be an option -- it should be doable by adding to the _parse_errorbars function.

It does prevent the user from being able to plot some data with error bars and some without. But in those cases, they can use the "key-matched error DataFrame" -- if a label is not present in that column, the data will be plotted without error bars. Overall, I like this method since it is most explicit and least prone to unintended consequences, I think.

And yes, being able to pass an error dict is a good idea. I changed the code to implement this by taking advantage of the syntactic similarity of DataFrames and dicts: if you have type(df) is DataFrame and type(d) is dict, then 'x' in df.keys(), 'x' in d.keys(), df['x'], and d['x'] all work. (I don't know if this is considered good/safe coding practice, though. Thoughts?)
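
The lookup only needs `in` and `[]`, so one code path can serve both containers. A sketch using a hypothetical `error_for` helper, shown with a plain dict so that it runs without pandas; a DataFrame could be passed the same way, since `label in df` tests the column labels:

```python
def error_for(label, errors):
    """Return the error values stored under `label`, or None if absent.

    `errors` may be any container supporting `in` and `[]` by label.
    """
    if errors is not None and label in errors:
        return errors[label]
    return None  # this column is drawn without error bars


errors = {'y': [0.3, 0.7, 0.4]}
print(error_for('y', errors))  # [0.3, 0.7, 0.4]
print(error_for('x', errors))  # None
```

Returning None for a missing label matches the behavior described above, where unmatched columns are plotted without error bars.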

I decided not to include x errors for bar plot because I don't think I've ever seen one with x errors, but you're probably right that they should be included. (I hadn't considered barh, and also, who am I to say you shouldn't have x error bars on vertical bar plots?)

As for asymmetrical error bars, I was thinking of implementing something like yerr_upper/yerr_lower since then you could organize it using error DataFrames/dicts like before, but it gets a bit messy. I'll give the method you suggested a shot -- thanks for your help!

@r-b-g-b
Contributor Author

r-b-g-b commented Dec 9, 2013

I added some documentation as well, but I only have a loose grasp of Sphinx, so I could have mangled it a bit (@TomAugspurger, I pretty much copied the structure from your hexbin-plot commit). Let me know if it needs work!

@@ -221,6 +221,11 @@ Improvements to existing features
MultiIndex and Hierarchical Rows. Set the ``merge_cells`` to ``False`` to
restore the previous behaviour. (:issue:`5254`)
- The FRED DataReader now accepts multiple series (:issue`3413`)
- DataFrame/Series .plot() functions support plotting with error bars by
Contributor

This will need to be moved to the .14 section once that is created.

@TomAugspurger
Contributor

Sorry it took so long to get back to this; it fell off my radar. Most of my comments are inline.

I need a bit longer to look at your changes to get_plot_function.

Also there's a lot of repeated code. There are blocks that do something for yerr and then do something for xerr. See if you can factor those into their own function or closure.

Same thing for _parse_errorbars and _parse_errorbars_for_series. It would be better to do as much of that in one place as possible. I haven't had a close look yet, but I'll see if there's a way to reduce the repetition.

Thanks for doing this. It's a pretty tricky API to work out, but I think it looks pretty good so far.

@jreback
Contributor

jreback commented Feb 16, 2014

@gibbonorbiter can you rebase this so we can take a look?

@r-b-g-b
Contributor Author

r-b-g-b commented Feb 17, 2014

Thanks for the comments so far. I think it's possible to do away with _parse_errorbars_for_series. I had been using it when calling plot_series, but all of the use cases can be dealt with in the _parse_errorbars call that takes place in the _make_plot function.

One case where it is still useful is pretty specific: calling df.plot() (which uses plot_frame) and specifying both the y-data and the yerr/xerr data with strings (or ints) indicating columns from that DataFrame (the y and yerr keyword arguments). The problem is that only the y-data column is passed on to the resulting plot_series call; the rest of the DataFrame does not make it through, so downstream code does not have access to the error column(s). So either the error columns need to be broken out of the DataFrame before the call to plot_series (that's what _parse_errorbars_for_series was doing, but it could be done without farming out to a special function), or I might have to make some more drastic changes to the code. Or this functionality could be abandoned; the user would just have to say yerr=df['the_err_column'] instead of yerr='the_err_column'. Not too much to ask. At this point, I agree that _parse_errorbars_for_series should go, and we should find a workaround.

I did manage to find a SublimeText plugin that highlights and removes trailing whitespace, thanks for pointing that out, I had no idea :)

Also, I did a rebase but I'm still a little weak on the git-fu, hopefully I didn't mangle it too badly.

@jreback
Contributor

jreback commented Mar 9, 2014

@gibbonorbiter can you rebase....

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 10, 2014

There were some hairy merge issues with the docs, so I omitted those commits until the code checks out. Let me know if I need to change anything!

@jreback
Contributor

jreback commented Mar 10, 2014

@TomAugspurger can u review when u have a chance
thanks

@jreback
Contributor

jreback commented Mar 11, 2014

@gibbonorbiter also pls squash this down to a smaller number of commits as well

@jreback
Contributor

jreback commented Mar 11, 2014

going to need an entry in release.rst (under improvements), and at least a 1-liner in v0.14.0.txt. optional would be to include a graphic of this (if you think it would materially add to the whatsnew). And pls add a small section in the plotting.rst (here I would put an example though). You can doc in this PR.

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 11, 2014

@jreback sounds good. would you prefer to have it squashed down to just one commit with a title like "VIS: added ability to plot DataFrames and Series with errorbars"?

@jreback
Contributor

jreback commented Mar 11, 2014

a small number is fine since you made a lot of changes. (1 ok too!) usually I try to do them logically, e.g. tests in 1, changes in another, docs in another. but usually too much work to do that.

@TomAugspurger
Contributor

@gibbonorbiter from the example in your original post:

[image: plot generated by the example from the original post]

It looks like the ylim isn't being adjusted. You can document that and leave it as another issue if you want (It's probably impossible to come up with a perfect solution, so maybe make a note and let the user pick the limits?)
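
Until the limits are adjusted automatically, a user could compute them from the data and the errors and pass the result through the existing ylim argument. A rough sketch for symmetric errors; `limits_with_errors` and its `pad` parameter are hypothetical names introduced here:

```python
def limits_with_errors(values, errors, pad=0.05):
    """Return (low, high) covering every value +/- its error, plus padding."""
    low = min(v - e for v, e in zip(values, errors))
    high = max(v + e for v, e in zip(values, errors))
    # Add a small margin so the caps of the bars are not cut off at the edge.
    margin = (high - low) * pad
    return low - margin, high + margin


ylim = limits_with_errors([3, 2, 1], [0.3, 0.7, 0.4], pad=0)
```

The result could then be handed to the plot call, for example as `ylim=ylim`.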

@TomAugspurger
Contributor

I'm compiling a list of what should / shouldn't work as far as the types of df and xerr/yerr go. Let me know if I'm missing anything.

List of Supported APIs

self is df, other is *err.

  1. self is a DataFrame
    a. *err is a DataFrame with matching columns, values are widths
    b. *err is a str, indicating a column in df
    c. *err is a dict, keys matching df.columns, values are widths
    d. *err is a DataFrame with partially matching columns. Fails with TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
    after plotting the errorbars / line for the one that does match. This should either raise or pass, plotting line + error bars for the matches and just lines for the non-matching columns.
    e. *err is a DataFrame with no matching columns. Fails with
    TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
    f. self is an MxN DataFrame, other is an MxN array of values (unlabeled). Fails with ValueError: In safezip, len(args[0])=3 but len(args[1])=2. It's OK that this doesn't work. May want to catch (if possible) and report a better error.
    h. self is an MxN DataFrame, other is an NxM array. Plots correctly.
    g. self is an MxN DataFrame, other is an Nx2xM array. plots asymmetrical error bars.
  2. self is a Series
    a. *err is a DataFrame, self.name is set, matches col in other, values are widths
    b. *err is a DataFrame, self.name is set, doesn't match col in other (Type error (adding str + int) right now, should raise ValueError?)
    c. *err is a DataFrame, self.name is None, same behavior as 2b.
    d. *err is a Series with other.name set, self.name is None, the name on other is ignored (this is probably ok.)
    e. *err is a Series with other.name set, self.name is set, but doesn't match other.name. The names are ignored and it works (probably ok?)
    f. self.name is set or not, *err is arraylike (works).
    g. self is M, and *err is an NxM array. This plots, but I'm not sure what values. Just the first row of *err?
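
For the asymmetric case 1g, an array with the Nx2xM shape could be assembled as in the sketch below. This only illustrates the layout (one pair of lower/upper rows per column); the variable names and values are made up for the example.

```python
import numpy as np

n_points, n_columns = 3, 2  # M rows and N columns in the notation above

lower = np.full((n_columns, n_points), 0.1)  # offsets below each point
upper = np.full((n_columns, n_points), 0.3)  # offsets above each point

# Stack to shape (N, 2, M): one (lower, upper) pair of rows per column.
err = np.stack([lower, upper], axis=1)
print(err.shape)  # (2, 2, 3)
```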

Concerns

  • What to do about the other axis? What if only some of the index labels match (we should only use those).
  • When self is a Series, other is a DataFrame, and no columns match: TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'
  • Differing lengths: raises ValueError: In safezip, len(args[0])=3 but len(args[1])=2. Catch earlier (style thing maybe?).
  • Maybe it would be better (easier to implement) to be strict about having names always match?
  • I would ignore asymmetrical error bars for now. That's a good enhancement for the future, but you probably want to get this merged first (I want it merged; doing error bars manually sucks). (EDIT: Looks like you support it already, I added cases 1h and 1g above).

My biggest concern is the first one, checking the index labels.

I'm going to look into the code now.

str: the name of the column within the plotted DataFrame
'''

error_dim = error_dim[0]
Contributor

What values does error_dim take other than x and y?

Contributor

Since error_dim will only ever be "x" or "y", you shouldn't need that line. "x"[0] is the same as "x".

@TomAugspurger
Contributor

If we do want to respect the index labels as well when *err is a Series or DataFrame, you can use err.reindex_axis(self.index) or something like that. Example:

In [199]: df
Out[199]: 
          0         1         2
0 -0.890001 -0.943281  0.432457
1 -1.923215  0.585109 -1.007599
2  0.142306 -0.138677 -0.161226
3 -0.318436  1.002733 -1.801434
4  1.151089  0.035163 -0.778506

[5 rows x 3 columns]

In [200]: yerr
Out[200]: 
1    0.432457
2    1.007599
3    0.161226
4    1.801434
5    0.778506
Name: 2, dtype: float64

In this case, the error bars would only be plotted for index labels [1, 2, 3, 4]. 0 and 5 would have no error bars. You could use err.reindex_axis(self.index).fillna(0). Then everyone has error bars, just some have length 0 (that might work, haven't tested).
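
The alignment step can be tried on its own. This sketch uses `reindex`, which is how recent pandas versions spell the operation, and made-up values shaped like the example above:

```python
import pandas as pd

data = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]})  # index 0..4
yerr = pd.Series([0.4, 1.0, 0.2, 1.8, 0.8], index=[1, 2, 3, 4, 5])

aligned = yerr.reindex(data.index)  # label 0 becomes NaN, label 5 is dropped
filled = aligned.fillna(0)          # zero-length bar instead of a missing one

print(aligned.tolist())  # [nan, 0.4, 1.0, 0.2, 1.8]
print(filled.tolist())   # [0.0, 0.4, 1.0, 0.2, 1.8]
```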

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 13, 2014

Thanks for the comments @TomAugspurger. I just pushed some changes allowing for specifying errors for only a subset of the columns and I'll work on the rest as I get some time. And yes, error_dim can only be x and y. I just have it there to avoid duplicating code for the x and y errors, but do you have a better way in mind?

@TomAugspurger
Contributor

This is looking really good. For this index label matching, I think something like

        def match_labels(data, err):
            # align the errors with the data's index; missing labels get 0
            err = err.reindex_axis(data.index).fillna(0)
            return err

        if isinstance(err_kwd, dict):
            err = err_kwd

        elif isinstance(err_kwd, DataFrame):
            err = match_labels(self.data, err_kwd)

        # Series of error values
        elif isinstance(err_kwd, Series):
            # align on the index first, then broadcast across all series
            err = match_labels(self.data, err_kwd)
            err = np.atleast_2d(err.values)
            err = np.tile(err, (self.nseries, 1))

will work.
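
The broadcast step for a single Series of errors can be checked in isolation. A small sketch with made-up values, where `nseries` stands in for `self.nseries`:

```python
import numpy as np

err_values = [0.1, 0.2, 0.3]  # one error per row of the data
nseries = 2                   # stand-in for self.nseries

# Promote to a single row, then repeat that row once per plotted series.
err = np.tile(np.atleast_2d(err_values), (nseries, 1))
print(err.shape)  # (2, 3)
```

Each row of the result holds the same errors, so every plotted column gets identical bars.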

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 17, 2014

Thanks for that nice fix @TomAugspurger. It looks like plt.errorbar can also support np.nan entries, and handles them by not drawing any errorbars at all, so I took out the .fillna(0). Still seems to work fine, but did you have something else in mind when you added that?

@TomAugspurger
Contributor

That's good that it handles NaNs. That's the outcome that I wanted.

This looks like it's about there. Could you add a bit of documentation?

  • a one-line note in doc/source/release.rst
  • a bit more in doc/source/v0.14.0.txt, including an example if you want
  • an example in doc/source/visualization.rst as a new subsection under "Basic plotting: plot".

Then we can get this merged!

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 17, 2014

Ok, just added some basic documentation. I tried to render them using doc/make.py and it looks like it worked, but you might want to look it over since I've never done it before. Let me know if you'd like anything changed!

@TomAugspurger
Contributor

Looks like some other commits got into the PR. Did you merge master into your branch? Could you revert back to before that, then rebase on top of master? Let me know if you have any issues.

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 18, 2014

Shoot, I might need some git guidance to fix this one. Can you explain in a little more detail what I should do?

@TomAugspurger
Contributor

Sure thing. Did you do a git merge upstream/master, a git rebase upstream/master, or a git pull?

Also, do a git reflog and paste the top 5 or so entries in here. We're going to reset back to where you were before merging in the other commits.

Then do a rebase:

git rebase -i upstream/master

Change the picks to squash or fixup. You may have already done this. Just get it down to 1 or 2 picks for the actual code changes and the docs.

@TomAugspurger
Contributor

Pretty much, you'll want to take the hash of your last good commit. Reset to that with

git reset --hard <hash>

Then do the git rebase -i. Then force push your branch, with git push -f origin.

TomAugspurger pushed a commit that referenced this pull request Mar 18, 2014
VIS: added ability to plot DataFrames and Series with errorbars
@TomAugspurger TomAugspurger merged commit 08e0a96 into pandas-dev:master Mar 18, 2014
@TomAugspurger
Contributor

@gibbonorbiter Looks good. Thanks for submitting this!

@r-b-g-b
Contributor Author

r-b-g-b commented Mar 19, 2014

Thank you all for your help!

Successfully merging this pull request may close these issues.

VIS: errorbar plotting
3 participants