BUG: strange timeseries plot behavior #6608
I guess if I do

```python
data.pivot('date', 'region', 'value').interpolate().plot()
```

or

```python
pivoted = data.pivot('date', 'region', 'value')
pivoted.index = pd.to_datetime(pivoted.index)
pivoted.interpolate(method='time').plot()
```

I get something like what I wanted, as it cleans up the missing values. `interpolate` is a new feature I hadn't seen before. (cool!) Maybe this is all user error, but I had been doing groupby plotting like this for a while and had been getting what looked like correct results. I feel there may actually be a bug here someplace; that groupby behavior seems so bizarre.
Actually, interpolation is not really what I want, as it makes it seem as if there is more data than is actually present. I basically just want to see the various region series all plotted as they would be if plotted individually, except all together on the same axes. (And a loop like

```python
figure = plt.figure()
ax = figure.gca()
data = pd.read_csv('./data.csv', parse_dates=['date'])
for region in data.region.unique():
    subset = data[data.region == region]
    subset = subset.set_index('date')
    subset.value.plot(ax=ax, label=region)
```

seems to just overwrite the axes.)
I think the groupby/plot issue certainly seems like a bug. I can't fully put my finger on it, but I think it has something to do with combining regular and irregular timeseries. The issue with the xlim not respecting the data is that it is updated by the last group (while this is a smaller group), and that is a separate issue I think (you can open another issue for that). The reason you get almost no points on the plot with
Thanks @jorisvandenbossche, it looks like the "separate issue" is already filed as #2960. So this one is just the groupby/plot weirdness. Did you mean to add to your comment? (it ends in what looks like the start of some code)
@rosnfeld ah yes, I first wanted to add a code snippet showing how to do your last example more easily, but it also had the same bug, and I forgot to remove it. Removed it now.
so this is the exception that's in the top section? what is causing this?
Yeah, the exception is the most alarming thing, though changing the data slightly causes some other incorrect behavior (missing/incorrect plotting, which is harder to spot/diagnose). I don't know what's causing it without further investigation, but I can try to investigate and hopefully submit a fix. (it will be my first time digging into the plotting code, not sure how involved it is)
The error is occurring in

```python
def _get_xlim(lines):
    left, right = np.inf, -np.inf
    for l in lines:
        x = l.get_xdata()
        left = min(x[0].ordinal, left)
        right = max(x[-1].ordinal, right)
    return left, right
```

The line that fails is the `.ordinal` access. I'm not too familiar with our Datetime code, but does anyone know why these aren't the same?

```python
In [40]: import datetime

In [41]: b = datetime.datetime(1997, 12, 31, 0, 0)

In [42]: y = pd.Period(year=1997, month=12, day=31, hour=0, minute=0, freq='S')

In [43]: b.toordinal()
Out[43]: 729389

In [44]: y.ordinal
Out[44]: 883526400
```
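For what it's worth, the two numbers live on entirely different scales: `datetime.toordinal()` counts days since 0001-01-01, while a second-frequency `Period` ordinal counts seconds since the Unix epoch. A small sketch (my own check, not taken from the thread) that reconciles the two values above:

```python
# Hedged sketch: reconcile datetime.toordinal() (days since 0001-01-01)
# with Period(freq='S').ordinal (seconds since 1970-01-01).
import datetime

import pandas as pd

d = datetime.datetime(1997, 12, 31)
p = pd.Period(d, freq="S")

days_since_epoch = (d - datetime.datetime(1970, 1, 1)).days               # 10226
assert d.toordinal() == datetime.date(1970, 1, 1).toordinal() + days_since_epoch  # 729389
assert p.ordinal == days_since_epoch * 86400                              # 883526400
```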
Some time ago I looked into this (for another issue), and one of the fundamental problems then was that the design of pandas plotting for timeseries is split in two ways: with datetimes for irregular series, and with ordinals for regular timeseries (which are converted to a PeriodIndex), and those two types are incompatible with each other (so when you combine them in a certain way it gives problems). But I have to dig it up again to fully remember (I have an overview of the problem somewhere, but never finished it). I don't know if I'll find some time in the short term, but I will try.
And the
Yes, I looked at this a little last night, and agree with what you're saying: when you whittle the dataset down to just COL/PAN/VEN rows and then plot, COL and VEN get converted to have PeriodIndexes, but PAN stays with a DatetimeIndex for some reason, and then plotting them all on the same axes (via groupby) blows up somehow.
Thanks. @rosnfeld you may want the `x_compat` option.
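For reference, `x_compat` can be passed straight to the plot call; a minimal sketch with made-up stand-in series (not the data from this issue):

```python
# Hedged usage sketch: x_compat=True skips pandas' period-based time series
# machinery and plots plain datetimes on the x-axis, sidestepping the
# mixed PeriodIndex/DatetimeIndex conflict discussed here.
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()

ser_a = pd.Series([1, 2, 3], index=pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"]))
ser_b = pd.Series([1, 2, 3], index=pd.DatetimeIndex(["1997-12-31", "2003-12-31", "2008-12-31"]))

ser_a.plot(ax=ax, x_compat=True)
ser_b.plot(ax=ax, x_compat=True)
```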
Indeed, thanks! I actually hadn't seen that option before. It also fixes the "missing series" variant I mentioned in the original description. The bad xlim variant for the original dataset still remains, though. I'll try to dig into things a bit and see what can be done; I presume that not requiring x_compat is desirable.
Yeah it would be desirable. May be tricky though. I'm guessing that argument was added for cases precisely like this one.
For a bit more detail, I think this is what's going on: pandas uses special timeseries plotting if it can infer a "periodic" frequency from a series. While it takes a bit of digging, this is part of `_make_plot`:

```python
def _make_plot(self):
    # this is slightly deceptive
    if not self.x_compat and self.use_index and self._use_dynamic_x():
        data = self._maybe_convert_index(self.data)
        self._make_ts_plot(data)
    else:
        ...  # regular plotting
```

This special tseries logic converts plotted series to use a PeriodIndex, and sets a "base" version of the frequency on the axes object for later reference. (Note that x_compat disables all of this and uses regular, non-tseries plotting.)

The first series to be plotted in my dataset (COL) gets a frequency of '5A-DEC', which can be converted to a period. In the timeseries plotting code the "base" version of this frequency ('A-DEC') gets assigned to the axes object.

The 3-item DatetimeIndex of 1997-12-31, 2003-12-31, and 2008-12-31 for the next series (PAN) has a surprising inferred frequency of 'WOM-5WED', since 1997, 2003, and 2008 all ended on a Wednesday (the 5th Wednesday of December). pandas can't convert frequencies like that to periods, so it uses regular plotting rather than the special timeseries plotting for that series, and its index is not converted to a PeriodIndex. It doesn't try to use the axes frequency since it has already inferred a frequency for this series.

The next and final series (VEN) does not have an inferred frequency, so it inherits the axes frequency and tries to use tseries plotting again. tseries plotting tries to recalculate xlims to include all data, so it looks at the lines already plotted, but it assumes all existing lines will have PeriodIndex data. It blows up when it tries to call 'ordinal' on the DatetimeIndex entries from the earlier (PAN) series.

I'm not sure what the right fix is here. Frequency inference clearly makes some interesting choices that are relied on in other parts of the codebase, and I'm not sure whether the inference itself or its usage should be modified. Timeseries plotting should maybe tolerate non-PeriodIndex data when calculating xlim, though I don't understand much of that code yet, e.g. why a PeriodIndex is desirable.
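To make that branching concrete, here is a small, hedged illustration of the frequency inference involved (the exact alias strings vary between pandas versions):

```python
# Hedged illustration: one index is period-convertible, the other gets a
# week-of-month rule that is not, so the two series take different plotting paths.
import pandas as pd

annual = pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"])
print(pd.infer_freq(annual))   # 'A-DEC' (yearly): convertible, so tseries plotting is used
print(annual.to_period())      # PeriodIndex backed by integer ordinals

wom = pd.DatetimeIndex(["1997-12-31", "2003-12-31", "2008-12-31"])
print(pd.infer_freq(wom))      # 'WOM-5WED' per the discussion above: not period-convertible
```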
@rosnfeld so this occurs when you have multiple series overlaid on the same plot and one is converted to a PeriodIndex for display while another is not. can you edit the top of the post to make it easily copy-pastable for the failing case?
I added an even simpler example at the start of the post. No groupby or anything, just plot a timeseries with an inferred frequency that can be converted to a period, then one that can't, then the first one again, all on the same axes, and you get the same stack trace. Hope that's along the lines of what you were looking for.
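The simplified repro added to the top of the post is along these lines (a reconstruction from the description above; the exact values and dates are assumptions):

```python
# Reconstruction (assumed data) of the simplified failing case: a series whose
# frequency can be inferred and converted to periods, then one that cannot
# ('WOM-5WED'), then the first again, all on shared axes.
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()

regular = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"]),  # 'A-DEC'
)
irregular = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.DatetimeIndex(["1997-12-31", "2003-12-31", "2008-12-31"]),  # 'WOM-5WED'
)

regular.plot(ax=ax)
irregular.plot(ax=ax)
regular.plot(ax=ax)  # historically raised in _get_xlim via the .ordinal access
```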
ok...I guess the solution is for the plotting routines to check if there is already a plot on the axis with a conflicting axis/index, then handle the current plotting better. I am not sure if this involves too much introspection or is even possible (e.g. you would have to get the index state from the axis, and I'm not sure enough is saved to be able to figure out what is up). @rosnfeld give it a shot?
moving to 0.15, but if you are able to figure it out soon we can move it back
Sure, I can take a shot at it. I'm optimistic something can be done.
Well, regular vs irregular series have pretty different ordinals, as in @TomAugspurger's comment above, so I think the problem is unfortunately deeper than just xlim/representation. A solution might be to rework
@TomAugspurger push?
@TomAugspurger status (pushing #7670) ok, so push this as well
It looks like this issue has been solved in the meantime. At least the simplified example at the top now works correctly for me. @rosnfeld would you be able to test with your more complex example as well?
Yes! I tested with the more complex example and everything works now. (as of 0.18.1)
I see this is still open; should I close it? Or do people want some unit tests to explicitly protect against this happening again? Unfortunately, given our (or at least my) incomplete understanding of why it was happening and how it has since been fixed, perhaps the best we could do is write a test that would have failed against code from a couple of years ago. Not sure what community practice is on things like this.
Yes, I would first like to see a test added to confirm this (and keep it working!). A PR would be very welcome!
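A possible shape for such a regression test, sketched here under assumed names and data (not necessarily the test that ultimately landed in pandas):

```python
# Hedged sketch of a regression test for this issue: mixing a period-convertible
# series with a non-convertible one on shared axes should not raise.
import matplotlib

matplotlib.use("Agg")  # run headless
import matplotlib.pyplot as plt
import pandas as pd


def test_mixed_freq_shared_axes_does_not_raise():
    regular = pd.Series(
        range(3), index=pd.DatetimeIndex(["2000-12-31", "2001-12-31", "2002-12-31"])
    )
    irregular = pd.Series(
        range(3), index=pd.DatetimeIndex(["1997-12-31", "2003-12-31", "2008-12-31"])
    )
    _, ax = plt.subplots()
    regular.plot(ax=ax)
    irregular.plot(ax=ax)
    regular.plot(ax=ax)  # used to blow up in _get_xlim
    plt.close("all")
```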
After some discussion below, here's a simple repro case:
causes
-- ORIGINAL MESSAGE --
Here's a small dataset:
If I read this in using
and then try and plot it using
I get more or less what I expect (perhaps the xlim doesn't go up to 2010-12-31, but otherwise fine).
If I delete out the BRA rows and try this again, I get:
If I delete out both BRA and VEN rows, then there is no exception raised but I only see one series plotted and the x-axis is not formatted as a date.
One could also approach this whole exercise via something like
but this works even worse, I just get a truncated VEN series and nothing else.
This is with current master pandas (but also happens in 0.13.1) and matplotlib 1.3.1.
Are there known issues with plotting sparse-yet-overlapping timeseries?
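The code and data snippets in the original message were lost when the thread was archived. A hypothetical reconstruction of the workflow it describes (the file name and groupby-based plotting call are assumptions; the column names come from snippets quoted earlier in the thread):

```python
# Hypothetical reconstruction of the original workflow: read a long-format CSV
# of (date, region, value) rows and plot each region's series on shared axes.
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("./data.csv", parse_dates=["date"])

fig, ax = plt.subplots()
data.set_index("date").groupby("region")["value"].plot(ax=ax, legend=True)
plt.show()
```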