
Plotting 128Hz timeseries crashes #20575


Open
alexlouden opened this issue Apr 2, 2018 · 9 comments

@alexlouden

Code Sample

import numpy as np
import pandas as pd

start = 0
freq = 128
samples = freq * 10  # 10 seconds
data = np.random.random(samples)
index = pd.date_range(start, periods=samples, freq='{}S'.format(1/freq))
ts = pd.Series(data=data, index=index, name='High freq')
ts.plot()

Problem description

Plotting a 10 second, 128Hz timeseries uses up all my RAM and then crashes. 100Hz works fine; 128Hz and 150Hz (6666666N) crash.

[Screenshot attached: screen shot 2018-04-02 at 1:43:25 pm]

Expected Output

Ideally, a plot of my data. I wouldn't expect plotting 1280 points to require > 100GB of RAM! 😄
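A possible workaround while this is open, assuming the goal is just to get the data on screen: bypass pandas' period-based axis handling, either by plotting through matplotlib directly or by passing x_compat=True. This is a sketch of the workaround, not a fix for the underlying bug; Nano(7812500) is used in place of the '{}S'.format(1/freq) string for the same 1/128 s period.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

freq = 128
samples = freq * 10  # 10 seconds
data = np.random.random(samples)
# Nano(7812500) is the same 1/128 s period as '0.0078125S'
index = pd.date_range(0, periods=samples,
                      freq=pd.tseries.offsets.Nano(7812500))
ts = pd.Series(data=data, index=index, name='High freq')

fig, ax = plt.subplots()
# Workaround 1: plot through matplotlib directly, skipping pandas'
# period conversion entirely
ax.plot(ts.index, ts.values)

# Workaround 2: pandas' x_compat option also skips the period-based
# tick locators and falls back to matplotlib's date handling
ts.plot(x_compat=True, ax=ax)
```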

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Darwin OS-release: 17.4.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
matplotlib: 2.2.2

@jorisvandenbossche
Member

For me it luckily raises a MemoryError directly, without crashing ...

The reason for the large memory usage is that it tries to generate a huge arange:

-> 1110         data = np.arange(start.ordinal, end.ordinal + 1, mult, dtype=np.int64)
   1111 
   1112     return data, freq

ipdb> start.ordinal
-499609375
ipdb> end.ordinal
10491796875
ipdb> mult
1

What seems off in this case is the mult of 1. The reason it needs to work at nanosecond resolution is the 'strange' frequency:

In [64]: ts.index.freq
Out[64]: <7812500 * Nanos>

But then it should generate a range with a step of 7812500 instead of 1. So somewhere in the plotting code, the multiplier of the freq is lost.
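For scale, the ordinals from the ipdb session above imply an arange of roughly 11 billion int64 values, on the order of 88GB, which matches the reported memory blow-up. A back-of-the-envelope sketch:

```python
# Ordinals taken from the ipdb session above (nanosecond resolution)
start_ordinal = -499609375
end_ordinal = 10491796875
mult = 1

n_elements = (end_ordinal + 1 - start_ordinal) // mult
print(n_elements)            # 10991406251 elements
print(n_elements * 8 / 1e9)  # ~87.9 GB of int64 data
```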

@alexlouden
Author

Thanks for the update, and for having a look into it!

@fredrik-1

I tried to look into this but I didn't get very far. I believe it might be a consequence of how the handling of the time index (x-labels, tick labels, etc.) is implemented.

I don't think it is possible for mult in
data = np.arange(start.ordinal, end.ordinal + 1, mult, dtype=np.int64)
to be 7812500. The actual frequency of the series is simply not available in that part of the code. A frequency of 100Hz also results in mult=1 and data ten times longer than necessary.

So I believe the implementation might be correct, but it doesn't work in practice when the calculations have to be done in nanoseconds while the actual sampling period is much larger.
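For context on why the two rates behave so differently, assuming nothing beyond basic arithmetic: a 1/128 s period only lands exactly on the nanosecond grid, while 1/100 s reduces to whole milliseconds, so 100Hz merely produces a somewhat oversized range instead of a catastrophic one:

```python
# 1/128 s is exactly 0.0078125 s = 7_812_500 ns, so the index
# frequency becomes <7812500 * Nanos> and ordinals are in nanoseconds
print(int((1 / 128) * 1e9))    # 7812500

# 1/100 s is 10 ms, so ordinals are in milliseconds; with mult=1
# a 10 s plot range is only 10_000 steps (10x the samples, harmless)
print(round((1 / 100) * 1e3))  # 10
```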

@jorisvandenbossche
Member

@fredrik-1 Thanks for looking into it. I think you are right that this is just due to how it is currently implemented: if the freq is given in nanos, the frequency used for the plot axis is 1 nano, regardless of how many nanos the actual frequency spans.
The consequence is that plotting does not work with a freq of such a huge number of nanos.

@jorisvandenbossche
Member

The question then is: how can this be solved?
We could look into whether it is possible to actually take the multiplier in the freq into account, but I'm not sure what the consequences of that would be in general.

@fredrik-1

I tried to look a little more at the code.

The problem (or feature) seems to be that TimeSeries_DateLocator (I don't know where it is called from, implicitly from matplotlib?) uses a freq variable that is a plain string carrying only the nano information. freq is then turned into a pd.tseries.offsets.Nano object with n=1, whereas the Nano object on the actual data has n=7812500.
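The n=1 versus n=7812500 mismatch can be seen on the offset objects themselves. A minimal sketch (the exact alias spelling varies by pandas version):

```python
import pandas as pd
from pandas.tseries.offsets import Nano

# The index's own offset keeps the multiplier
idx = pd.date_range(0, periods=4, freq=Nano(7812500))
print(idx.freq.n)  # 7812500

# Rebuilding an offset from only the base alias, as the plotting
# code effectively does, yields n=1
print(Nano().n)    # 1
```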

@TomAugspurger
Contributor

@fredrik-1 I haven't read this issue closely, but if it helps: TimeSeries_DateLocator (defined as class TimeSeries_DateLocator(Locator):) is registered as a units converter with matplotlib.

> I don't know from where it is called, implicitly from matplotlib

Correct. I believe it's called every time the figure is drawn (so if the plot is interactively modified, it's called each time).

@fredrik-1

fredrik-1 commented Apr 6, 2018

I tried to debug some more. It seems that the actual frequency multiplier (the 100 in front of milli, for example) is thrown away several times in the code by the use of
freq = freq.rule_code
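The loss can be reproduced on any multiplied offset; for example, with the 100ms sampling period of a 10Hz series (a sketch; the alias string returned differs across pandas versions):

```python
from pandas.tseries.offsets import Milli

off = Milli(100)      # the 100 ms sampling period of a 10Hz series
print(off.n)          # 100
print(off.rule_code)  # base alias only: the 100 multiplier is gone
```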

I also found that .to_period() doesn't work on data with a "special" frequency.

import pandas as pd
import numpy as np

freq = 10
samples = freq * 1
data = np.random.random(samples)
index1 = pd.date_range(0, periods=samples, freq='{}S'.format(1/freq))
ts1 = pd.Series(data=data, index=index1, name='High freq')
index2 = pd.period_range(2000, periods=samples, freq='{}S'.format(1/freq))
ts2 = pd.Series(data=data, index=index2, name='High freq')
print(ts1.index)
print(ts2.index)    

works as expected, but
ts1.to_period()
throws an error because "100L" is not in
_offset_to_period_map in pandas/_libs/tslibs/offsets.pyx.

The first call below works. The second one also runs, but the frequency of the resulting index does not match the actual frequency of the data.

tsPeriod1 = ts1.to_period(freq='100L')
print(tsPeriod1.index)
tsPeriod2 = ts1.to_period(freq='L')
print(tsPeriod2.index)

Series with a DatetimeIndex or PeriodIndex whose frequency is not one of the standard frequencies (1 second, 1 millisecond, 1 day, etc.) don't seem to be supported by all functions.

@TomAugspurger
Contributor

TomAugspurger commented Apr 12, 2018 via email
