
pd.concat loses frequency attribute for 'continuous' DataFrame appends #3232

Closed
nehalecky opened this issue Apr 1, 2013 · 10 comments · Fixed by #25796
Labels
Bug · Datetime · Reshaping

Comments

@nehalecky
Contributor

Hey all,

I have a DataFrame (df) that stores live sensor data captured at a specific frequency. New raw data from the sensor arrives at a set interval (an attempt at bandwidth conservation) and is parsed into a new df object.

These new update DataFrames have the same frequency and contain data that is 'continuous' in time (i.e., they pick up right where the last timestamp left off). Ultimately, I would like to append this new data to the existing DataFrame while preserving the main DataFrame's frequency attribute. I tried a concat of the old and new DataFrames; however, it seems that concat doesn't check for this continuous time series case, and the result loses its frequency attribute. This can be reproduced with the code below:

import pandas as pd
import numpy as np
dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: 50L, Timezone: UTC

These guys look good:

# Preserves frequency
print(df[:50].index)
print(df[50:].index)
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:02.450000]
Length: 50, Freq: 50L, Timezone: UTC
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:02.500000, ..., 2013-01-01 00:00:04.950000]
Length: 50, Freq: 50L, Timezone: UTC

However, these guys, together, forget where they came from:

# Loses frequency
pd.concat([df[:50], df[50:]]).index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: None, Timezone: UTC

I currently get around this by resampling the resulting df to set the frequency, which isn't that big of a deal; however, I thought I'd mention it so that more elegant behavior could be implemented. A quick sketch of the workaround follows. I'll try to take a look when I have time, but I know that all of you here are so much more familiar with pandas internals. Any pointers?
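
For reference, a minimal sketch of that resample workaround (written with current pandas syntax; the '50L' frequency matches the example above):

import pandas as pd
import numpy as np

dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)

combined = pd.concat([df[:50], df[50:]])
print(combined.index.freq)  # None -- freq is lost

# Resample at the known frequency; the data already sit on the 50ms
# grid, so this is effectively a no-op that restores the attribute.
restored = combined.resample('50L').mean()
print(restored.index.freq)  # <50 * Millis>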
And, as always, thank you! :)

@jreback
Contributor

jreback commented Apr 2, 2013

@nehalecky as an aside, this is a good case for appending your data with an HDFStore; you then get query semantics (and it's saved, too!): http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables (and some more advanced examples: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#cookbook-hdf)
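
A minimal sketch of that append-and-query pattern (the file name and key here are placeholders):

import pandas as pd
import numpy as np

with pd.HDFStore('sensor.h5') as store:
    dr = pd.date_range('01-Jan-2013', periods=50, freq='50L', tz='UTC')
    chunk = pd.DataFrame(np.random.randn(50, 2), index=dr)
    store.append('sensor', chunk)  # creates a table on the first call
    # later updates simply append: store.append('sensor', next_chunk)

    # and the store is queryable:
    subset = store.select('sensor', where='index >= "2013-01-01"')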

@nehalecky
Contributor Author

Thanks @jreback for the note. I do use HDFStore when I persist data to my local machine, and it works great for that; however, the application I'm referring to above persists data to a remote store, which will eventually have to scale horizontally. For both of those reasons, HDF5 isn't an option, unfortunately. :(

Actually, for preprocessing analysis, I am storing the raw record data as heavily compressed HDF5 binary in the db now (via pandas HDFStore). This allows me to retrieve individual records and load them directly into a DataFrame, tying directly into my analysis stack, which is nice. I'm really looking forward to whatever solutions are implemented for binary storage of DataFrames (#686 and all), but this is how I'm rolling for now. ;)

BTW, this writeup (http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables) is awesomeness, and your SO answer for pandas workflow is off the charts (http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas), thank you!

@jreback
Contributor

jreback commented Apr 3, 2013

glad to hear the docs are useful!

Here's another resource: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore

Contributions needed!

If you have the time, http://msgpack.org is probably a reasonable format (though it doesn't support compression directly); I think it's a good choice, probably simple to implement, and storable in a db.
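
For what it's worth, a rough sketch of that idea using the msgpack-python package (a hand-rolled format: the field names are made up, and dtype/timezone handling is glossed over):

import msgpack  # pip install msgpack
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'],
                  index=pd.date_range('2013-01-01', periods=4, freq='50L'))

packed = msgpack.packb({
    'values': df.values.tobytes(),       # float64 assumed
    'shape': df.shape,
    'columns': list(df.columns),
    'index_ns': df.index.asi8.tolist(),  # epoch nanoseconds
    'freq': '50L',
})

obj = msgpack.unpackb(packed)
idx = pd.to_datetime(obj['index_ns'])  # ints interpreted as nanoseconds
idx.freq = obj['freq']
restored = pd.DataFrame(np.frombuffer(obj['values']).reshape(obj['shape']),
                        columns=obj['columns'], index=idx)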

@ghost

ghost commented Apr 4, 2013

In the general case, concat can join any two index types, or PeriodIndexes of different freqs, or PeriodIndexes of identical freq but with an arbitrary gap, so what you're describing is very much a special case. If the resample time is hurting you in terms of online performance (it doesn't sound like it), you can always keep the time signatures as a column rather than an index, and perform the resample offline.

Another angle would rely on the fact that although resample and date_range are both O(N), date_range (as you would expect) has much nicer constants, so conceivably you could optimize by constructing a new DataFrame from the values ndarray and a new index constructed directly from the timestamps of the initial and last sample. That's if you have no timing variation, no dropped samples, and really need to cut down processing time any way you can; see the sketch below.
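
A sketch of that second approach (fast_append is a made-up helper; it assumes identical freq, no gaps, and no dropped samples):

import numpy as np
import pandas as pd

def fast_append(df, new_df):
    # Stack the raw values and rebuild the index with a single
    # date_range call instead of concat + resample.
    values = np.vstack([df.values, new_df.values])
    index = pd.date_range(start=df.index[0], periods=len(values),
                          freq=df.index.freq)
    return pd.DataFrame(values, index=index, columns=df.columns)

dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)
out = fast_append(df[:50], df[50:])
print(out.index.freq)  # <50 * Millis> -- preserved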

@nehalecky
Contributor Author

Hey @y-p, thanks for the tips.

Agreed, it's a very special case, and right now it isn't a major performance issue; however, when we begin to scale, it could be. In the meantime, I'll try to implement your suggestions and keep you posted on how this performs when things begin to get bigger. :)

Thanks again.

@WillAyd
Member

WillAyd commented Jul 6, 2018

Still an issue on 0.23.2

@WillAyd WillAyd modified the milestones: Someday, Contributions Welcome Jul 6, 2018
@WillAyd WillAyd added the Datetime and Timezones labels and removed the Timezones label Jul 6, 2018
@JoElfner
Contributor

Still an issue in 0.23.4.
My current workaround is:
df.index.freq = df.index.inferred_freq
If the index and frequency were continuous before the concat, this always seems to work.
Compared with other methods like resampling or reindexing, this is several orders of magnitude faster:

df_full.shape
# Out: (308160, 78)
%timeit df_full.index.freq = df_full.index.inferred_freq
# Out: 73.7 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df_full.resample('1T').mean()
# Out: 947 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_full.reindex(pd.date_range(start=df_full.index[0], end=df_full.index[-1], freq='1T'))
# Out: 117 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So setting the frequency is about 13k times faster than resample and about 1.6k times faster than reindex. If it is not known whether the indices are contiguous, I'd thus go with reindex. Any opinions/advice on this?
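
One way to combine the two, as a sketch (restore_freq is a hypothetical helper; freq is the expected frequency, e.g. '1T'):

import pandas as pd

def restore_freq(df, freq):
    inferred = df.index.inferred_freq
    if inferred is not None:
        # Cheap path: the index is already contiguous, just set the attribute.
        df.index.freq = inferred
        return df
    # Gaps or irregularities: fall back to an explicit reindex.
    full = pd.date_range(start=df.index[0], end=df.index[-1], freq=freq)
    return df.reindex(full)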

@jbrockmendel
Member

@TomAugspurger did the concat_same_type implementation end up checking if freq could be preserved?

@TomAugspurger
Contributor

Nope, but it would certainly make a good standalone PR.
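
For anyone picking this up, a sketch of the kind of contiguity check involved, written as user-level code rather than the actual concat_same_type internals:

import pandas as pd

def concat_preserving_freq(pieces):
    result = pd.concat(pieces)
    freq = pieces[0].index.freq
    if freq is None:
        return result
    prev_end = None
    for piece in pieces:
        idx = piece.index
        # Every piece must share the freq and start exactly one step
        # after the previous piece ended.
        if idx.freq != freq or (prev_end is not None
                                and idx[0] != prev_end + freq):
            return result  # not contiguous; leave freq unset
        prev_end = idx[-1]
    result.index.freq = freq
    return result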

@nehalecky
Contributor Author

Wow! Thanks @jreback and @mroeschke and everyone here!
