
pd.concat loses frequency attribute for 'continuous' DataFrame appends #3232

Closed
nehalecky opened this issue Apr 1, 2013 · 10 comments · Fixed by #25796
Labels
Bug · Datetime · Reshaping

Comments

@nehalecky
Contributor

Hey all,

I have a DataFrame (df) that stores live sensor data captured at a specific frequency. New raw data from the sensor arrives at a set interval (an attempt at bandwidth conservation) and is parsed into a new df object.

These new update DataFrames have the same frequency and contain data that is 'continuous' in time (i.e., they pick up right where the last timestamp left off). Ultimately, I would like to append this new data to the existing DataFrame while preserving the main DataFrame's frequency attribute. I tried a concat of the old and new DataFrames; however, it seems that concat doesn't check for this continuous time series case, and the result loses its frequency attribute. This can be reproduced with the code below:

import pandas as pd
import numpy as np
dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: 50L, Timezone: UTC

These guys look good:

# Preserves frequency
print(df[:50].index)
print(df[50:].index)
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:02.450000]
Length: 50, Freq: 50L, Timezone: UTC
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:02.500000, ..., 2013-01-01 00:00:04.950000]
Length: 50, Freq: 50L, Timezone: UTC

However, these guys, together, forget where they came from:

# Loses frequency
pd.concat([df[:50], df[50:]]).index
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 00:00:04.950000]
Length: 100, Freq: None, Timezone: UTC

I currently get around this by resampling the resulting df to set the frequency, which isn't that big of a deal; however, I thought I'd mention it so that more elegant behavior could be implemented. A quick sketch of the workaround follows. I'll try to take a look when I have time, but I know that all of you here are so much more familiar with pandas internals. Any pointers?
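
For reference, a minimal sketch of that resample workaround (written with current pandas syntax; the '50L' frequency matches the example above):

import pandas as pd
import numpy as np

dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)

combined = pd.concat([df[:50], df[50:]])
print(combined.index.freq)  # None -- freq is lost

# Resample at the known frequency; the data already sit on the 50ms
# grid, so this is effectively a no-op that restores the attribute.
restored = combined.resample('50L').mean()
print(restored.index.freq)  # <50 * Millis>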
And, as always, thank you! :)

@jreback
Contributor

jreback commented Apr 2, 2013

@nehalecky as an aside, this is a good case for appending your data with an HDFStore; you then get query semantics (and it's saved, too!): http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables (and some more advanced examples: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#cookbook-hdf)
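
A minimal sketch of that append-and-query pattern (the file name and key here are placeholders):

import pandas as pd
import numpy as np

with pd.HDFStore('sensor.h5') as store:
    dr = pd.date_range('01-Jan-2013', periods=50, freq='50L', tz='UTC')
    chunk = pd.DataFrame(np.random.randn(50, 2), index=dr)
    store.append('sensor', chunk)  # creates a table on the first call
    # later updates simply append: store.append('sensor', next_chunk)

    # and the store is queryable:
    subset = store.select('sensor', where='index >= "2013-01-01"')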

@nehalecky
Contributor Author

Thanks @jreback for the note. I do use HDFStore when I persist data to my local machine, and it works great for that; however, the application I'm referring to above persists data to a remote store, which will eventually have to scale horizontally. For both of those reasons, HDF5 isn't an option, unfortunately. :(

Actually, for preprocessing analysis, I am storing the raw record data as heavily compressed HDF5 binary in the db now (via pandas HDFStore). This allows me to retrieve individual records and load them directly into a DataFrame, tying directly into my analysis stack, which is nice. I'm really looking forward to whatever solutions are implemented for binary storage of DataFrames (#686 and all), but this is how I'm rolling for now. ;)

BTW, this writeup (http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables) is awesomeness, and your SO answer for pandas workflow is off the charts (http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas), thank you!

@jreback
Contributor

jreback commented Apr 3, 2013

glad to hear the docs are useful!

Here's another resource: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore

Contributions needed!

If you have the time, http://msgpack.org is probably a reasonable format (though it doesn't support compression directly); I think it's a good choice, probably simple to implement, and storable in a db.
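
For what it's worth, a rough sketch of that idea using the msgpack-python package (a hand-rolled format: the field names are made up, and dtype/timezone handling is glossed over):

import msgpack  # pip install msgpack
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['a', 'b'],
                  index=pd.date_range('2013-01-01', periods=4, freq='50L'))

packed = msgpack.packb({
    'values': df.values.tobytes(),       # float64 assumed
    'shape': df.shape,
    'columns': list(df.columns),
    'index_ns': df.index.asi8.tolist(),  # epoch nanoseconds
    'freq': '50L',
})

obj = msgpack.unpackb(packed)
idx = pd.to_datetime(obj['index_ns'])  # ints interpreted as nanoseconds
idx.freq = obj['freq']
restored = pd.DataFrame(np.frombuffer(obj['values']).reshape(obj['shape']),
                        columns=obj['columns'], index=idx)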

@ghost

ghost commented Apr 4, 2013

In the general case, concat can join any two index types, or PeriodIndexes of different freqs, or PeriodIndexes of identical freq but with an arbitrary gap, so what you're describing is very much a special case. If the resample time is hurting you in terms of online performance (it doesn't sound like it), you can always keep the time signatures as a column rather than an index, and perform the resample offline.

Another angle would rely on the fact that although resample and date_range are both O(N), date_range (as you would expect) has much nicer constants, so conceivably you could optimize by constructing a new DataFrame from the values ndarray and a new index constructed directly from the timestamps of the initial and last sample. That's if you have no timing variation, no dropped samples, and really need to cut down processing time any way you can; see the sketch below.
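
A sketch of that second approach (fast_append is a made-up helper; it assumes identical freq, no gaps, and no dropped samples):

import numpy as np
import pandas as pd

def fast_append(df, new_df):
    # Stack the raw values and rebuild the index with a single
    # date_range call instead of concat + resample.
    values = np.vstack([df.values, new_df.values])
    index = pd.date_range(start=df.index[0], periods=len(values),
                          freq=df.index.freq)
    return pd.DataFrame(values, index=index, columns=df.columns)

dr = pd.date_range('01-Jan-2013', periods=100, freq='50L', tz='UTC')
df = pd.DataFrame(np.random.randn(100, 2), index=dr)
out = fast_append(df[:50], df[50:])
print(out.index.freq)  # <50 * Millis> -- preserved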

@nehalecky
Contributor Author

Hey @y-p, thanks for the tips.

Agreed, it's a very special case, and right now it isn't a major performance issue; however, when we begin to scale, it could be. In the meantime, I'll try to implement your suggestions and keep you posted on how this performs when things begin to get bigger. :)

Thanks again.

@WillAyd
Member

WillAyd commented Jul 6, 2018

Still an issue on 0.23.2

@WillAyd WillAyd modified the milestones: Someday, Contributions Welcome Jul 6, 2018
@WillAyd WillAyd added the Datetime and Timezones labels and removed the Timezones label Jul 6, 2018
@JoElfner
Contributor

Still an issue in 0.23.4.
My current workaround is:
df.index.freq = df.index.inferred_freq
If the index and frequency were continuous before the concat, this always seems to work.
Compared with other methods like resampling or reindexing, this is several orders of magnitude faster:

df_full.shape
# Out: (308160, 78)
%timeit df_full.index.freq = df_full.index.inferred_freq
# Out: 73.7 µs ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df_full.resample('1T').mean()
# Out: 947 ms ± 58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_full.reindex(pd.date_range(start=df_full.index[0], end=df_full.index[-1], freq='1T'))
# Out: 117 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So setting the frequency is about 13k times faster than resample and about 1.6k times faster than reindex. If it is not known whether the indices are contiguous, I'd thus go with reindex. Any opinions/advice on this?
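
One way to combine the two, as a sketch (restore_freq is a hypothetical helper; freq is the expected frequency, e.g. '1T'):

import pandas as pd

def restore_freq(df, freq):
    inferred = df.index.inferred_freq
    if inferred is not None:
        # Cheap path: the index is already contiguous, just set the attribute.
        df.index.freq = inferred
        return df
    # Gaps or irregularities: fall back to an explicit reindex.
    full = pd.date_range(start=df.index[0], end=df.index[-1], freq=freq)
    return df.reindex(full)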

@jbrockmendel
Member

@TomAugspurger did the concat_same_type implementation end up checking if freq could be preserved?

@TomAugspurger
Contributor

Nope, but it would certainly make a good standalone PR.
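
For anyone picking this up, a sketch of the kind of contiguity check involved, written as user-level code rather than the actual concat_same_type internals:

import pandas as pd

def concat_preserving_freq(pieces):
    result = pd.concat(pieces)
    freq = pieces[0].index.freq
    if freq is None:
        return result
    prev_end = None
    for piece in pieces:
        idx = piece.index
        # Every piece must share the freq and start exactly one step
        # after the previous piece ended.
        if idx.freq != freq or (prev_end is not None
                                and idx[0] != prev_end + freq):
            return result  # not contiguous; leave freq unset
        prev_end = idx[-1]
    result.index.freq = freq
    return result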

@nehalecky
Contributor Author

Wow! Thanks @jreback and @mroeschke and everyone here!
