pd.groupby(pd.TimeGrouper('W')).groups returns Timestamps instead of Periods #15141

Closed
jimbasquiat opened this issue Jan 16, 2017 · 15 comments
@jimbasquiat

jimbasquiat commented Jan 16, 2017

Let's consider the following DataFrame:

date_range = pd.date_range(dt(2010,1,1), dt(2010,1,31), freq='1D')
df = pd.DataFrame(data = np.random.rand(len(date_range),2), index = date_range)

If I group the datapoints by periods of 1 week and visualize the groups definition, I get:

In [1]: df.groupby(pd.TimeGrouper('W')).groups
Out[1]:
     {Timestamp('2010-01-03 00:00:00', freq='W-SUN'): 3,
     Timestamp('2010-01-10 00:00:00', freq='W-SUN'): 10,
     Timestamp('2010-01-17 00:00:00', freq='W-SUN'): 17,
     Timestamp('2010-01-24 00:00:00', freq='W-SUN'): 24,
     Timestamp('2010-01-31 00:00:00', freq='W-SUN'): 31}
I retrieve the keys of that dictionary:

In [2]: list(df.groupby(pd.TimeGrouper('W')).groups.keys())
Out[2]:
    [Timestamp('2010-01-03 00:00:00', freq='W-SUN'),
     Timestamp('2010-01-10 00:00:00', freq='W-SUN'),
     Timestamp('2010-01-31 00:00:00', freq='W-SUN'),
     Timestamp('2010-01-17 00:00:00', freq='W-SUN'),
     Timestamp('2010-01-24 00:00:00', freq='W-SUN')]

However, I am left with odd objects such as Timestamp('2010-01-24 00:00:00', freq='W-SUN') that are labeled Timestamp but carry a frequency, as Periods do. I think this is not correct. Shouldn't these objects be Periods?

@jorisvandenbossche
Member

You are starting with a DatetimeIndex (an index of Timestamps), so grouping also gives you Timestamps. If you want Periods, you can start with Periods (e.g. using period_range instead of date_range).
Or you can convert those Timestamps to Periods afterwards:

In [82]: df.groupby(pd.TimeGrouper('W')).mean()
Out[82]: 
                   0         1
2010-01-03  0.639762  0.376022
2010-01-10  0.663659  0.543635
2010-01-17  0.286419  0.395908
2010-01-24  0.420077  0.470722
2010-01-31  0.440236  0.473810

In [83]: df.groupby(pd.TimeGrouper('W')).mean().to_period()
Out[83]: 
                              0         1
2009-12-28/2010-01-03  0.639762  0.376022
2010-01-04/2010-01-10  0.663659  0.543635
2010-01-11/2010-01-17  0.286419  0.395908
2010-01-18/2010-01-24  0.420077  0.470722
2010-01-25/2010-01-31  0.440236  0.473810

Note that in this case df.resample('W') instead of df.groupby(pd.TimeGrouper('W')) is actually nicer IMO.
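
A minimal sketch of the two spellings mentioned above, and of the conversion to Periods. It assumes a pandas version where `pd.Grouper(freq='W')` is available as the replacement for `pd.TimeGrouper('W')` (which was later deprecated and removed), and it passes the frequency to `to_period` explicitly rather than relying on one being inferred. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the original post: daily rows for January 2010.
idx = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(np.random.rand(len(idx), 2), index=idx)

# Assumption: pd.Grouper(freq="W") stands in for pd.TimeGrouper("W").
by_group = df.groupby(pd.Grouper(freq="W")).mean()
by_resample = df.resample("W").mean()

# Both spellings should label each weekly bin with the Sunday that closes it.
same_labels = by_group.index.equals(by_resample.index)
same_values = np.allclose(by_group.values, by_resample.values)
print(same_labels, same_values)

# Convert the Timestamp labels to weekly Periods after aggregating.
weekly = by_resample.to_period("W")
print(type(by_resample.index).__name__, "->", type(weekly.index).__name__)
```

The aggregation itself is unchanged by the conversion; only the row labels change from points in time to week-long spans.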

@jimbasquiat
Author

jimbasquiat commented Jan 16, 2017

OK thanks! If I do df.groupby(pd.TimeGrouper('W')).mean().to_period().index I receive PeriodIndex(['2009-12-28/2010-01-03', '2010-01-04/2010-01-10', '2010-01-11/2010-01-17', '2010-01-18/2010-01-24', '2010-01-25/2010-01-31'], dtype='period[W-SUN]', freq='W-SUN'). I'm even more confused now. What does dtype='period[W-SUN]' stand for?

Also, just as a "cosmetic" thing: I am only after the list of periods. I can get it with the solution you suggested: df.groupby(pd.TimeGrouper('W')).mean().to_period().index. But would there be no other way passing by the .groups attribute? That would be more readable and more consistent in terms of coding.

@jorisvandenbossche
Member

> What does dtype='period[W-SUN]'

See http://pandas.pydata.org/pandas-docs/stable/timeseries.html#anchored-offsets: this means 'weekly, but anchored at the Sunday' (because your original timestamps that represented the full week were Sundays). I have to admit that the documentation can certainly be improved here.

> I am only after the list of periods. .. But would there be no other way passing by the .groups attribute?

Apparently, when grouping on a period index, the .groups attribute does not work (this may be a bug). But starting from the timestamp keys, you could do:

In [26]: [p.to_period() for p in df.groupby(pd.TimeGrouper('W')).groups.keys()]
Out[26]: 
[Period('2010-01-04/2010-01-10', 'W-SUN'),
 Period('2010-01-25/2010-01-31', 'W-SUN'),
 Period('2010-01-11/2010-01-17', 'W-SUN'),
 Period('2010-01-18/2010-01-24', 'W-SUN'),
 Period('2009-12-28/2010-01-03', 'W-SUN')]

Although you probably need to think about why you need those, and whether there is a better way to get the same result.
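
A hedged sketch of two ways to get the list of weekly periods. It assumes `pd.Grouper(freq='W')` in place of `pd.TimeGrouper('W')`, and it passes the frequency to `to_period` explicitly so that it does not depend on the Timestamp keys carrying a `freq`. The second route, which skips the groupby entirely, is one possible "better method", not necessarily the one the commenter had in mind.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2010-01-01", "2010-01-31", freq="D")
df = pd.DataFrame(np.random.rand(len(idx), 2), index=idx)

# Route 1: convert each Timestamp key of .groups to a weekly Period.
# sorted() makes the order independent of dict ordering.
keys = df.groupby(pd.Grouper(freq="W")).groups.keys()
periods_from_groups = sorted(ts.to_period("W") for ts in keys)

# Route 2: derive the weekly periods straight from the index, no groupby needed.
periods_from_index = list(df.index.to_period("W").unique())

print(periods_from_groups == periods_from_index)
```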

@jimbasquiat
Author

jimbasquiat commented Jan 17, 2017

> this means 'weekly, but anchored at the Sunday'

This is not my question. My question is: "Why is it stated as a dtype for the timestamps?"

This was my question to begin with. If you look at my first post, you can see the Timestamps are shown as follows: Timestamp('2010-01-03 00:00:00', freq='W-SUN'). Why would a Timestamp have a period as a dtype?

@jreback
Contributor

jreback commented Jan 17, 2017

@jimbasquiat this is as expected.

In [9]: df.groupby(pd.TimeGrouper('W')).mean().to_period().index
Out[9]: PeriodIndex(['2009-12-28/2010-01-03', '2010-01-04/2010-01-10', '2010-01-11/2010-01-17', '2010-01-18/2010-01-24', '2010-01-25/2010-01-31'], dtype='period[W-SUN]', freq='W-SUN')

PeriodIndex has a pandas extension dtype. This indicates the frequency of the periods contained in it.

see the whatsnew here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#period-changes.
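
A small sketch of what the `period[W-SUN]` dtype and the `W-SUN` anchor mean in practice. The dates are taken from the example above; the variable names are illustrative.

```python
import pandas as pd

# Three consecutive weekly periods, each anchored to end on a Sunday.
weeks = pd.period_range("2010-01-04", periods=3, freq="W-SUN")

# The dtype of a PeriodIndex records the frequency of the periods it holds.
dtype_name = str(weeks.dtype)
print(dtype_name)

# Each period is a span: here it runs from a Monday through the following Sunday.
first = weeks[0]
span_days = (first.start_time.day_name(), first.end_time.day_name())
print(span_days)

# A plain "W" frequency is shorthand for "W-SUN".
alias_matches = pd.Period("2010-01-04", freq="W") == first
print(alias_matches)
```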

@jreback jreback closed this as completed Jan 17, 2017
@jreback jreback added the Period Period data type label Jan 17, 2017
@jreback jreback added this to the No action milestone Jan 17, 2017
@jreback
Contributor

jreback commented Jan 17, 2017

Also note #12871; period resampling is in flux and being worked on. Contributions are welcome.

@jimbasquiat
Author

jimbasquiat commented Jan 17, 2017

I am not sure why I have to repeat my question for the 3rd time... I am not talking about a PeriodIndex but a Timestamp object. Do I have to copy-paste my initial post here?

@jreback
Contributor

jreback commented Jan 17, 2017

@jimbasquiat that's correct as well. It's simply a frequency attached to a Timestamp. Not sure what the problem is.

@jimbasquiat
Author

@jreback Is there any documentation somewhere about such a Timestamp object? To me it makes little sense. The dtype for a Timestamp should be datetime64 or one of those "<M8..." types, no? Isn't a Timestamp with a frequency the definition of a Period?

@jorisvandenbossche jorisvandenbossche removed the Period Period data type label Jan 17, 2017
@jorisvandenbossche
Member

Not sure if there is much information on the freq attribute for the individual Timestamp objects, but the frequency of the Timestamps is mainly of importance if you have a regular index of them:

In [8]: pd.date_range("2016-01-01", periods=3, freq='D')
Out[8]: DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03'], dtype='datetime64[ns]', freq='D')

In [9]: pd.date_range("2016-01-01", periods=3, freq='D')[0]
Out[9]: Timestamp('2016-01-01 00:00:00', freq='D')

So accessing a single element here just kind of 'inherits' the frequency of the DatetimeIndex.

The Timestamp is a scalar and has no dtype attribute. But the dtype of an index or series of those values is certainly 'datetime64[ns]'.
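
A minimal sketch of the point that frequency is a property of a regular index. It only inspects the index itself, because whether a scalar Timestamp still reports a `freq` depends on the pandas version. The irregular index is a made-up example.

```python
import pandas as pd

# A regular daily index knows its own frequency.
idx = pd.date_range("2016-01-01", periods=3, freq="D")
print(idx.freqstr)

# The dtype of the index is a datetime64 dtype, whatever the frequency is.
print(idx.dtype)

# An index built from irregularly spaced dates has no frequency to report.
irregular = pd.DatetimeIndex(["2016-01-01", "2016-01-02", "2016-01-05"])
print(irregular.freq)

# Selecting one element gives a scalar Timestamp for that instant.
first = idx[0]
print(first)
```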

@jreback
Contributor

jreback commented Jan 17, 2017

So a Timestamp with a freq is de-facto a Period.

In [2]: t = Timestamp('2010-01-03 00:00:00', freq='W-SUN')

In [3]: t
Out[3]: Timestamp('2010-01-03 00:00:00', freq='W-SUN')

In [4]: t.to_period()
Out[4]: Period('2009-12-28/2010-01-03', 'W-SUN')

In [5]: t.to_period().to_timestamp(how='start')
Out[5]: Timestamp('2009-12-28 00:00:00')

In [6]: t.to_period().to_timestamp(how='end')
Out[6]: Timestamp('2010-01-03 00:00:00')

However, I suspect that this is an implementation detail. In other words, resampling with Periods is still pretty buggy, and these Timestamps are probably just easier to deal with. Further, in prior versions of pandas (well, actually currently as well, though a PR is in the works), Periods are much less performant than Timestamps, so it may be that a freq-attached Timestamp is convenient.

@jimbasquiat I will create an issue specifically to address this. If you want, you could do some legwork to see if it is possible to instead return Periods where we now do Timestamp with a freq.
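
A hedged sketch of the point-versus-span distinction discussed in this comment. It builds the Timestamp without a `freq` argument and passes the frequency to `to_period` explicitly, on the assumption that this form works regardless of whether Timestamps can carry a `freq` in the installed pandas version.

```python
import pandas as pd

# A Timestamp is a single instant; the frequency is supplied at conversion time.
t = pd.Timestamp("2010-01-03")
p = t.to_period("W-SUN")
print(p)

# A Period is a span with a start and an end.
start = p.to_timestamp(how="start")
end = p.to_timestamp(how="end")
# Note: the time of day returned for how="end" has varied between pandas
# versions (midnight of the last day vs. the last instant of that day).
print(start, end.normalize())

# Any instant inside the week maps to the same Period.
midweek_same = pd.Timestamp("2009-12-30 15:00").to_period("W-SUN") == p
print(midweek_same)
```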

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 17, 2017

> to see if it is possible to instead return Periods where we now do Timestamp with a freq

Is that something we actually want? -> OK see your other issue, will comment there

@jreback
Contributor

jreback commented Jan 17, 2017

@jorisvandenbossche I created #15146 for that very discussion.

My answer is: if there are actually no differences, then sure.

@jorisvandenbossche
Member

@jreback yes, updated my comment above that I saw that and commented over there.

@jimbasquiat
Author

Thank you guys for picking it up. In my earliest example with the groupby I think returning a Period would have made more sense. In the example you are giving with the Date Range probably not. I am not sure I can be of help on that topic as I am very new to Pandas (started learning it about 3 weeks ago). But thanks!
