groupby, TimeGrouper error #8542

jgbarah · 2014-10-11T22:16:59Z

The following code seems to produce a wrong result: the resulting groups object does not make sense (well, at least to me):

In [61]: import datetime

In [62]: import pandas as pd

In [63]: df = pd.DataFrame.from_records ( [[datetime.datetime(2014,9,10),1234,"start"],
  [datetime.datetime(2013,10,10),1234,"start"]], columns = ["date", "change", "event"] )

In [64]: df
Out[64]: 
        date  change  event
0 2014-09-10    1234  start
1 2013-10-10    1234  start

In [65]: ts = df.set_index('date')

In [66]: ts
Out[66]: 
            change  event
date                     
2014-09-10    1234  start
2013-10-10    1234  start

In [67]: byperiod = ts.groupby([pd.TimeGrouper(freq="M"), "event"], as_index=False)

In [68]: byperiod.groups
Out[68]: 
{<pandas.tseries.resample.TimeGrouper at 0xab6bcaec>: [Timestamp('2014-09-10 00:00:00')],
 'event': [Timestamp('2013-10-10 00:00:00')]}

I would expect, for Out[68], two groups, one for each (date, event) pair.

Am I wrong, or is this a bug?
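For reference, the expected two-group result can be sketched without TimeGrouper by deriving a monthly period from the index and grouping on it together with the "event" column. This is only an illustrative workaround, assuming monthly periods are an acceptable stand-in for the month-end bins; the names `month` and `group_keys` are hypothetical and not part of the original report.

```python
import datetime

import pandas as pd

df = pd.DataFrame.from_records(
    [
        [datetime.datetime(2014, 9, 10), 1234, "start"],
        [datetime.datetime(2013, 10, 10), 1234, "start"],
    ],
    columns=["date", "change", "event"],
)
ts = df.set_index("date")

# Hypothetical key: one monthly period per row, aligned with ts.index.
month = ts.index.to_period("M")

# Group on the (month, event) pair. The two rows fall in different
# months, so two groups are expected.
byperiod = ts.groupby([month, "event"])
group_keys = sorted(byperiod.groups.keys())
print(byperiod.ngroups)
print(group_keys)
```

Because the first key is an array-like index rather than a grouper object, this form should not depend on the number of rows.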

jreback · 2014-10-11T22:25:44Z

Please show the output of pd.show_versions().

jgbarah · 2014-10-11T22:33:35Z

In [13]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 32
OS: Linux
OS-release: 3.14-2-686-pae
machine: i686
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.4
Cython: None
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.3.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.7
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

jgbarah · 2014-10-11T22:35:30Z

BTW, with three elements in the dataframe, it seems to work:

In [8]: df = pd.DataFrame.from_records ( [[datetime.datetime(2014,9,10),1234,"start"], [datetime.datetime(2014,10,10),1238,"start"], [datetime.datetime(2014,12,10),1564,"start"]], columns = ["date", "change", "event"] )

In [9]: ts = df.set_index('date')

In [10]: byperiod = ts.groupby([pd.TimeGrouper(freq="M"), "event"], as_index=False)

In [11]: byperiod.groups
Out[11]: 
{(Timestamp('2014-09-30 00:00:00'),
  'start'): [Timestamp('2014-09-10 00:00:00')],
 (Timestamp('2014-10-31 00:00:00'),
  'start'): [Timestamp('2014-10-10 00:00:00')],
 (Timestamp('2014-12-31 00:00:00'),
  'start'): [Timestamp('2014-12-10 00:00:00')]}

jreback · 2014-10-11T22:43:49Z

The problem with 2 elements is that the frequency is not defined for the time series, so it may not be behaving properly.

I'll mark this as a bug. Feel free to have a look!

jgbarah · 2014-10-11T22:49:31Z

Thanks! I'm not familiar with the internals of pandas, but I'll try.

victorpoluceno · 2014-11-12T16:56:24Z

Hey @jreback, I found the check in the code that drives this behavior: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2077

match_axis_length will be True because the number of keys equals the length of the axis being grouped (both are 2 here), so this code is executed:

if (not any_callable and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
    keys = [com._asarray_tuplesafe(keys)]

The thing is, even if I remove match_axis_length from the condition above, all the other checks still evaluate to True in my case, so keys = [com._asarray_tuplesafe(keys)] executes anyway.

Do you have any ideas on how to relax/remove the match_axis_length and still keep the other cases?
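To make the condition easier to reason about, here is a simplified pure-Python model of it. It is not pandas code: `FakeTimeGrouper` and `collapses_to_single_key` are hypothetical names, the individual checks are rough approximations of the flags quoted above, and `level` is assumed to be None throughout.

```python
class FakeTimeGrouper:
    """Hypothetical stand-in for a grouper object (not callable, not a column)."""

    def __init__(self, freq):
        self.freq = freq


def collapses_to_single_key(keys, columns, axis_length):
    """Return True when the list of keys would be treated as ONE array-like key.

    Rough model of the quoted condition, assuming level is None.
    """
    any_callable = any(callable(k) for k in keys)
    all_in_columns = all(isinstance(k, str) and k in columns for k in keys)
    any_arraylike = any(isinstance(k, (list, tuple)) for k in keys)
    match_axis_length = len(keys) == axis_length
    return (not any_callable and not all_in_columns
            and not any_arraylike and match_axis_length)


keys = [FakeTimeGrouper("M"), "event"]
columns = ["change", "event"]

# Two keys and two rows: the lengths match, so the keys are collapsed.
two_rows = collapses_to_single_key(keys, columns, axis_length=2)
# Two keys and three rows: the lengths differ, so each key stays separate.
three_rows = collapses_to_single_key(keys, columns, axis_length=3)
print(two_rows, three_rows)
```

Under these assumptions the model reproduces the pattern reported in this thread: the two-row frame takes the collapsing branch, while the three-row frame does not.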

jreback · 2014-11-15T17:09:10Z

I think a way to make this work is to relax the frequency inference engine to only require 2 dates (though this may break lots of other things, not sure).

In [10]: pd.DatetimeIndex(df['date'])
Out[10]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-10, 2013-10-10]
Length: 2, Freq: None, Timezone: None
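The two-date limitation can be checked directly with pd.infer_freq, which refuses to infer a frequency from fewer than three dates. The snippet below is a minimal sketch; the variable names are illustrative, and the three daily dates are made-up sample data rather than anything from the report.

```python
import pandas as pd

two_dates = pd.DatetimeIndex(["2014-09-10", "2013-10-10"])

# With only two dates, inference is rejected outright.
try:
    inferred = pd.infer_freq(two_dates)
    message = ""
except ValueError as exc:
    inferred = None
    message = str(exc)

# With three evenly spaced dates, a frequency can be inferred.
three_dates = pd.DatetimeIndex(["2014-09-10", "2014-09-11", "2014-09-12"])
inferred_three = pd.infer_freq(three_dates)

print(two_dates.freq, inferred, message)
print(inferred_three)
```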

alep · 2015-02-18T19:44:14Z

@victorpoluceno did you figure out a solution?

sinhrks · 2016-04-29T01:51:46Z

This looks to be fixed on current master. I'll add tests and close.

byperiod.groups
# {(Timestamp('2013-10-31 00:00:00'), 'start'): [Timestamp('2013-10-10 00:00:00')],
#  (Timestamp('2014-09-30 00:00:00'), 'start'): [Timestamp('2014-09-10 00:00:00')]}

jreback added Bug Resample resample method labels Oct 11, 2014

jreback added this to the 0.15.1 milestone Oct 11, 2014

jreback mentioned this issue Nov 11, 2014: "Weird behavior on group by TimeGrouper followed by agg" #8789 (closed)

jreback modified the milestones: 0.16.0, 0.15.2 Nov 30, 2014

alep mentioned this issue Feb 18, 2015: "Unexpected output of table pivoting when there is little data; a TimeGrouper object is in the index." #9485 (closed)

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

sinhrks mentioned this issue Apr 29, 2016: "BUG/TST: TimeGrouper has erroneous groups if key length is too short" #13028 (closed)

jreback modified the milestones: 0.18.1, Next Major Release Apr 29, 2016

jreback closed this as completed in b56cea2 Apr 29, 2016
