groupby, TimeGrouper error #8542

Closed
jgbarah opened this issue Oct 11, 2014 · 9 comments
Labels
Bug Resample resample method

@jgbarah

jgbarah commented Oct 11, 2014

The following code seems to produce a wrong result: the resulting groups object does not make sense (at least to me):

In [61]: import datetime

In [62]: import pandas as pd

In [63]: df = pd.DataFrame.from_records ( [[datetime.datetime(2014,9,10),1234,"start"],
  [datetime.datetime(2013,10,10),1234,"start"]], columns = ["date", "change", "event"] )

In [64]: df
Out[64]: 
        date  change  event
0 2014-09-10    1234  start
1 2013-10-10    1234  start

In [65]: ts = df.set_index('date')

In [66]: ts
Out[66]: 
            change  event
date                     
2014-09-10    1234  start
2013-10-10    1234  start

In [67]: byperiod = ts.groupby([pd.TimeGrouper(freq="M"), "event"], as_index=False)

In [68]: byperiod.groups
Out[68]: 
{<pandas.tseries.resample.TimeGrouper at 0xab6bcaec>: [Timestamp('2014-09-10 00:00:00')],
 'event': [Timestamp('2013-10-10 00:00:00')]}

I would expect, for Out[68], two groups, one for each (date, event) pair.

Am I wrong, or is this a bug?

@jreback
Contributor

jreback commented Oct 11, 2014

Please show the output of pd.show_versions().

@jgbarah
Author

jgbarah commented Oct 11, 2014

In [13]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 32
OS: Linux
OS-release: 3.14-2-686-pae
machine: i686
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.4
Cython: None
numpy: 1.9.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.3.0
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.7
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

@jgbarah
Author

jgbarah commented Oct 11, 2014

BTW, with three elements in the dataframe, it seems to work:

In [8]: df = pd.DataFrame.from_records ( [[datetime.datetime(2014,9,10),1234,"start"], [datetime.datetime(2014,10,10),1238,"start"], [datetime.datetime(2014,12,10),1564,"start"]], columns = ["date", "change", "event"] )

In [9]: ts = df.set_index('date')

In [10]: byperiod = ts.groupby([pd.TimeGrouper(freq="M"), "event"], as_index=False)

In [11]: byperiod.groups
Out[11]: 
{(Timestamp('2014-09-30 00:00:00'),
  'start'): [Timestamp('2014-09-10 00:00:00')],
 (Timestamp('2014-10-31 00:00:00'),
  'start'): [Timestamp('2014-10-10 00:00:00')],
 (Timestamp('2014-12-31 00:00:00'),
  'start'): [Timestamp('2014-12-10 00:00:00')]}

@jreback
Contributor

jreback commented Oct 11, 2014

The problem with 2 elements is that the frequency is not defined for the time series, so it may not be behaving properly.

I'll mark this as a bug. Feel free to have a look!

jreback added the Bug and Resample (resample method) labels on Oct 11, 2014
jreback added this to the 0.15.1 milestone on Oct 11, 2014
@jgbarah
Author

jgbarah commented Oct 11, 2014

Thanks! I'm not familiar with the internals of pandas, but I'll try.

@victorpoluceno

Hey @jreback, I found the check in the code that drives this behavior: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2077

match_axis_length will be True because the number of keys equals the length of the axis being grouped (two in both cases), so this code is executed:

if (not any_callable and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
    keys = [com._asarray_tuplesafe(keys)]

The thing is, even if I remove match_axis_length from the above condition, all the other checks still evaluate to True in my case, so keys = [com._asarray_tuplesafe(keys)] executes anyway.

Do you have any ideas on how to relax/remove the match_axis_length and still keep the other cases?
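
To make the condition quoted above concrete, here is a simplified, pure-Python model of it. This is only a sketch, not the pandas implementation: the function `treated_as_single_key`, the placeholder class `FakeTimeGrouper`, and the way each flag is computed are assumptions made for illustration. It suggests why a two-key list on a two-row frame is collapsed into one array-like key, while the same keys on a three-row frame are not.

```python
def treated_as_single_key(keys, columns, axis_length, level=None):
    """Hypothetical model of the heuristic discussed above.

    Returns True when the list of keys would be wrapped into a single
    array-like key instead of being handled as several separate keys.
    """
    # Assumed flag definitions, simplified for illustration.
    any_callable = any(callable(k) or isinstance(k, dict) for k in keys)
    any_arraylike = any(isinstance(k, (list, tuple)) for k in keys)
    all_in_columns = all(isinstance(k, str) and k in columns for k in keys)
    match_axis_length = len(keys) == axis_length
    return (not any_callable and not all_in_columns
            and not any_arraylike and match_axis_length
            and level is None)


class FakeTimeGrouper:
    """Placeholder standing in for a time-based grouper object."""


keys = [FakeTimeGrouper(), "event"]
columns = ["change", "event"]

# Two keys, two rows: the key list looks like one value per row.
two_rows = treated_as_single_key(keys, columns, axis_length=2)
# Two keys, three rows: the lengths differ, so the keys stay separate.
three_rows = treated_as_single_key(keys, columns, axis_length=3)
print(two_rows, three_rows)
```

Under these assumptions, the two-row case from the original report trips the heuristic and the three-row case does not, which matches the observed difference between Out[68] and Out[11].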

@jreback
Contributor

jreback commented Nov 15, 2014

I think a way to make this work is to relax the frequency inference engine to only require 2 dates (though this may break lots of other things, not sure).

In [10]: pd.DatetimeIndex(df['date'])
Out[10]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-10, 2013-10-10]
Length: 2, Freq: None, Timezone: None
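
The two-date limitation can be observed from the public API as well. The sketch below assumes that `pd.infer_freq` is subject to the same minimum-length requirement as the inference mentioned above; it builds the same two-element index and records whether inference is refused.

```python
import datetime

import pandas as pd

# The same two dates as in the original report.
dates = [datetime.datetime(2014, 9, 10), datetime.datetime(2013, 10, 10)]
index = pd.DatetimeIndex(dates)

# No frequency is attached to an index built from arbitrary dates.
print(index.freq)

# Asking pandas to infer a frequency from only two dates is rejected.
try:
    inferred = pd.infer_freq(index)
    error_message = None
except ValueError as exc:
    inferred = None
    error_message = str(exc)
print(error_message)
```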

@alep

alep commented Feb 18, 2015

@victorpoluceno did you figure out a solution?

jreback modified the milestones: 0.16.0, Next Major Release on Mar 6, 2015
@sinhrks
Member

sinhrks commented Apr 29, 2016

This looks to be fixed on current master. I'll add tests and close.

byperiod.groups
# {(Timestamp('2013-10-31 00:00:00'), 'start'): [Timestamp('2013-10-10 00:00:00')],
#  (Timestamp('2014-09-30 00:00:00'), 'start'): [Timestamp('2014-09-10 00:00:00')]}
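
For readers on a recent pandas, where `pd.TimeGrouper` no longer exists, a check along the following lines could reproduce the fixed behavior. It is a sketch under assumptions: `pd.Grouper` replaces `TimeGrouper`, a `MonthEnd` offset object is passed instead of the "M" alias to sidestep alias changes between versions, and `.size()` is used rather than `.groups` to list the observed groups.

```python
import datetime

import pandas as pd

df = pd.DataFrame.from_records(
    [[datetime.datetime(2014, 9, 10), 1234, "start"],
     [datetime.datetime(2013, 10, 10), 1234, "start"]],
    columns=["date", "change", "event"],
)
ts = df.set_index("date")

# Group by month end (time-based key) and by the "event" column.
month_end = pd.Grouper(freq=pd.offsets.MonthEnd())
sizes = ts.groupby([month_end, "event"]).size()

# Keep only the (period, event) pairs that actually contain rows.
observed_keys = [key for key, count in sizes.items() if count > 0]
print(observed_keys)
```

Each (month end, event) pair from the input should appear as its own group, which is what the original report expected.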

jreback modified the milestones: 0.18.1, Next Major Release on Apr 29, 2016