Weird behavior on group by TimeGrouper followed by agg #8789

victorpoluceno · 2014-11-11T20:42:15Z

I'm having problems with groupby using TimeGrouper followed by agg when the DataFrame has exactly two rows with distinct values.

from datetime import datetime

import pandas as pd
import numpy as np

result = np.empty(3, dtype=[('name', 'object'),
                            ('value', 'int64'),
                            ('date', 'datetime64[us]')])
# doesn't work
result[0] = ('a', 1, datetime.now())
result[1] = ('b', 1, datetime.now())

# this works
#result[0] = ('a', 1, datetime.now())
#result[1] = ('a', 1, datetime.now())

# this also works
#result[0] = ('a', 1, datetime.now())
#result[1] = ('b', 1, datetime.now())
#result[2] = ('c', 1, datetime.now())

df = pd.DataFrame.from_records(result, index='date')
df = df[df['name'].map(lambda x: x is not None)]  # drop rows left unfilled by np.empty

group_list = ['name', pd.TimeGrouper(freq='%sS' % 60)]
grouped = df.groupby(group_list)
result = grouped.agg(['sum', 'count'])
print(result.to_records())

The main idea is to group by name and by date, resampling the dates into 60-second intervals. In this example, the desired output would be something like this:

[('a', Timestamp('2014-11-11 20:22:00'), 1, 1)
 ('b', Timestamp('2014-11-11 20:22:00'), 1, 1)]

But it returns this:

[(<pandas.tseries.resample.TimeGrouper object at 0x2fbea10>, 'b', 1, 1, 1)
 ('name', 'a', 1, 1, 1)]

Sometimes agg ignores data if there is nothing to aggregate on; this doesn't seem right to me.
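For comparison, here is a minimal sketch of the same two-row grouping written with `pd.Grouper`, assuming a recent pandas version in which `pd.Grouper(freq=...)` is used in place of `pd.TimeGrouper`. The fixed timestamps are made up for illustration (the original uses `datetime.now()`), and this is not a claim about how 0.15.1 behaves.

```python
import pandas as pd

# Two rows with distinct names, both inside the same 60-second window.
# The timestamps are illustrative stand-ins for datetime.now().
df = pd.DataFrame(
    {"name": ["a", "b"], "value": [1, 1]},
    index=pd.DatetimeIndex(
        ["2014-11-11 20:22:10", "2014-11-11 20:22:40"], name="date"
    ),
)

# Group by the 'name' column and by 60-second bins of the datetime index.
grouped = df.groupby(["name", pd.Grouper(freq="60s")])
result = grouped["value"].agg(["sum", "count"])
print(result)
```

Each name should end up as its own row keyed by the start of its 60-second bin, which is the shape of the desired output described above.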

show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: None
Cython: 0.21
numpy: 1.9.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: 2.4
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

jreback · 2014-11-11T22:31:13Z

this is a dupe of #8542

the issue is that the freq is not valid (and hence is None) for only 2 samples (you need at least 3 to have a frequency). I think it's possible to relax this, especially when resampling.
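The three-sample requirement can be seen in isolation with `pd.infer_freq`. This is only a sketch of the rule itself, with made-up timestamps and a hypothetical helper `try_infer`; it is not necessarily the exact code path that groupby takes in this bug.

```python
import pandas as pd

two = pd.DatetimeIndex(["2014-11-11 20:22:00", "2014-11-11 20:22:30"])
three = pd.DatetimeIndex(
    ["2014-11-11 20:22:00", "2014-11-11 20:22:30", "2014-11-11 20:23:00"]
)


def try_infer(index):
    """Return the inferred frequency string, or the error text if inference fails."""
    try:
        return pd.infer_freq(index)
    except ValueError as exc:
        return "ValueError: %s" % exc


print(try_infer(two))    # fewer than 3 dates: inference is refused
print(try_infer(three))  # 3 evenly spaced dates: a 30-second frequency
```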

jreback · 2014-11-11T22:31:32Z

pull-requests are welcome to fix!

victorpoluceno · 2014-11-11T22:42:02Z

@jreback thanks. I'm going to give it a try and see if I can fix it.

jreback · 2014-11-11T22:58:45Z

awesome!
lmk if u need help

jreback closed this as completed Nov 11, 2014

jreback added the Bug, Resample, and Duplicate Report labels Nov 11, 2014
