Weird behavior on group by TimeGrouper followed by agg #8789

Closed
victorpoluceno opened this issue Nov 11, 2014 · 4 comments
Labels
Bug, Duplicate Report, Resample

Comments

@victorpoluceno

I'm having a problem with groupby using TimeGrouper followed by agg when the DataFrame has exactly two rows with distinct values.

from datetime import datetime

import pandas as pd
import numpy as np

result = np.empty(3, dtype=[('name', 'object'),
                            ('value', 'int64'),
                            ('date', 'datetime64[us]')])
# doesn't work
result[0] = ('a', 1, datetime.now())
result[1] = ('b', 1, datetime.now())

# this works
#result[0] = ('a', 1, datetime.now())
#result[1] = ('a', 1, datetime.now())

# this also works
#result[0] = ('a', 1, datetime.now())
#result[1] = ('b', 1, datetime.now())
#result[2] = ('c', 1, datetime.now())

df = pd.DataFrame.from_records(result, index='date')
df = df[df['name'].map(lambda x: x is not None)]  # drop the unfilled records (name is None)

group_list = ['name', pd.TimeGrouper(freq='%sS' % 60)]
grouped = df.groupby(group_list)
result = grouped.agg(['sum', 'count'])
print(result.to_records())

The idea is to group by name and by date, resampling the dates into 60-second intervals. In this example the desired output would be something like this:

[('a', Timestamp('2014-11-11 20:22:00'), 1, 1)
 ('b', Timestamp('2014-11-11 20:22:00'), 1, 1)]

But it returns this:

[(<pandas.tseries.resample.TimeGrouper object at 0x2fbea10>, 'b', 1, 1, 1)
 ('name', 'a', 1, 1, 1)]

Sometimes agg ignores data if there is nothing to aggregate on; this doesn't seem right to me.
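
For reference, here is a minimal sketch of the grouping I'm after, written with pd.Grouper instead of pd.TimeGrouper (TimeGrouper was removed in later pandas releases). The fixed timestamps and the name `expected` are made up for illustration, and this assumes a pandas version where the two-row case is handled; it is not a fix for the behavior above.

```python
from datetime import datetime

import pandas as pd

# Two rows with different names that fall into the same 60-second bin.
# The timestamps are illustrative stand-ins for datetime.now().
df = pd.DataFrame(
    {"name": ["a", "b"], "value": [1, 1]},
    index=pd.DatetimeIndex(
        [datetime(2014, 11, 11, 20, 22, 10), datetime(2014, 11, 11, 20, 22, 30)],
        name="date",
    ),
)

# pd.Grouper with no key bins the index; "60s" means 60-second intervals.
grouped = df.groupby(["name", pd.Grouper(freq="60s")])
expected = grouped["value"].agg(["sum", "count"])
print(expected)
```

Each name should end up with a single row keyed by the start of its 60-second bin, which matches the desired output listed above.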

show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.29.2.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: None
Cython: 0.21
numpy: 1.9.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: 2.4
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Nov 11, 2014

this is a dupe of #8542

the issue is that the freq is not valid (and hence is None) when there are only 2 samples (you need at least 3 to have a frequency). I think it's possible to relax this, especially when resampling.
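
A small sketch of the three-sample requirement, using the inferred_freq property of DatetimeIndex. The index values and variable names are illustrative, and this only shows frequency inference itself, not necessarily the exact code path that groupby takes in this report.

```python
import pandas as pd

# Two evenly spaced timestamps: too few points to infer a frequency.
two = pd.DatetimeIndex(["2014-11-11 20:22:00", "2014-11-11 20:23:00"])

# Three evenly spaced timestamps: a one-minute frequency can be inferred.
three = pd.DatetimeIndex(
    ["2014-11-11 20:22:00", "2014-11-11 20:23:00", "2014-11-11 20:24:00"]
)

freq_two = two.inferred_freq      # None: fewer than 3 dates
freq_three = three.inferred_freq  # a minute alias; the exact string varies by pandas version

print(freq_two, freq_three)
```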

@jreback jreback closed this as completed Nov 11, 2014
@jreback jreback added the Bug, Resample, and Duplicate Report labels Nov 11, 2014
@jreback
Contributor

jreback commented Nov 11, 2014

pull-requests are welcome to fix!

@victorpoluceno
Author

@jreback thanks. I'm going to give it a try and see if I can fix it.

@jreback
Contributor

jreback commented Nov 11, 2014

awesome!
lmk if u need help
