Groupby creates empty groups depending on base parameter #25161

Closed
LucaAmerio opened this issue Feb 5, 2019 · 4 comments · Fixed by #26240

Labels
good first issue · Needs Tests (unit test(s) needed to prevent regressions) · Resample (resample method)

@LucaAmerio

Code Sample, a copy-pastable example if possible

import datetime as dt

import numpy as np
import pandas as pd

# Generate the test dataframe
case = 1
if case == 1:
    start = '2018-11-26 16:17:43.510000'  # case 1: yields the empty trailing group on 0.23.4
else:
    start = '2018-11-26 16:17:43.500000'  # case 2: yields the expected single group

rng = pd.date_range(start, periods=10, freq='1S')
df = pd.DataFrame({'a': np.random.randn(len(rng)), 'b': np.random.randn(len(rng))}, index=rng)

# Set the interval and start time of the buckets.
# base is expressed in minutes, as a float.
interval = dt.timedelta(minutes=10)
t0 = df.index[0]
base = t0.minute + (t0.second + t0.microsecond / 1e6) / 60
groups = df.groupby(pd.Grouper(freq=interval, base=base))

print(groups.size())

Problem description

The code above generates either two groups or one, depending on whether the dataframe starts at '2018-11-26 16:17:43.510000' (case 1) or '2018-11-26 16:17:43.500000' (case 2).

The correct output is clearly the one obtained in case 2. Case 1, instead, creates an empty group at the end of the dataframe. This can cause trouble with groupby.apply() if the applied function does not handle empty dataframes.
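
For anyone hitting this on an affected version, one possible workaround is to skip empty buckets before applying a function. The sketch below is not from the report: the data and the `first_to_last` helper are made up for illustration, and the empty bucket here comes from a gap in the data rather than from `base`.

```python
import pandas as pd

# Two rows 25 minutes apart, so the middle 10-minute bucket is empty.
idx = pd.to_datetime(["2018-11-26 16:00:00", "2018-11-26 16:25:00"])
df = pd.DataFrame({"a": [1.0, 2.0]}, index=idx)
groups = df.groupby(pd.Grouper(freq=pd.Timedelta(minutes=10)))

# size() reports empty buckets with a count of 0; filter them out.
sizes = groups.size()
non_empty = sizes[sizes > 0]


def first_to_last(group):
    # Hypothetical helper: raises IndexError if the group has no rows.
    return group["a"].iloc[-1] - group["a"].iloc[0]


# Iterate manually and guard against empty groups instead of using apply().
results = {key: first_to_last(g) for key, g in groups if not g.empty}
print(non_empty)
print(results)
```

The explicit `if not g.empty` guard keeps the helper free of special cases, at the cost of bypassing `groupby.apply()`.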

Actual Output

2018-11-26 16:17:43.510 10
2018-11-26 16:27:43.510 0
dtype: int64

Expected Output

2018-11-26 16:17:43.510 10
dtype: int64

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke
Member

This appears fixed in 0.24.x.

In [10]: print(groups.size())
2018-11-26 16:17:43.510    10
Freq: 10T, dtype: int64

In [11]: pd.__version__
Out[11]: '0.24.1'

I don't think base is tested much, so this would be a nice test to have.
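
A regression test along these lines might look like the sketch below. It is an illustration under stated assumptions, not the test that was eventually merged: it uses `resample()` with the `offset` argument (available from pandas 1.1, where `base` was deprecated) instead of `base`, and the function name is hypothetical.

```python
import pandas as pd


def test_resample_offset_no_empty_trailing_bin():
    # Hypothetical test name; the start time is taken from the report above.
    start = pd.Timestamp("2018-11-26 16:17:43.510000")
    rng = pd.date_range(start, periods=10, freq=pd.Timedelta(seconds=1))
    ser = pd.Series(range(len(rng)), index=rng)

    # Shift the 10-minute bin edges so that one edge falls exactly on `start`.
    offset = start - start.normalize()
    result = ser.resample(pd.Timedelta(minutes=10), offset=offset).size()

    # All ten rows belong to a single bin, with no empty bin after it.
    assert list(result.index) == [start]
    assert result.iloc[0] == 10
    return result


result = test_resample_offset_no_empty_trailing_bin()
```

Asserting on the index rather than only on the counts makes an unexpected extra bin show up directly in the failure message.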

mroeschke added the Resample, good first issue, and Needs Tests labels on Feb 5, 2019
@LucaAmerio
Author

LucaAmerio commented Feb 5, 2019

base is almost undocumented, despite being a very useful option. To be honest, I find the choice of a float representing minutes quite "curious". Couldn't it be a datetime representing the instant at which the grouping starts?
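
For reference, later pandas releases (1.1 and newer) added an `origin` argument to `pd.Grouper` and `resample()` that accepts a timestamp, which is close to what is suggested here. A minimal sketch, assuming such a version is installed; the column values are arbitrary:

```python
import numpy as np
import pandas as pd

start = pd.Timestamp("2018-11-26 16:17:43.510000")
rng = pd.date_range(start, periods=10, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame({"a": np.arange(len(rng))}, index=rng)

# Anchor the 10-minute buckets at the first timestamp
# instead of computing a float `base` in minutes.
grouper = pd.Grouper(freq=pd.Timedelta(minutes=10), origin=df.index[0])
sizes = df.groupby(grouper).size()
print(sizes)
```

Because the bucket edges are derived from a timestamp, no conversion of seconds and microseconds into fractional minutes is needed.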

@mroeschke
Member

I am guessing that choice was made because the intention was to set the "base" relative to the resampling frequency. Feel free to open another issue to discuss the possibility of extending the base argument.

@ihsansecer
Contributor

Hi, could I create a test function for this problem? That would be my first issue.

ihsansecer added a commit to ihsansecer/pandas that referenced this issue Apr 29, 2019
ihsansecer added a commit to ihsansecer/pandas that referenced this issue May 1, 2019
Use resample() which is what Grouper calls, assert index instead of result of size()
ihsansecer added a commit to ihsansecer/pandas that referenced this issue May 2, 2019
jreback added this to the 0.25.0 milestone on May 2, 2019