Skip to content

BUG: DatetimeIndexResamplerGroupby as_index interaction with .agg() #52397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
1112114641 opened this issue Apr 4, 2023 · 10 comments
Open
3 tasks done

BUG: DatetimeIndexResamplerGroupby as_index interaction with .agg() #52397

1112114641 opened this issue Apr 4, 2023 · 10 comments
Assignees
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods
Milestone

Comments

@1112114641
Copy link

1112114641 commented Apr 4, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").min()

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg(min)

# works
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=True).resample("2min").agg({"b":"min"})

# does not work
pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg({"b":"min"})

Issue Description

Using pandas 2.0.0 / python 3.9.13 running the above code yields different behaviour based on whether the as_index=False flag is set in the .groupby() method or not.

The resulting error message ValueError: Length of values (5) does not match length of index (30) is misleading, uninformative, and does not explain the difference in behaviour between .agg({"b":"min"}), and .agg(min). The same holds true for other standard .groupby aggregations (max, first, ...).

Could you please look into this?

Expected Behavior

.agg({"b":"min"}) and .agg(min) should yield the same result, irrespective of as_index=True/False.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:28 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.0
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 63.3.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.9.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.1.1
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@1112114641 1112114641 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 4, 2023
@DeaMariaLeon DeaMariaLeon added Groupby Apply Apply, Aggregate, Transform, Map Timestamp pd.Timestamp and associated methods and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 4, 2023
@szczekulskij
Copy link

Can I pick this up and get two weeks to dig into the codebase? Thanks

@luke396
Copy link
Contributor

luke396 commented Apr 9, 2023

@szczekulskij Just wanted to kindly remind you that commenting on 'take' will automatically assign you to this issue. No pressure though!

@1112114641
Copy link
Author

as an interesting aside, I discovered a new method of defining aggregations, and this code works also:

pd.DataFrame(
    {"a":np.repeat([0,1,2,3,4], 10),"b":range(50,100)}, 
    index=pd.date_range(start=pd.Timestamp.now(),freq="1min",periods=50)
).groupby("a",as_index=False).resample("2min").agg(b=("b","min"))

@szczekulskij
Copy link

take

@szczekulskij
Copy link

szczekulskij commented Apr 18, 2023

To keep this thread updated - I've identified the problem is in pandas/core/groupby/generic.py.
Line 421: if not self.as_index and not_indexed_same: - if this line doesn't result in True, everything works as expected

I'll now work on solving this

@szczekulskij
Copy link

Hey, could I ask for review on the attached PR ?
It definitely solves the issue, but I'm open for feedback - I'm still learning more about the repo

@rhshadrach rhshadrach added Regression Functionality that used to work in a prior pandas version Apply Apply, Aggregate, Transform, Map and removed Apply Apply, Aggregate, Transform, Map labels Apr 21, 2023
@rhshadrach
Copy link
Member

The last example in the OP doesn't raise on 1.5.x, marking as a regression.

@rhshadrach rhshadrach added this to the 2.0.1 milestone Apr 21, 2023
@rhshadrach
Copy link
Member

rhshadrach commented Apr 23, 2023

It's not clear to me what the expected behavior of using as_index=False should be. I'm not finding any tests with groupby(...).resample(...) and as_index=False. In the docs, we say as_index has no impact on filtrations or transformations, but a resample doesn't fall into these categories.

cc @jbrockmendel @mroeschke @topper-123

@rhshadrach
Copy link
Member

A closer look at behavior, it appears to me as_index has no impact to resampler across methods. I think we stick with that for this issue, if we want to change that then we can take it up in a dedicated issue.

@szczekulskij
Copy link

Hey @rhshadrach, sorry for delay - updated now.
Happy to create a seperate ticket to implement a different functionality dependent on whether as_index=False (if I understood correctely, this is what you've discussed in the previous point)

Thanks for feedback and going w. me through this ticket

@datapythonista datapythonista modified the milestones: 2.0.2, 2.0.3 May 26, 2023
@lithomas1 lithomas1 modified the milestones: 2.0.3, 2.0.4 Jun 27, 2023
@lithomas1 lithomas1 removed this from the 2.0.4 milestone Aug 30, 2023
@lithomas1 lithomas1 added this to the 2.1.1 milestone Aug 30, 2023
@lithomas1 lithomas1 modified the milestones: 2.1.1, 2.1.2 Sep 21, 2023
@lithomas1 lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 26, 2023
@jorisvandenbossche jorisvandenbossche modified the milestones: 2.1.3, 2.1.4 Nov 13, 2023
@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@lithomas1 lithomas1 modified the milestones: 2.2.2, 2.2.3 Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Regression Functionality that used to work in a prior pandas version Timestamp pd.Timestamp and associated methods
Projects
None yet
Development

No branches or pull requests

8 participants