Skip to content

BUG: fix regression with SerieGrouper with Timestamp index (#42390) #42391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

philpep
Copy link

@philpep philpep commented Jul 5, 2021

closes #42390

This fixes a regression introduced in c355ed1 where cache is not
initialized with correct state of islider and vslider.

On Timestamp index this trigger a "ValueError Length of values does not match length of index"

return np.sum(series)

grouper = libreduction.SeriesGrouper(obj, agg, labels, 1)
result, counts = grouper.get_result()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls use assert_series_equal

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, you mean assert_equal ? Because result and counts here are not Series but np.array. I modified to use assert_equal instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_numpy_array_equal

pls let the question-asker hit the "resolve conversation" button

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I now use assert_numpy_array_equal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to the point also do a full on test of the original example (the .agg)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to the point also do a full on test of the original example (the .agg)

@philpep can you test the user-facing function here

@@ -275,6 +272,11 @@ cdef class SeriesGrouper(_BaseGrouper):
islider.move(start, end)
vslider.move(start, end)

if cached_index is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't this be done outside the loop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it seems the first call of islider.move() and vslider.move() is needed before initializing the cache (which use slider bufattribute), otherwise I get a series where size of index doesn't match size of values (see the non regression test).

I'm new to pandas code and didn't go in deep in this bug. It might be a better way to fix this, so expert eyes are welcome :) Especially I didn't not yet understand why this only occur with DatetimeIndex using UTC timezone (naive datetime don't trigger the bug).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a closer look. This file is pretty tricky.

return np.sum(series)

grouper = libreduction.SeriesGrouper(obj, agg, labels, 1)
result, counts = grouper.get_result()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to the point also do a full on test of the original example (the .agg)

…v#42390)

This fixes a regression introduced in c355ed1 where cache is not
initialized with correct state of islider and vslider. The first call of
{v,i}slider.move() must be done before initializing the cache.

On Timestamp index this trigger a "ValueError Length of values does not match length of index"

Closes pandas-dev#42390

Signed-off-by: Philippe Pepiot <[email protected]>
@jbrockmendel
Copy link
Member

OK, I think the underlying problem here is that we shouldn't be getting to libreduction at all when we have a dt64tz DTI. I think DatetimeIndex._has_complex_internals should be True for tzaware (possibly all of the datetimelike indexes, not totally sure)

@jreback jreback added Groupby Regression Functionality that used to work in a prior pandas version Datetime Datetime data dtype labels Jul 28, 2021
@jreback jreback added this to the 1.3.2 milestone Jul 28, 2021
@simonjayhawkins
Copy link
Member

OK, I think the underlying problem here is that we shouldn't be getting to libreduction at all when we have a dt64tz DTI. I think DatetimeIndex._has_complex_internals should be True for tzaware (possibly all of the datetimelike indexes, not totally sure)

see #42390 (comment)

@simonjayhawkins simonjayhawkins linked an issue Aug 9, 2021 that may be closed by this pull request
3 tasks
@jreback
Copy link
Contributor

jreback commented Aug 10, 2021

status here?

@simonjayhawkins
Copy link
Member

needs a release note and IIUC @jbrockmendel prefers to fix the underlying issue #42391 (comment)

@jbrockmendel 1.3.2 scheduled for end of this week, can you take a look. (There are several open issues on 1.3.2 so I think will schedule 1.3.3 for 3 weeks after 1.3.2 so no pressure)

@jbrockmendel
Copy link
Member

The example in #42390 is fixed by changing DatetimeTimedeltaMixin._has_complex_internals to return True. This will be a less-invasive alternative to ripping out SeriesBinGrouper which i expect we'll do for 1.4

@simonjayhawkins
Copy link
Member

closing this one as stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Regression on SeriesGrouper using Timestamp index with pandas 1.3.0
4 participants