BUG: agg on groups with different sizes fails with out of bounds IndexError #35275

valkum · 2020-07-14T14:12:20Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

See here as well: https://repl.it/@valkum/WrithingNotablePascal

import numpy as np
import pandas as pd

data = {
  'date': ['2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02'],
  'team': ['client1', 'client1',  'client2', 'client2'],
  'temp': [0.780302, 0.035013, 0.355633, 0.243835],
}
df = pd.DataFrame( data )
df['date'] = pd.to_datetime(df['date'])

df = df.drop(df.index[1])
sampled = df.groupby('team').resample("1D", on='date')
# Raises IndexError
sampled.agg({'temp': np.mean})
# Raises IndexError as well
sampled['temp'].mean()

Problem description

agg fails with IndexError: index 3 is out of bounds for axis 0 with size 3

Note that this works as expected when I do not drop a row after creating the DataFrame, so I assume it is caused by the index.

Expected Output

No fail.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1009-gcp
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

valkum · 2020-07-14T14:20:42Z

It seems that sampled=df.reset_index().groupby('team').resample("1D", on='date') fixes the issue, but I am not sure if this would still be considered a bug.
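For reference, a minimal sketch of this workaround applied to the data from the report. It assumes the old row labels are not needed afterwards, so `drop=True` is passed to discard them, and it uses the string `"1D"` rule and column selection exactly as in the original snippet.

```python
import pandas as pd

data = {
    "date": ["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"],
    "team": ["client1", "client1", "client2", "client2"],
    "temp": [0.780302, 0.035013, 0.355633, 0.243835],
}
df = pd.DataFrame(data)
df["date"] = pd.to_datetime(df["date"])
df = df.drop(df.index[1])  # row labels are now 0, 2, 3

# Rebuild a default 0..n-1 index before grouping and resampling.
# drop=True discards the old labels instead of keeping them as a column.
clean = df.reset_index(drop=True)
result = clean.groupby("team").resample("1D", on="date")["temp"].mean()
print(result)
```

Each (team, day) bucket holds a single row here, so the means are just the original `temp` values.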

AlexKirko · 2020-07-15T06:11:19Z

@valkum Thanks for the bug report!

This is likely related to #33548. I don't think it has anything to do with group sizes, as this code produces the same out of bounds error:

import numpy as np
import pandas as pd

data = {
  'date': ['2000-01-01','2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02'],
  'team': ['client1', 'client1', 'client1',  'client2', 'client2'],
  'temp': [0.780302, 0.780302, 0.035013, 0.355633, 0.243835],
}
df = pd.DataFrame( data )
df['date'] = pd.to_datetime(df['date'])
df = df.drop(df.index[1])

sampled = df.groupby('team').resample("1D", on='date')

# Raises IndexError
sampled.agg({'temp': np.mean})
# Raises IndexError as well
sampled['temp'].mean()

Also sampled.mean() works, it's only sampled['temp'].mean() that breaks.

Seeing as reset_index fixes it, maybe the break in the index causes the bug.

valkum · 2020-07-15T12:11:47Z

Thanks for your reply.

sampled.agg(np.mean) works too; it only breaks when you pass a dict (to cover only specific columns).
Furthermore, your example works for me without an out-of-bounds error, but it produces different results nevertheless. See here

It only happens when you drop a row after the DataFrame is created and, as you pointed out, the index is no longer contiguous.
So it is somehow a bug caused by non-contiguous indices combined with applying an aggregation function to specific columns (either by sampled['temp'].mean() or sampled.agg({'temp': np.mean}))

But I see that it might be related to #33548

AlexKirko · 2020-07-15T13:21:22Z

Interesting. For me my code breaks both on 1.0.5 and on the latest commit of master.

UPDATE: ah, I forgot to drop the second row. @valkum, could you run the updated code to make sure that it breaks, and that we aren't dealing with something super-weird?

AlexKirko · 2020-07-17T07:02:18Z

Investigated this a bit. The object we end up with is of class pandas.core.resample.DatetimeIndexResamplerGroupby, which is a non-transparent descendant of GroupByMixin and DatetimeIndexResampler , and uncovering what exactly is causing bugs when using aggregate functions is non-trivial.

I'll try to track down this bug next week.

AlexKirko · 2020-07-17T07:02:27Z

take

AlexKirko · 2020-07-18T09:28:32Z

Interesting. The bug can be "fixed" by using a deep copy in _apply in _GroupByMixin. We must be forgetting something when creating a shallow copy, which causes _set_grouper to crash. Will keep investigating.

AlexKirko · 2020-07-20T08:25:11Z

Okay, so what happens is that df.index values get used deep down the call stack to draw dates from the DatetimeIndex that the grouping and resampling operations create. This is done through Index.take, and because the DatetimeIndex has only four elements in it, and we are trying to get the element with index 4, we get an IndexError. This is why resetting the index fixes this.
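The mechanism can be shown in isolation, outside the resampler. The sketch below is an illustration only, not the pandas internals: it builds a four-element DatetimeIndex and passes it the row labels that remain after dropping row 1 of a five-row frame (the `labels` values are written out by hand as an assumption matching the example above).

```python
import pandas as pd

# Four dates, addressable by positions 0..3.
dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"])

# Row labels left over after dropping the row labelled 1 from a 5-row frame.
labels = pd.Index([0, 2, 3, 4])

# take() treats its argument as positions, so label 4 is past the end.
try:
    dates.take(labels)
    error = None
except IndexError as exc:
    error = exc
print(type(error).__name__, error)

# With a fresh 0..n-1 index the labels coincide with positions and take() succeeds.
positions = list(range(len(dates)))
taken = dates.take(positions)
```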

The whole process is necessary, because we apply aggregation functions by creating shallow copies of Series objects and applying the functions to them.

Here is a link to the relevant code.

As far as I can tell, we don't need to preserve the original row index before applying aggregation functions to a DatetimeIndexResamplerGroupby, so the obvious way would be to reset the index somewhere down the call stack to be safe. I'll see if I can find a good candidate spot.

valkum · 2020-07-21T12:56:32Z

Thanks for your efforts. I might have found another bug that could be related to this, where agg with a dict as its argument computes something different, but I am not sure. There is a similar issue open, so I posted my PoC there: #27343.

AlexKirko · 2020-07-23T06:53:42Z

Thanks for the info. I'll look deeper into these bugs this weekend. Improperly drawing dates by using DataFrame.index values as positions in the underlying array probably has multiple effects (so it might be causing multiple bugs), but it's difficult to say until we think of a decent way to fix it and implement it.

AlexKirko · 2020-07-31T11:44:33Z

@jreback I'd like to ask for a bit of help from the team with this one. Maybe you can see a way out of this bug or know someone who might be able to help with a groupby resampler issue? I diagnosed the problem, but hit a wall in fixing it.

When we call aggregate functions on a column of a DatetimeIndexResamplerGroupby instance that is resampled on a date column, we end up drawing dates with DatetimeIndex.take, and the values we pass to it are taken from the index of the original DataFrame. This mechanism leads to two things:

1. If the original DataFrame.index is anything except a RangeIndex starting with 0, the thing breaks with an index error. So if we drop an index as OP did, or if the DataFrame is indexed with a DatetimeIndex, as in the example below, nothing works.
2. What we probably want when we apply an aggregate function to a ResamplerGroupby subtype is to get data that's grouped by the groupby columns and then by the resampling frequency of the resampler. What we end up with instead is that for each groupby group the code attempts to resample the data with take and then collapse it into one number with the aggregate function.
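The first condition can be checked up front in user code. The helper below is hypothetical (the name `has_default_index` is not part of pandas) and is only a sketch of the "RangeIndex starting with 0" criterion described above, with the added assumption that the step must be 1.

```python
import pandas as pd


def has_default_index(df: pd.DataFrame) -> bool:
    """Return True if df.index is a RangeIndex counting 0, 1, 2, ..."""
    idx = df.index
    return isinstance(idx, pd.RangeIndex) and idx.start == 0 and idx.step == 1


frame = pd.DataFrame({"value": [1, 2, 3, 4]})
dropped = frame.drop(frame.index[1])  # labels 0, 2, 3
dated = pd.DataFrame({"value": [1]}, index=pd.DatetimeIndex(["2000-01-01"]))

print(has_default_index(frame))    # default index
print(has_default_index(dropped))  # gap in the labels
print(has_default_index(dated))    # datetime labels
```

A frame that fails the check can be passed through `reset_index(drop=True)` before `groupby(...).resample(..., on=...)`, at the cost of losing the original labels.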

The problem with fixing this mess is that the functionality is implemented in the inheritance chain, and I've so far been unable to fix it without breaking the Resampler class in horrible ways.

Here is a minimal case to reproduce the bug:

import pandas as pd

df = pd.DataFrame({'date' : [pd.to_datetime('2000-01-01')], 'group' : [1], 'value': [1]},
                  index=pd.DatetimeIndex(['2000-01-01']))
df.groupby('group').resample('1D', on='date')['value'].mean()

This ends up throwing:

index 946684800000000000 is out of bounds for size 1

Deep down the call stack, we create a DatetimeIndex based on the date column and then we call DatetimeIndex.take on it passing values from df.index.
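The large number in the error message appears to be the DatetimeIndex label read as an integer. A short sketch of that reading, assuming the label is interpreted as nanoseconds since the Unix epoch (which is what `Timestamp.value` reports):

```python
import pandas as pd

label = pd.Timestamp("2000-01-01")
as_nanoseconds = label.value  # nanoseconds since 1970-01-01
print(as_nanoseconds)

# Using that integer as a position in a one-element index is far out of bounds.
dates = pd.DatetimeIndex(["2000-01-01"])
try:
    dates.take([as_nanoseconds])
    raised = False
except IndexError:
    raised = True
```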

I'd appreciate some help with finding a viable approach here.

Below is the full error traceback for this case:

```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
in
1 df = pd.DataFrame({'date' : [pd.to_datetime('2000-01-01')], 'group' : [1], 'value': [1]},
2 index=pd.DatetimeIndex(['2000-01-01']))
----> 3 df.groupby('group').resample('1D', on='date')['value'].mean()

c:\git_contrib\pandas\pandas\pandas\core\resample.py in g(self, _method, *args, **kwargs)
935 def g(self, _method=method, *args, **kwargs):
936 nv.validate_resampler_func(_method, args, kwargs)
--> 937 return self._downsample(_method)
938
939 g.__doc__ = getattr(GroupBy, method).__doc__

c:\git_contrib\pandas\pandas\pandas\core\resample.py in _apply(self, f, grouper, *args, **kwargs)
990 return x.apply(f, *args, **kwargs)
991
--> 992 result = self._groupby.apply(func)
993 return self._wrap_result(result)
994

c:\git_contrib\pandas\pandas\pandas\core\groupby\generic.py in apply(self, func, *args, **kwargs)
224 )
225 def apply(self, func, *args, **kwargs):
--> 226 return super().apply(func, *args, **kwargs)
227
228 @doc(

c:\git_contrib\pandas\pandas\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
857 with option_context("mode.chained_assignment", None):
858 try:
--> 859 result = self._python_apply_general(f, self._selected_obj)
860 except TypeError:
861 # gh-20949

c:\git_contrib\pandas\pandas\pandas\core\groupby\groupby.py in _python_apply_general(self, f, data)
890 data after applying f
891 """
--> 892 keys, values, mutated = self.grouper.apply(f, data, self.axis)
893
894 return self._wrap_applied_output(

c:\git_contrib\pandas\pandas\pandas\core\groupby\ops.py in apply(self, f, data, axis)
211 # group might be modified
212 group_axes = group.axes
--> 213 res = f(group)
214 if not _is_indexed_like(res, group_axes):
215 mutated = True

c:\git_contrib\pandas\pandas\pandas\core\resample.py in func(x)
983
984 def func(x):
--> 985 x = self._shallow_copy(x, groupby=self.groupby)
986
987 if isinstance(f, str):

c:\git_contrib\pandas\pandas\pandas\core\base.py in _shallow_copy(self, obj, **kwargs)
587 if attr not in kwargs:
588 kwargs[attr] = getattr(self, attr)
--> 589 return self._constructor(obj, **kwargs)
590
591

c:\git_contrib\pandas\pandas\pandas\core\resample.py in __init__(self, obj, groupby, axis, kind, **kwargs)
92
93 if self.groupby is not None:
---> 94 self.groupby._set_grouper(self._convert_obj(obj), sort=True)
95
96 def __str__(self) -> str:

c:\git_contrib\pandas\pandas\pandas\core\groupby\grouper.py in _set_grouper(self, obj, sort)
340 obj, ABCSeries
341 ):
--> 342 ax = self._grouper.take(obj.index)
343 else:
344 if key not in obj._info_axis:

c:\git_contrib\pandas\pandas\pandas\core\indexes\datetimelike.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
189
190 return ExtensionIndex.take(
--> 191 self, indices, axis, allow_fill, fill_value, **kwargs
192 )
193

c:\git_contrib\pandas\pandas\pandas\core\indexes\base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
706 allow_fill=allow_fill,
707 fill_value=fill_value,
--> 708 na_value=self._na_value,
709 )
710 else:

c:\git_contrib\pandas\pandas\pandas\core\indexes\base.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
736 )
737 else:
--> 738 taken = values.take(indices)
739 return taken
740

c:\git_contrib\pandas\pandas\pandas\core\arrays\_mixins.py in take(self, indices, allow_fill, fill_value)
41
42 new_data = take(
---> 43 self._ndarray, indices, allow_fill=allow_fill, fill_value=fill_value,
44 )
45 return self._from_backing_data(new_data)

c:\git_contrib\pandas\pandas\pandas\core\algorithms.py in take(arr, indices, axis, allow_fill, fill_value)
1580 else:
1581 # NumPy style
-> 1582 result = arr.take(indices, axis=axis)
1583 return result
1584

IndexError: index 946684800000000000 is out of bounds for size 1
```

jreback · 2020-07-31T14:12:37Z

@AlexKirko I haven't looked closely, but the issue is that you don't want to use .take too early, since that converts indexers (e.g. positions in an index) to the index values themselves

we ideally want to convert only at the very end
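To illustrate the distinction between labels and positional indexers, here is a small sketch. It is not the pandas fix, only an example of translating labels to positions with `Index.get_indexer` before calling `take`; the `group_labels` values are hypothetical.

```python
import pandas as pd

# Row labels of a frame after one row was dropped, and the dates of those rows.
original_index = pd.Index([0, 2, 3, 4])
dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"])

# Labels of the rows that belong to one group (hypothetical selection).
group_labels = pd.Index([3, 4])

# Translate labels into positions first, then take by position.
positions = original_index.get_indexer(group_labels)
group_dates = dates.take(positions)
print(positions, list(group_dates))
```

Passing `group_labels` straight to `take` would instead ask for positions 3 and 4, and position 4 does not exist in a four-element index.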

AlexKirko · 2020-07-31T18:31:03Z

Makes sense, thanks. I'll try and look at the differences between calling aggregate functions on a ResamplerGroupby without selecting a column (which works) and with it (which ends up passing original DataFrame index values to take and breaks). Maybe that will help.

FilipeTeixeira-TomTom · 2021-05-03T15:15:25Z

Another example of this happening:

df = pd.DataFrame({
    'a': range(10),
    'time': pd.date_range('2020-01-01', '2020-01-10', freq='D')
})

Using both groupby and resample:

df.iloc[range(0, 10, 2)].groupby('a').resample('D', on='time')['a'].mean()

It fails with an IndexError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 968, in g
    return self._downsample(_method)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 1024, in _apply
    result = self._groupby.apply(func)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 221, in apply
    return super().apply(func, *args, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 894, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 928, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 238, in apply
    res = f(group)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 1017, in func
    x = self._shallow_copy(x, groupby=self.groupby)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/base.py", line 31, in _shallow_copy
    return self._constructor(obj, **kwargs)
  File ".../lib/python3.8/site-packages/pandas/core/resample.py", line 103, in __init__
    self.groupby._set_grouper(self._convert_obj(obj), sort=True)
  File ".../lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 362, in _set_grouper
    ax = self._grouper.take(obj.index)
  File ".../lib/python3.8/site-packages/pandas/core/indexes/datetimelike.py", line 208, in take
    result = NDArrayBackedExtensionIndex.take(
  File ".../lib/python3.8/site-packages/pandas/core/indexes/base.py", line 751, in take
    taken = algos.take(
  File ".../lib/python3.8/site-packages/pandas/core/algorithms.py", line 1657, in take
    result = arr.take(indices, axis=axis)
  File ".../lib/python3.8/site-packages/pandas/core/arrays/_mixins.py", line 71, in take
    new_data = take(
  File ".../lib/python3.8/site-packages/pandas/core/algorithms.py", line 1657, in take
    result = arr.take(indices, axis=axis)
IndexError: index 6 is out of bounds for axis 0 with size 5

Resetting the index before grouping gives the correct result:

df.iloc[range(0, 10, 2)].reset_index().groupby('a').resample('D', on='time')['a'].mean()

a  time      
0  2020-01-01    0
2  2020-01-03    2
4  2020-01-05    4
6  2020-01-07    6
8  2020-01-09    8
Name: a, dtype: int64

valkum added the Bug and Needs Triage labels Jul 14, 2020

AlexKirko added the Groupby and Resample labels and removed the Needs Triage label Jul 15, 2020

github-actions bot assigned AlexKirko Jul 17, 2020

AlexKirko mentioned this issue Jul 18, 2020: BUG: Combination of groupby.resample.interpolate() fails #35325 (Closed, 3 tasks)

valkum mentioned this issue Jul 21, 2020: DataFrameGroupby.resample with the on keyword does not produce the same output as on DateTimeIndex #27343 (Open)

AlexKirko removed their assignment Jul 31, 2020

phofl mentioned this issue Sep 7, 2020: [BUG]: Groupy and Resample miscalculated aggregation #36198 (Closed, 7 tasks)

jreback mentioned this issue Nov 19, 2020: API/BUG: DatetimeIndex.argsort does not match DatetimeArray.argsort #37863 (Closed)

mroeschke added the Apply label Aug 8, 2021

Vicent-Ribas mentioned this issue Aug 31, 2023: Failed rasterization brycefrank/pyfor#79 (Open)

knowecho mentioned this issue Jul 30, 2024: BUG: groupby then resample on column gives incorrect results if the index is out of order #59350 (Open, 3 tasks)
