BUG: agg on groups with different sizes fails with out of bounds IndexError #35275
Comments
It seems that |
@valkum Thanks for the bug report! This is likely related to #33548. I don't think it has anything to do with group sizes, as this code produces the same out-of-bounds error:
import numpy as np
import pandas as pd

data = {
    'date': ['2000-01-01', '2000-01-01', '2000-01-02', '2000-01-01', '2000-01-02'],
    'team': ['client1', 'client1', 'client1', 'client2', 'client2'],
    'temp': [0.780302, 0.780302, 0.035013, 0.355633, 0.243835],
}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df = df.drop(df.index[1])
sampled = df.groupby('team').resample("1D", on='date')
# Returns IndexError
sampled.agg({'temp': np.mean})
# Returns IndexError as well
sampled['temp'].mean()
Also Seeing as |
Thanks for your reply.
It only happens when you drop a row after the DataFrame is created and, as you pointed out, the index is not continuous anymore. But I see that it might be related to #33548.
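To illustrate why a non-contiguous index can matter here, the sketch below drops a row from a small frame and compares the surviving row labels with the valid positional range. It only shows the label/position mismatch in isolation; it is not the code path pandas takes internally, and the single-column frame is a simplified stand-in for the example above.

```python
import pandas as pd

# Simplified stand-in for the frame in the report: five rows, default RangeIndex.
df = pd.DataFrame({"temp": [0.780302, 0.780302, 0.035013, 0.355633, 0.243835]})
df = df.drop(df.index[1])

labels = df.index.tolist()  # row labels that survive the drop
n_rows = len(df)            # valid positions are 0 .. n_rows - 1

# Using the largest surviving label as if it were a position fails,
# because that label is no longer smaller than the number of rows.
try:
    df.iloc[labels[-1]]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(labels, n_rows, out_of_bounds)
```

With an untouched default index, labels and positions coincide, which would be consistent with the report that the error only shows up after a row is dropped.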
Interesting. For me my code breaks both on
UPDATE: ah, forgot to drop the second row. @valkum, could you run the updated code to make sure that it breaks, and that we aren't dealing with something super-weird?
Investigated this a bit. The object we end up with is of class I'll try to track down this bug next week. |
take |
Interesting. The bug can be "fixed" by using a deep copy in |
Okay, so what happens is that The whole process is necessary, because we apply aggregation functions by creating shallow copies of Here is a link to the relevant code. As far as I can tell, we don't need to preserve the original row index before applying aggregation functions to a |
Thanks for your efforts. I might have found another bug, possibly related to this one, where agg with a dict as its argument computes something different, but I am not sure. There is a similar issue open, so I posted my PoC there: #27343.
Thanks for the info. I'll look deeper into these bugs this weekend. The improper sampling of |
@jreback I'd like to ask for a bit of help from the team with this one. Maybe you can see a way out of this bug or know someone who might be able to help with a groupby resampler issue? I diagnosed the problem, but hit a wall in fixing it. When we call aggregate functions on a column of a
The problem with fixing this mess is that the functionality is implemented in the inheritance chain, and I've so far been unable to fix it without breaking the
Here is a minimal case to reproduce the bug:
import pandas as pd
df = pd.DataFrame({'date': [pd.to_datetime('2000-01-01')], 'group': [1], 'value': [1]},
                  index=pd.DatetimeIndex(['2000-01-01']))
df.groupby('group').resample('1D', on='date')['value'].mean()
This ends up throwing:
Deep down the call stack, we create a
I'd appreciate some help with finding a viable approach here. Below is the full error traceback for this case:
```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
in
      1 df = pd.DataFrame({'date' : [pd.to_datetime('2000-01-01')], 'group' : [1], 'value': [1]},
      2                   index=pd.DatetimeIndex(['2000-01-01']))
----> 3 df.groupby('group').resample('1D', on='date')['value'].mean()
c:\git_contrib\pandas\pandas\pandas\core\resample.py in g(self, _method, *args, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\resample.py in _apply(self, f, grouper, *args, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\groupby\generic.py in apply(self, func, *args, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\groupby\groupby.py in _python_apply_general(self, f, data)
c:\git_contrib\pandas\pandas\pandas\core\groupby\ops.py in apply(self, f, data, axis)
c:\git_contrib\pandas\pandas\pandas\core\resample.py in func(x)
c:\git_contrib\pandas\pandas\pandas\core\base.py in _shallow_copy(self, obj, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\resample.py in __init__(self, obj, groupby, axis, kind, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\groupby\grouper.py in _set_grouper(self, obj, sort)
c:\git_contrib\pandas\pandas\pandas\core\indexes\datetimelike.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\indexes\base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
c:\git_contrib\pandas\pandas\pandas\core\indexes\base.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
c:\git_contrib\pandas\pandas\pandas\core\arrays\_mixins.py in take(self, indices, allow_fill, fill_value)
c:\git_contrib\pandas\pandas\pandas\core\algorithms.py in take(arr, indices, axis, allow_fill, fill_value)
IndexError: index 946684800000000000 is out of bounds for size 1
```
@AlexKirko I haven't looked closely, but the issue is that you don't want to use .take too early: that converts indexers (e.g. positions in an index) to the index values themselves. Ideally we want to convert only at the very end.
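The number in the error message from the minimal case is consistent with this reading. As a small sketch (an illustration of the symptom, not of the internal code path), the integer 946684800000000000 is the nanosecond representation of 2000-01-01, the only label in that frame's DatetimeIndex. Passing it to `take`, which expects positions, reproduces an out-of-bounds IndexError:

```python
import pandas as pd

idx = pd.DatetimeIndex(["2000-01-01"])

# The label, expressed as nanoseconds since the Unix epoch.
label_as_int = idx[0].value

# take() expects positions (here only 0 is valid), so feeding it the
# integer form of a label is out of bounds for a one-element index.
try:
    idx.take([label_as_int])
    raised = False
except IndexError:
    raised = True

# A genuine position works as expected.
first = idx.take([0])[0]

print(label_as_int, raised, first)
```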
Makes sense, thanks. I'll try and look at the differences between calling aggregate functions on a |
Another example of this happening:
df = pd.DataFrame({
    'a': range(10),
    'time': pd.date_range('2020-01-01', '2020-01-10', freq='D')
})
Using both groupby and resample:
df.iloc[range(0, 10, 2)].groupby('a').resample('D', on='time')['a'].mean()
It fails with an IndexError:
Resetting the index before grouping gives the correct result:
df.iloc[range(0, 10, 2)].reset_index().groupby('a').resample('D', on='time')['a'].mean()
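The same workaround can be sketched against the team/temp data posted earlier in the thread. This assumes that restoring a contiguous default index is enough to sidestep the error, as reported in the comment above; `drop=True` is an addition here so that the old labels are not kept as an extra column.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"]
    ),
    "team": ["client1", "client1", "client1", "client2", "client2"],
    "temp": [0.780302, 0.780302, 0.035013, 0.355633, 0.243835],
})
df = df.drop(df.index[1])  # leaves the labels 0, 2, 3, 4

# Workaround: restore a contiguous 0..n-1 index before grouping.
clean = df.reset_index(drop=True)

# One mean per (team, day) pair.
result = clean.groupby("team").resample("1D", on="date")["temp"].mean()
print(result)
```

This only avoids the trigger and does not address the underlying bug, so it is best treated as a stopgap on affected versions.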
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
See here as well: https://repl.it/@valkum/WrithingNotablePascal
Problem description
agg fails with
IndexError: index 3 is out of bounds for axis 0 with size 3
Note that this works as expected when I do not drop a row after creating the DataFrame, so I assume it is caused by the index.
Expected Output
No fail.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1009-gcp
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None