KeyError: 0 error on groupby apply #30731

Closed
venatir opened this issue Jan 6, 2020 · 4 comments

@venatir

venatir commented Jan 6, 2020

Code Sample, a copy-pastable example if possible

import datetime
import pandas as pd

def aggfunc(df):
    # Stands in for an operation that relies on df containing the grouping column.
    # After the first failure, apply calls this function again without the grouping key,
    # so an operation that relies on the key would fail on the retry.
    return pd.Series([0.2, 0.2], index=[12, 13])

mydf = pd.DataFrame({"a": [datetime.datetime.today(), datetime.datetime.today()], "b": [1, 2], "c": [5, 6]})

mydf.groupby("a").apply(aggfunc)

Looks like groupby.apply crashes when using datetime aggregation and returning non-datetime data.

The problem is here: pandas.core.groupby.generic._recast_datetimelike_result
/pandas/core/groupby/generic.py:1857

obj_cols = [
    idx for idx in range(len(result.columns)) if is_object_dtype(result.dtypes[idx])
]

For example, my result columns are labelled 12 and 13, but this code iterates over the positions 0 and 1 produced by range and uses them as labels in result.dtypes[idx].

The code at /pandas/core/groupby/generic.py:1857 fails on the example above, and the exception is caught at pandas/core/groupby/groupby.py:726. Because of gh-20949, apply then tries again without the grouping key. The call should have worked from the beginning, and that exception handler is not there to catch this kind of error.

The workaround is to return a Series or DataFrame with the index reset; however, this should not be a requirement.
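
A minimal sketch of that workaround, assuming the labels 12 and 13 can be restored after apply returns. The names aggfunc_workaround and result are illustrative, and fixed timestamps replace datetime.datetime.today() so that the two groups are guaranteed to be distinct. On pandas versions without the bug the reset is harmless but unnecessary.

```python
import datetime
import pandas as pd

def aggfunc_workaround(group):
    # Hypothetical variant of aggfunc: same values, but the integer labels
    # 12 and 13 are replaced by a default 0..n-1 index before returning.
    out = pd.Series([0.2, 0.2], index=[12, 13])
    return out.reset_index(drop=True)

mydf = pd.DataFrame({
    "a": [datetime.datetime(2020, 1, 6, 9, 0), datetime.datetime(2020, 1, 6, 10, 0)],
    "b": [1, 2],
    "c": [5, 6],
})

result = mydf.groupby("a").apply(aggfunc_workaround)
# Restore the intended column labels once apply has returned.
result.columns = [12, 13]
```

The cost is that every caller has to remember the relabelling step, which is why the report treats this as a stopgap rather than a fix.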

The right way is to not use range in the _recast_datetimelike_result function.
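
To illustrate the point, here is a small sketch of why a positional integer fails as a key on result.dtypes when the columns carry integer labels, and one possible way around it using .iloc. This is an assumption about what a fix could look like, not the change pandas actually made; the frame named result is a stand-in built by hand.

```python
import pandas as pd
from pandas.api.types import is_object_dtype

# Stand-in for the apply result: integer column labels 12 and 13,
# with the second column explicitly stored as object dtype.
result = pd.DataFrame({
    12: [0.2, 0.2],
    13: pd.Series(["x", "y"], dtype=object),
})

# result.dtypes is a Series indexed by the column labels, so the integer 0
# is looked up as a label, and no column is labelled 0.
try:
    result.dtypes[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional access with .iloc keeps the range-based loop but avoids the label lookup.
obj_cols = [
    idx for idx in range(len(result.columns))
    if is_object_dtype(result.dtypes.iloc[idx])
]
```

Here label_lookup_failed ends up True and obj_cols holds the position of the object column, independent of how the columns are labelled.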

Thank you

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.2.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 0.25.1
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.1.1
setuptools       : 42.0.2.post20191201
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : 2.6.9
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.3.2
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

@jreback
Contributor

jreback commented Jan 6, 2020

pls reformat the top to only include a minimal reproducible example

your aggfunc is not defined

if you have commentary in the causes then put in another comment or clearly delineate this from the top

if it requires everyone reading the entire top section to grok then the likelihood of a response will be greatly decreased

@jorisvandenbossche
Member

@venatir your code snippet works for me on pandas 0.25.3:

In [8]: import datetime

In [9]: def aggfunc(df):
   ...:   # operation that rely on df having the grouping column present.
   ...:   # Goes in again here without the grouping key and if my operation would rely on this, it would fail.
   ...:   return pd.Series([0.2,0.2], index=[12,13])
   ...:
   ...: mydf=pd.DataFrame({"a":[datetime.datetime.today(),datetime.datetime.today()],"b":[1,2],"c":[5,6]})
   ...:
   ...: mydf.groupby("a").apply(aggfunc)
   ...:
Out[9]:
                             12   13
a
2020-01-14 09:30:25.647870  0.2  0.2
2020-01-14 09:30:25.648019  0.2  0.2

In [10]: pd.__version__
Out[10]: '0.25.3'

@MarcoGorelli
Member

Closing as this works on v1.0.1 too, but please feel free to reopen if you have a failing reproducible example

@traubms

traubms commented Jun 26, 2020

The example below reproduces the error in 0.25.3.

The bug occurs when:

  • the function passed to apply returns a Series,
  • the groupby has multiple groups,
  • the input has a datetime-like column somewhere,
  • and the returned Series has an integer index.

import pandas as pd
import datetime

def aggfunc(df):
    return pd.Series([0.2, 0.2], index=[12, 13])

df = pd.DataFrame({
    'a': datetime.datetime.today(),
    'b': [1, 2],
    'c': [5, 6],
})

df.drop(columns='a').groupby('b').apply(aggfunc)  # works as expected
df.groupby('b').apply(aggfunc)  # KeyError: 0
