
BUG: groupby apply with head(1) raises keyerror with datetime grouper #29617

Closed
endremborza opened this issue Nov 14, 2019 · 4 comments · Fixed by #35504
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@endremborza
Contributor

This is a very strange error with a long traceback. I found #15680 and #11324, which mention similar things, but neither seems to cover the behavior here.

Code Sample, a copy-pastable example:

import datetime
import pandas as pd

recs = [{'LIVE': 1,
         'ITEM': '001',
         'DATE': datetime.date(2019, 10, 1)},
        {'LIVE': 2,
         'ITEM': '002',
         'DATE': datetime.date(2019, 10, 2)},
        {'LIVE': 3,
         'ITEM': '003',
         'DATE': datetime.date(2019, 10, 1)}]

pd.DataFrame(recs).groupby(['ITEM', 'DATE']).apply(lambda df: df.head(1))

On 0.25.3 this raises the following convoluted chain of KeyError and ValueError exceptions:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Timestamp('2019-10-01 00:00:00')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    631                 try:
--> 632                     i = level.get_loc(key)
    633                 except KeyError:

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Timestamp('2019-10-01 00:00:00')

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    724             try:
--> 725                 result = self._python_apply_general(f)
    726             except Exception:

~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    744         return self._wrap_applied_output(
--> 745             keys, values, not_indexed_same=mutated or self.mutated
    746         )

~/.local/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _wrap_applied_output(self, keys, values, not_indexed_same)
    371         elif isinstance(v, DataFrame):
--> 372             return self._concat_objects(keys, values, not_indexed_same=not_indexed_same)
    373         elif self.grouper.groupings is not None:

~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
    972                     names=group_names,
--> 973                     sort=False,
    974                 )

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    254         copy=copy,
--> 255         sort=sort,
    256     )

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    427 
--> 428         self.new_axes = self._get_new_axes()
    429 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _get_new_axes(self)
    521 
--> 522         new_axes[self.axis] = self._get_concat_axis()
    523         return new_axes

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _get_concat_axis(self)
    577             concat_axis = _make_concat_multiindex(
--> 578                 indexes, self.keys, self.levels, self.names
    579             )

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    635                         "Key {key!s} not in level {level!s}".format(
--> 636                             key=key, level=level
    637                         )

ValueError: Key 2019-10-01 00:00:00 not in level Index([2019-10-01, 2019-10-02], dtype='object', name='DATE')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Timestamp('2019-10-01 00:00:00')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    631                 try:
--> 632                     i = level.get_loc(key)
    633                 except KeyError:

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Timestamp('2019-10-01 00:00:00')

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-32-cf7f934e7c7f> in <module>
     12          'DATE': datetime.date(2019, 10, 1)}]
     13 
---> 14 pd.DataFrame(recs).groupby(['ITEM', 'DATE']).apply(lambda df: df.head(1))

~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    735 
    736                 with _group_selection_context(self):
--> 737                     return self._python_apply_general(f)
    738 
    739         return result

~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    743 
    744         return self._wrap_applied_output(
--> 745             keys, values, not_indexed_same=mutated or self.mutated
    746         )
    747 

~/.local/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _wrap_applied_output(self, keys, values, not_indexed_same)
    370             return DataFrame()
    371         elif isinstance(v, DataFrame):
--> 372             return self._concat_objects(keys, values, not_indexed_same=not_indexed_same)
    373         elif self.grouper.groupings is not None:
    374             if len(self.grouper.groupings) > 1:

~/.local/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
    971                     levels=group_levels,
    972                     names=group_names,
--> 973                     sort=False,
    974                 )
    975             else:

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    253         verify_integrity=verify_integrity,
    254         copy=copy,
--> 255         sort=sort,
    256     )
    257 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    426         self.copy = copy
    427 
--> 428         self.new_axes = self._get_new_axes()
    429 
    430     def get_result(self):

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _get_new_axes(self)
    520                 new_axes[i] = ax
    521 
--> 522         new_axes[self.axis] = self._get_concat_axis()
    523         return new_axes
    524 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _get_concat_axis(self)
    576         else:
    577             concat_axis = _make_concat_multiindex(
--> 578                 indexes, self.keys, self.levels, self.names
    579             )
    580 

~/.local/lib/python3.7/site-packages/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    634                     raise ValueError(
    635                         "Key {key!s} not in level {level!s}".format(
--> 636                             key=key, level=level
    637                         )
    638                     )

ValueError: Key 2019-10-01 00:00:00 not in level Index([2019-10-01, 2019-10-02], dtype='object', name='DATE')

Oddly, both of these work:

pd.DataFrame(recs).groupby(['ITEM']).apply(lambda df: df.head(1))
pd.DataFrame(recs).groupby(['DATE']).apply(lambda df: df.head(1))

Also, the behavior is the same if I modify recs so that only one distinct date is present.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.2.1-1.el7.elrepo.x86_64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.10
tables : 3.6.1
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@endremborza endremborza changed the title BUG: goupby apply with head raises keyerror with datetime grouper BUG: groupby apply with head(1) raises keyerror with datetime grouper Nov 14, 2019
@hwalinga
Contributor

You seem to be looking for .first(), instead of .apply(lambda df: df.head(1)). .first() will work as intended.

The problem you are facing is that a groupby on two columns results in a MultiIndex, and df.head(1) does not return that MultiIndex.
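A minimal sketch of the suggestion above, reusing the recs from the original report. GroupBy.first() returns one row per (ITEM, DATE) group, indexed by the group keys. It takes the first non-null value of each column, so it can differ from head(1) when a group's first row has missing values. The built-in GroupBy.head(1), which is not mentioned in the thread, is shown as another option that keeps the original row index.

```python
import datetime
import pandas as pd

recs = [
    {'LIVE': 1, 'ITEM': '001', 'DATE': datetime.date(2019, 10, 1)},
    {'LIVE': 2, 'ITEM': '002', 'DATE': datetime.date(2019, 10, 2)},
    {'LIVE': 3, 'ITEM': '003', 'DATE': datetime.date(2019, 10, 1)},
]
df = pd.DataFrame(recs)

# One row per (ITEM, DATE) group, indexed by the group keys.
first_rows = df.groupby(['ITEM', 'DATE']).first()

# First row of each group, keeping the original row index and all columns.
head_rows = df.groupby(['ITEM', 'DATE']).head(1)

print(first_rows)
print(head_rows)
```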

@jbrockmendel jbrockmendel added Apply Apply, Aggregate, Transform, Map Groupby labels Nov 30, 2019
@mroeschke mroeschke added the Bug label Jun 28, 2020
@smithto1
Member

This is fixed in 1.1.0 (1.0.5 still raises the same error). I'm not sure which change fixed it, but I think this issue can be marked as closed, @jreback.

@smithto1
Member

smithto1 commented Aug 1, 2020

@endremborza the essence of this error is that you're using datetime.date for the dates rather than datetime.datetime or pd.Timestamp. The error message indicates it is trying to find a pd.Timestamp in an index of datetime.date and can't find it.

If you run the same code with datetime.datetime it works.

import datetime
import pandas as pd

recs = [{'LIVE': 1,
         'ITEM': '001',
         'DATE': datetime.datetime(2019, 10, 1)},
        {'LIVE': 2,
         'ITEM': '002',
         'DATE': datetime.datetime(2019, 10, 2)},
        {'LIVE': 3,
         'ITEM': '003',
         'DATE': datetime.datetime(2019, 10, 1)}]

pd.DataFrame(recs).groupby(['ITEM', 'DATE']).apply(lambda df: df.head(1))
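If the data already holds datetime.date values, one possible workaround on affected versions is to convert the column with pd.to_datetime before grouping, so that the group keys are Timestamps throughout. This is a sketch under that assumption and was not proposed in the thread. Selecting [['LIVE']] before apply is an illustrative choice, made only so the example behaves the same across pandas versions.

```python
import datetime
import pandas as pd

recs = [
    {'LIVE': 1, 'ITEM': '001', 'DATE': datetime.date(2019, 10, 1)},
    {'LIVE': 2, 'ITEM': '002', 'DATE': datetime.date(2019, 10, 2)},
    {'LIVE': 3, 'ITEM': '003', 'DATE': datetime.date(2019, 10, 1)},
]
df = pd.DataFrame(recs)

# datetime.date values are stored as plain Python objects.
print(df['DATE'].dtype)

# Convert to a datetime64 column so grouping keys and index levels agree.
df['DATE'] = pd.to_datetime(df['DATE'])
print(df['DATE'].dtype)

# Assumption: only the LIVE column is needed in the applied result.
result = df.groupby(['ITEM', 'DATE'])[['LIVE']].apply(lambda g: g.head(1))
print(result)
```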

This bug persisted up to 1.0.5 but is fixed in 1.1.0. Running the original code on 1.1.0 works and keeps the datetime.date values as they are, treating them like any other kind of object without trying to convert them.

I will add a test to make sure that a grouping works with datetime.date and that the bug doesn't arise again.
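A regression check along these lines might look roughly like the sketch below. It is a hypothetical illustration, not the test added in the fixing PR. The function name is made up, the [['LIVE']] selection is an illustrative choice, and it assumes pandas 1.1.0 or later.

```python
import datetime
import pandas as pd


def check_apply_head_with_date_keys():
    """Hypothetical check: grouping on datetime.date keys must not raise."""
    df = pd.DataFrame({
        'LIVE': [1, 2, 3],
        'ITEM': ['001', '002', '003'],
        'DATE': [
            datetime.date(2019, 10, 1),
            datetime.date(2019, 10, 2),
            datetime.date(2019, 10, 1),
        ],
    })
    # On affected versions (0.25.3 through 1.0.5, per this thread) an
    # apply over these date keys is what raised the error.
    result = df.groupby(['ITEM', 'DATE'])[['LIVE']].apply(lambda g: g.head(1))
    # Each (ITEM, DATE) pair is unique, so every row should survive.
    assert result['LIVE'].tolist() == [1, 2, 3]
    return result


result = check_apply_head_with_date_keys()
```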

@smithto1
Member

smithto1 commented Aug 1, 2020

take

@rhshadrach rhshadrach added the Needs Tests Unit test(s) needed to prevent regressions label Aug 1, 2020
@jreback jreback added this to the 1.2 milestone Aug 3, 2020
7 participants