Groupby indices error with datetime categorical #26859

Closed
alexifm opened this issue Jun 14, 2019 · 6 comments · Fixed by #41712
Labels: good first issue, Groupby, Needs Tests

@alexifm

alexifm commented Jun 14, 2019

Code Sample

import pandas as pd

df = pd.DataFrame({
    'a': pd.Series(list('abc')),
    'b': pd.Series(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']), dtype='category'),
    'c': pd.Categorical.from_codes([-1, 0, 1], categories=[0, 1])
})

df.groupby(['a', 'b']).indices

Problem description

This raises a TypeError. The choice of the other column doesn't matter: the error occurs whenever 'b' is grouped together with at least one other column. Grouping on 'b' alone is fine.

>> df.groupby(['a', 'b']).indices
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-c4de90de974e> in <module>
      1 gb = df.groupby(['a', 'b'])
----> 2 gb.indices

/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in indices(self)
    401         """
    402         self._assure_grouper()
--> 403         return self.grouper.indices
    404 
    405     def _get_indices(self, names):

pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/ops.py in indices(self)
    204             keys = [com.values_from_object(ping.group_index)
    205                     for ping in self.groupings]
--> 206             return get_indexer_dict(label_list, keys)
    207 
    208     @property

/opt/conda/lib/python3.6/site-packages/pandas/core/sorting.py in get_indexer_dict(label_list, keys)
    331     group_index = group_index.take(sorter)
    332 
--> 333     return lib.indices_fast(sorter, group_index, keys, sorted_labels)
    334 
    335 

pandas/_libs/lib.pyx in pandas._libs.lib.indices_fast()

TypeError: Cannot convert DatetimeIndex to numpy.ndarray

Expected Output

Not an error.

Cause

If we inspect BaseGrouper.indices, we see that keys gets passed to get_indexer_dict here:

def indices(self):
    """ dict {group name -> group indices} """
    if len(self.groupings) == 1:
        return self.groupings[0].indices
    else:
        label_list = [ping.labels for ping in self.groupings]
        keys = [com.values_from_object(ping.group_index)
                for ping in self.groupings]
        return get_indexer_dict(label_list, keys)

get_indexer_dict eventually passes the elements of keys to get_value_at found here:

cdef inline object get_value_at(ndarray arr, object loc):
    cdef:
        Py_ssize_t i
    i = validate_indexer(arr, loc)
    return arr[i]

The problem is that to build keys, the get_values method is called on each group index (you can see in BaseGrouper.indices how this isn't an issue when there's a single grouper). When grouping on a categorical-datetime column like df['b'], the get_values method on the underlying categorical array is called and within that method this branch of the if statement is triggered, causing a DatetimeIndex to be returned instead of a numpy array.

return self.categories.take(self._codes, fill_value=np.nan)
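
The effect of that branch can be reproduced in isolation. This is a minimal sketch using only public pandas API, not the internal code path itself. The variable names are illustrative, and fill_value is left out on the assumption that no code is -1, which holds for column 'b'.

```python
import numpy as np
import pandas as pd

# Same datetime categorical as column 'b' in the code sample.
cat = pd.Categorical(pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]))

# Taking the categories by code goes through Index.take, so the result
# keeps the type of the categories: a DatetimeIndex rather than an ndarray.
taken = cat.categories.take(cat.codes)

print(type(taken).__name__)
print(isinstance(taken, np.ndarray))
```

An object like taken is what ends up in keys and is then handed to a Cython function that is typed to accept only ndarray.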

Solution

Now, the Categorical.get_values docstring does state that an Index object may be returned instead of a numpy array. The simplest thing is to just introduce a line like this before the call to get_indexer_dict:

keys = [np.array(key) for key in keys]

A pull request for this will be created imminently.
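
As a rough illustration of what that conversion does (a standalone sketch, not the actual patch): keys below is a hand-built stand-in for the list assembled in BaseGrouper.indices, with one plain array and one DatetimeIndex.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"])

# Hypothetical stand-in for the keys list: one entry is already an
# ndarray, the other is the DatetimeIndex that triggers the error.
keys = [np.array(list("abc"), dtype=object), pd.DatetimeIndex(dates)]

# The proposed one-liner: coerce every key to a plain numpy array.
keys = [np.array(key) for key in keys]

for key in keys:
    print(type(key).__name__, key.dtype)
```

After the conversion every entry is an ndarray, and the datetime key keeps a datetime64 dtype.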

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.14.3
scipy: 1.2.1
pyarrow: 0.13.0
xarray: None
IPython: 7.4.0
sphinx: 2.0.0
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None

@topper-123
Contributor

Yeah, that's a bug. I see you've already made a PR, thanks for that.

@mroeschke
Member

Looks like this is fixed on master. Could use a test

In [122]: df = pd.DataFrame({
     ...:     'a': pd.Series(list('abc')),
     ...:     'b': pd.Series(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']), dtype='category'),
     ...:     'c': pd.Categorical.from_codes([-1, 0, 1], categories=[0, 1])
     ...: })
     ...:
     ...: df.groupby(['a', 'b']).indices
Out[122]:
{('a', Timestamp('2018-01-01 00:00:00')): array([0]),
 ('b', Timestamp('2018-02-01 00:00:00')): array([1]),
 ('c', Timestamp('2018-03-01 00:00:00')): array([2])}

In [123]: pd.__version__
Out[123]: '1.1.0.dev0+1974.g0159cba6e'
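
A regression test for this could look roughly like the sketch below. It is not the test that was eventually merged. The helper name check_groupby_indices is hypothetical, and observed=True is an assumption added so the call behaves the same across pandas versions. Keys are normalized with pd.Timestamp before comparing.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Series(list("abc")),
    "b": pd.Series(
        pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]),
        dtype="category",
    ),
    "c": pd.Categorical.from_codes([-1, 0, 1], categories=[0, 1]),
})


def check_groupby_indices(frame):
    """Hypothetical helper: build .indices and compare with the expected mapping."""
    # observed=True is an assumption; only observed key pairs are expected anyway.
    indices = frame.groupby(["a", "b"], observed=True).indices
    # Normalize keys and values so the comparison does not depend on array types.
    normalized = {(a, pd.Timestamp(b)): idx.tolist() for (a, b), idx in indices.items()}
    expected = {
        ("a", pd.Timestamp("2018-01-01")): [0],
        ("b", pd.Timestamp("2018-02-01")): [1],
        ("c", pd.Timestamp("2018-03-01")): [2],
    }
    assert normalized == expected
    return normalized


result = check_groupby_indices(df)
```

The expected mapping mirrors the Out[122] result above.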

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Groupby labels Jun 28, 2020
@mahaoyu

mahaoyu commented Jul 12, 2020

take

@zeina99

zeina99 commented Aug 8, 2020

@mahaoyu are you still working on it?

@FivelMttz

Is anybody still working on it?

@alexifm
Author

alexifm commented Dec 3, 2020

@FivelMttz updated my old tests and got the PR going.

@jreback jreback added this to the 1.2 milestone Dec 4, 2020
@jreback jreback added the Groupby label Dec 4, 2020
@jreback jreback modified the milestones: 1.2, Contributions Welcome Dec 4, 2020
@mroeschke mroeschke mentioned this issue May 29, 2021
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 29, 2021