groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe #21390

rogeriomgatto · 2018-06-08T17:41:08Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'label2': list('xyxyxyxyx'),
    'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
    'n2': np.arange(9, dtype='float') ** 2
})

# this is correct
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

# convert to categoricals
df['label1'] = df['label1'].astype('category')
df['label2'] = df['label2'].astype('category')
df['minute'] = df['minute'].astype('category')

# this is wrong, returns all NaNs
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

Problem description

When grouping by [str, datetime] columns, results are as expected:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute                        
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0

After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute                     
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN

I only see this bug when grouping on two categoricals where one of them is datetime-based (the order is irrelevant). Grouping by ['label1', 'label2'], or by 'minute' alone, works as expected.
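Since the report says the failure only shows up when both grouping keys are categorical, one possible workaround on affected versions might be to cast the datetime-based categorical back to plain datetime64 before grouping. The sketch below is an assumption drawn from that observation, not something confirmed in this thread, and it has not been checked against 0.23.0. It builds the minutes with `pd.to_datetime` rather than `date_range` only to avoid depending on a frequency alias, and it passes `observed=True` so that only the combinations present in the data are returned.

```python
import numpy as np
import pandas as pd

minutes = pd.to_datetime(
    ["2018-06-01 00:00", "2018-06-01 00:01", "2018-06-01 00:02"]
)
df = pd.DataFrame({
    "label1": list("abcbabcba"),
    "minute": list(minutes) * 3,
    "n1": np.arange(9, dtype="float"),
    "n2": np.arange(9, dtype="float") ** 2,
})
df["label1"] = df["label1"].astype("category")
df["minute"] = df["minute"].astype("category")

# Assumed workaround: group on a plain datetime64 copy of the
# categorical column, so only one grouping key is categorical.
plain = df.assign(minute=df["minute"].astype("datetime64[ns]"))
result = plain.groupby(["label1", "minute"], observed=True)[["n1", "n2"]].mean()
print(result)
```

The original frame keeps its categorical column; only the temporary copy used for grouping is cast back.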

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-06-08T17:46:35Z

Looks very similar to #21334

cc @jreback

jorisvandenbossche · 2018-06-27T13:10:06Z

This seems to boil down to a problem with reindexing with such a categorical index:

idx = pd.MultiIndex.from_product([
    pd.Categorical(['a', 'b', 'c']),
    pd.Categorical(pd.date_range("2012-01-01", periods=3, freq='H')),
])
df = pd.DataFrame({'a': range(len(idx))}, index=idx)
# drop row 7, then reindex back to the full index
df2 = df.iloc[[0, 1, 2, 3, 4, 5, 6, 8]]
df2.reindex(idx)

On 0.22.0 this works correctly, but on master it gives:

In [23]: df2.reindex(idx)
Out[23]: 
                        a
a 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
b 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
c 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
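For comparison, the behaviour one would expect from this reindex is that only the dropped row (position 7) comes back as NaN and every other value is preserved. The check below is a minimal sketch of that expectation. It builds the hourly timestamps with `pd.to_datetime` instead of `date_range` so that it does not rely on a frequency alias, and the variable names `result` and `missing` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
)
idx = pd.MultiIndex.from_product(
    [pd.Categorical(["a", "b", "c"]), pd.Categorical(dates)]
)
df = pd.DataFrame({"a": range(len(idx))}, index=idx)

# Drop position 7, then reindex back to the full index.
df2 = df.iloc[[0, 1, 2, 3, 4, 5, 6, 8]]
result = df2.reindex(idx)

# Only the dropped row should be missing after the reindex.
missing = result["a"].isna().tolist()
print(missing)
```

On an affected version every entry of `missing` would be True, matching the all-NaN frame shown above.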

jorisvandenbossche · 2018-06-27T13:45:36Z

cc @toobaz this seems to be related to the new MultiIndexUIntEngine

Using the above example (idx is a MultiIndex):

In [6]: idx._engine.get_indexer(idx)
Out[6]: array([-1, -1, -1, -1, -1, -1, -1, -1, -1])
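The same lookup can be tried through the public `MultiIndex.get_indexer` method instead of the private `_engine` attribute. Looking an index up in itself should return each entry's own position, so a result full of -1 means nothing was found. This is a small sketch of that check, with the timestamps built via `pd.to_datetime` as an assumption to keep it independent of frequency aliases.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
)
idx = pd.MultiIndex.from_product(
    [pd.Categorical(["a", "b", "c"]), pd.Categorical(dates)]
)

# Each entry should be found at its own position; -1 means "not found".
positions = idx.get_indexer(idx)
print(positions)
```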

jorisvandenbossche · 2018-06-27T13:52:54Z

Sorry Pietro, probably a bit prematurely pointed to that :-), as in the end it is code in the MultiIndexUIntEngine that surfaces another bug. Iterating a MultiIndex (tolist) with a categorical datetime level is broken: the datetimes come back as raw integer nanoseconds instead of Timestamps. This was already broken in 0.22.0 and only surfaces now because the MultiIndex engine relies on it:

In [21]: list(idx)
Out[21]: 
[('a', 1325376000000000000),
 ('a', 1325379600000000000),
 ('a', 1325383200000000000),
 ('b', 1325376000000000000),
 ('b', 1325379600000000000),
 ('b', 1325383200000000000),
 ('c', 1325376000000000000),
 ('c', 1325379600000000000),
 ('c', 1325383200000000000)]

In [22]: list(idx.get_level_values(1))
Out[22]: 
[Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00')]
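A quick way to test for this mismatch is to compare the second element of each tuple from iteration with what `get_level_values(1)` returns. The sketch below assumes the intended behaviour is that both yield `Timestamp` objects; the names `from_iteration` and `from_level` are illustrative only.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
)
idx = pd.MultiIndex.from_product(
    [pd.Categorical(["a", "b", "c"]), pd.Categorical(dates)]
)

# Second element of every tuple produced by iterating the MultiIndex.
from_iteration = [entry[1] for entry in idx]
# The same level read directly.
from_level = list(idx.get_level_values(1))

# True only if iteration boxes the datetimes as Timestamps.
all_boxed = all(isinstance(value, pd.Timestamp) for value in from_iteration)
print(all_boxed, from_iteration == from_level)
```

With the behaviour reported above, `all_boxed` would be False because iteration yields plain integers.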

toobaz · 2018-06-27T15:18:32Z

> Sorry Pietro, probably a bit prematurely pointed to that :-)

Good :-) In general, it is unlikely that bugs in the MI engine code are dtype-specific, as it entirely delegates actual lookup to single levels, and only looks for integers (codes).
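To illustrate why a lookup that works on codes should not care about the dtype of a level, the sketch below folds the per-level integer codes of a MultiIndex into a single integer key. This is a simplified illustration, not the actual MultiIndexUIntEngine code: `combine_codes` is a hypothetical helper, it uses a plain mixed-radix encoding, and it ignores missing values (code -1).

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(
    ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
)
idx = pd.MultiIndex.from_product([["a", "b", "c"], dates])


def combine_codes(mi):
    """Hypothetical helper: fold per-level integer codes into one key."""
    keys = np.zeros(len(mi), dtype="int64")
    for level, codes in zip(mi.levels, mi.codes):
        # Shift the running key by the level size, then add this level's code.
        keys = keys * len(level) + np.asarray(codes, dtype="int64")
    return keys


keys = combine_codes(idx)
print(keys)
```

Only the codes and the level sizes enter the key, so the values stored in the levels (strings, datetimes, categoricals) never take part in this step.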

jorisvandenbossche · 2018-06-27T15:55:39Z

PR: #21657

gfyoung added the Groupby and Categorical labels Jun 8, 2018

jorisvandenbossche added the Regression label (functionality that used to work in a prior pandas version) Jun 8, 2018

jorisvandenbossche added this to the 0.23.2 milestone Jun 8, 2018

jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018

jorisvandenbossche modified the milestones: 0.23.3, 0.23.2 Jun 27, 2018

This was referenced Jun 27, 2018

BUG: fix reindexing MultiIndex with categorical datetime-like level #21657 (Merged)

BUG: .values for objects containing categoricals with box-able categories #21658 (Closed)

jreback modified the milestones: 0.23.2, 0.23.3 Jun 28, 2018

jorisvandenbossche closed this as completed in #21657 Jul 2, 2018
