
groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe #21390

Closed
rogeriomgatto opened this issue Jun 8, 2018 · 6 comments · Fixed by #21657
Labels: Categorical (Categorical Data Type), Groupby, Regression (Functionality that used to work in a prior pandas version)

@rogeriomgatto

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'label2': list('xyxyxyxyx'),
    'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
    'n2': np.arange(9, dtype='float') ** 2
})

# this is correct
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

# convert to categoricals
df['label1'] = df['label1'].astype('category')
df['label2'] = df['label2'].astype('category')
df['minute'] = df['minute'].astype('category')

# this is wrong, returns all NaNs
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

Problem description

When grouping by [str, datetime] columns, results are as expected:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute                        
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0

After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute                     
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN

I only hit this bug when grouping on 2 categoricals where one of them is datetime based (the order is irrelevant). Grouping by ['label1', 'label2'], or by 'minute' alone, works as expected.
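The expected behaviour can be written down as a small check. This is a minimal sketch, assuming a pandas version in which the categorical groupby works; the names `plain`, `by_cat`, `n_missing` and `same_values` are made up for the illustration, and `observed=False` is passed explicitly only so that the one unobserved combination (c, 00:01) stays in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'minute': list(pd.date_range('2018-06-01 00', freq='min', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
})

# Reference result: group on the plain str / datetime columns
plain = df.groupby(['label1', 'minute'])['n1'].mean()

# Same grouping after converting both keys to categoricals
cat = df.copy()
cat['label1'] = cat['label1'].astype('category')
cat['minute'] = cat['minute'].astype('category')
by_cat = cat.groupby(['label1', 'minute'], observed=False)['n1'].mean()

# Only the unobserved (c, 00:01) combination should be NaN;
# every other group should match the plain result
n_missing = int(by_cat.isna().sum())
same_values = np.allclose(by_cat.dropna().to_numpy(), plain.to_numpy())
```

On an affected version (such as the 0.23.0 install reported here), `n_missing` would presumably come out as 9 instead of 1.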

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Groupby Categorical Categorical Data Type labels Jun 8, 2018
@gfyoung
Member

gfyoung commented Jun 8, 2018

Looks very similar to #21334

cc @jreback

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jun 8, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 8, 2018
@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018
@jorisvandenbossche
Member

This seems to boil down to a problem with reindexing with such a categorical index:

import pandas as pd

# MultiIndex whose second level is a categorical built from datetimes
idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']), pd.Categorical(pd.date_range("2012-01-01", periods=3, freq='H'))])
df = pd.DataFrame({'a': range(len(idx))}, index=idx)
df2 = df.iloc[[0, 1, 2, 3, 4, 5, 6, 8]]
df2.reindex(idx)

This works correctly on 0.22.0, but on master it gives:

In [23]: df2.reindex(idx)
Out[23]: 
                        a
a 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
b 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
c 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
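For comparison, here is a minimal sketch of the result one would expect from that reindex, assuming a pandas version without the regression: only the row dropped by `iloc` (position 7) should come back as NaN. The timestamps are built with `pd.to_datetime` rather than `date_range` purely to avoid frequency-alias differences between versions; `hours`, `out` and `missing_rows` are illustrative names.

```python
import pandas as pd

hours = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00', '2012-01-01 02:00'])
idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']),
                                  pd.Categorical(hours)])
df = pd.DataFrame({'a': range(len(idx))}, index=idx)

# Drop position 7, then reindex back to the full index
df2 = df.iloc[[0, 1, 2, 3, 4, 5, 6, 8]]
out = df2.reindex(idx)

# True only where a row is missing after the reindex
missing_rows = out['a'].isna().tolist()
```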

@jorisvandenbossche
Member

cc @toobaz this seems to be related to the new MultiIndexUIntEngine

Using the above example (idx is a MultiIndex):

In [6]: idx._engine.get_indexer(idx)
Out[6]: array([-1, -1, -1, -1, -1, -1, -1, -1, -1])
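The same lookup can be checked through the public `MultiIndex.get_indexer` method instead of the private `_engine` attribute. This is a sketch under the assumption that the pandas version in use is not affected: an index looked up against itself should map every entry to its own position, with no -1 entries.

```python
import pandas as pd

hours = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00', '2012-01-01 02:00'])
idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']),
                                  pd.Categorical(hours)])

# Position of each entry of idx within idx itself; -1 would mean "not found"
indexer = idx.get_indexer(idx)
```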

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 27, 2018

Sorry Pietro, probably a bit prematurely pointed to that :-), as in the end it is code in the MultiIndexUIntEngine that surfaces another bug. Iterating a MultiIndex (tolist) with a categorical datetime level is broken (it was already broken in 0.22.0, and only now surfaces through its use in the MultiIndex engine):

In [21]: list(idx)
Out[21]: 
[('a', 1325376000000000000),
 ('a', 1325379600000000000),
 ('a', 1325383200000000000),
 ('b', 1325376000000000000),
 ('b', 1325379600000000000),
 ('b', 1325383200000000000),
 ('c', 1325376000000000000),
 ('c', 1325379600000000000),
 ('c', 1325383200000000000)]

In [22]: list(idx.get_level_values(1))
Out[22]: 
[Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00')]
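The integers in the first output are the same timestamps expressed as nanoseconds since the Unix epoch, which is why tuple-based lookups no longer match. A minimal sketch that decodes them and, assuming a pandas version with the iteration fixed, checks that iterating the MultiIndex yields `Timestamp` objects (`raw`, `decoded` and `yields_timestamps` are illustrative names):

```python
import pandas as pd

# Raw values copied from the list(idx) output above
raw = [1325376000000000000, 1325379600000000000, 1325383200000000000]

# An integer passed to pd.Timestamp is interpreted as nanoseconds since the epoch
decoded = [pd.Timestamp(v) for v in raw]

hours = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00', '2012-01-01 02:00'])
idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']),
                                  pd.Categorical(hours)])

# With working iteration, the second element of every tuple is a Timestamp
yields_timestamps = all(isinstance(t[1], pd.Timestamp) for t in idx)
```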

@toobaz
Member

toobaz commented Jun 27, 2018

> Sorry Pietro, probably a bit prematurely pointed to that :-)

Good :-) In general, it is unlikely that bugs in the MI engine code are dtype-specific, as it entirely delegates actual lookup to single levels, and only looks for integers (codes).
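To illustrate the idea of working only on integer codes, here is a simplified sketch. It is not the actual pandas engine code; `pack_codes` is a hypothetical helper written for this comment. Each level contributes its integer code, and the codes are packed into one unsigned integer key per row, so the dtype of the level values never enters the combined lookup.

```python
import numpy as np
import pandas as pd

def pack_codes(codes, level_sizes):
    """Pack one integer code per level into a single uint64 key per row.

    Hypothetical helper for illustration only.
    """
    codes = np.asarray(codes, dtype='uint64')          # shape (n_rows, n_levels)
    # Bits needed to store the largest code of each level
    bits = [max(int(size - 1).bit_length(), 1) for size in level_sizes]
    # Each level is shifted left by the total width of the levels after it
    shifts = np.array([sum(bits[i + 1:]) for i in range(len(bits))], dtype='uint64')
    return np.bitwise_or.reduce(codes << shifts, axis=1)

hours = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00', '2012-01-01 02:00'])
idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']),
                                  pd.Categorical(hours)])

# One column of codes per level; the values themselves are never touched
codes = np.column_stack(list(idx.codes))
keys = pack_codes(codes, [len(level) for level in idx.levels])
```

With two levels of three categories each, every level needs two bits, so the key for codes (i, j) is 4 * i + j and all nine keys are distinct.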

@jorisvandenbossche
Member

PR: #21657
