In a nutshell:
- groupby on a single categorical column with prescribed categories incorrectly returns results for all categories, even those that are not actually present in the DataFrame.
- In addition, when aggregating after grouping on categoricals (groupby.sum and the like), with both prescribed and non-prescribed categories, we get values for all possible combinations of categories, including those not present in the DataFrame.
Problem description and code samples
Case 1: group by a single column
Consider the following code where we define a DataFrame with a categorical column with prescribed categories.
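The original snippet did not survive on this page. A minimal sketch of such a DataFrame could look like the following; the column names label1 and x come from the text, while the values and the list of prescribed categories are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the names label1 and x appear in the report,
# the values below are invented.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# Prescribe more categories than actually occur in the column.
prescribed = ["A", "B", "C", "D"]
df["label1"] = pd.Categorical(df["label1"], categories=prescribed)

# Categories that are declared but never used in the data.
present_values = set(df["label1"])
unused = [c for c in df["label1"].cat.categories if c not in present_values]
print(unused)
```

The unused list is what later shows up as extra rows when grouping.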
When we group by the categorical label1 column and aggregate, we incorrectly get results for all prescribed categories, including those that are not present in the DataFrame (see rows below where the value of x is NaN):
The elements in excess are already present in the groupby object:
The above doesn't happen if the categories for the label1 column are not prescribed or if the column is not converted to a categorical at all.
Case 2: group by multiple columns
Consider now the case where we have two categorical columns:
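The original snippet is missing here as well. A possible setup, with invented values and invented category lists (only the names label1, label2 and x are taken from the text), might be:

```python
import pandas as pd

# Hypothetical data: values and categories are invented for illustration.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "label2": ["a", "b", "a", "a"],
    "x": [1.0, 2.0, 3.0, 4.0],
})
df["label1"] = pd.Categorical(df["label1"], categories=["A", "B", "C"])
df["label2"] = pd.Categorical(df["label2"], categories=["a", "b", "c"])

# Label combinations that actually occur in the data.
present = sorted(set(zip(df["label1"], df["label2"])))
print(present)

# Number of combinations the prescribed categories would allow.
possible = len(df["label1"].cat.categories) * len(df["label2"].cat.categories)
print(possible)
```

Only 3 of the 9 possible combinations occur in this made-up frame.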
Contrary to the single column case above, we do get the correct group labels when we group by both categorical columns label1 and label2:
But we still get incorrect results if we aggregate:
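A sketch of the aggregation step on made-up data follows. It passes observed=False, a keyword that was added in pandas 0.23 and does not exist in the 0.20.3 release this report was filed against, purely to pin the behaviour described here on newer releases; on 0.20.3 the keyword would have to be dropped. The use of mean as the aggregation is also an assumption.

```python
import pandas as pd

# Hypothetical data: values and categories are invented for illustration.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "label2": ["a", "b", "a", "a"],
    "x": [1.0, 2.0, 3.0, 4.0],
})
df["label1"] = pd.Categorical(df["label1"], categories=["A", "B", "C"])
df["label2"] = pd.Categorical(df["label2"], categories=["a", "b", "c"])

# observed=False requests every combination of categories
# (assumption: pandas >= 0.23, where this keyword exists).
result = df.groupby(["label1", "label2"], observed=False)["x"].mean()
print(result)

# Combinations that never occur in df appear as NaN rows.
n_rows = len(result)
n_missing = int(result.isna().sum())
print(n_rows, n_missing)
```

With this data the result has one row per category pair, and every pair absent from df carries NaN.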
Note: The aggregation shows the same incorrect behaviour even when we don't prescribe the categories:
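As a rough illustration with invented values, the columns below are converted with astype("category") so that the categories are inferred from the data. As an assumption to make the behaviour reproducible on newer releases, observed=False is passed explicitly; that keyword was added in pandas 0.23 and is not available in 0.20.3.

```python
import pandas as pd

# Hypothetical data: values are invented for illustration.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "label2": ["a", "b", "a", "a"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# No prescribed categories: they are inferred from the values present.
df["label1"] = df["label1"].astype("category")
df["label2"] = df["label2"].astype("category")

# observed=False is an assumption (pandas >= 0.23).
result = df.groupby(["label1", "label2"], observed=False)["x"].mean()
print(result)

# Pairs in the result that never occur together in df.
absent = [idx for idx, value in result.items() if pd.isna(value)]
print(absent)
```

Each label occurs in the data, yet the pair ("B", "b") does not, and it still gets a row.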
Expected Output
For Case 1:
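The expected output itself is missing from this page. One way to produce a result restricted to the categories that are present, sketched on invented data, is shown below. The observed keyword is an assumption about the installed release: it was added in pandas 0.23 and is not available in 0.20.3. Dropping the unused categories with remove_unused_categories before grouping is an alternative that does not rely on the keyword's value.

```python
import pandas as pd

# Hypothetical data: values and categories are invented for illustration.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "x": [1.0, 2.0, 3.0, 4.0],
})
df["label1"] = pd.Categorical(df["label1"], categories=["A", "B", "C", "D"])

# Option 1 (assumption: pandas >= 0.23): keep only observed categories.
expected = df.groupby("label1", observed=True)["x"].mean()
print(expected)

# Option 2: drop unused categories first, so no extra groups can appear.
trimmed = df.assign(label1=df["label1"].cat.remove_unused_categories())
alternative = trimmed.groupby("label1", observed=False)["x"].mean()
print(alternative)
```

Both options yield one row for A and one for B, with no NaN rows for C or D.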
For Case 2:
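The expected output for this case is also missing. A sketch of an aggregation limited to the label combinations that occur in the data, on invented values, could be the following; it assumes a pandas release with the observed keyword (added in 0.23, so not available in 0.20.3).

```python
import pandas as pd

# Hypothetical data: values and categories are invented for illustration.
df = pd.DataFrame({
    "label1": ["A", "A", "B", "B"],
    "label2": ["a", "b", "a", "a"],
    "x": [1.0, 2.0, 3.0, 4.0],
})
df["label1"] = pd.Categorical(df["label1"], categories=["A", "B", "C"])
df["label2"] = pd.Categorical(df["label2"], categories=["a", "b", "c"])

# observed=True keeps only combinations present in df
# (assumption: pandas >= 0.23).
expected = df.groupby(["label1", "label2"], observed=True)["x"].mean()
print(expected)

pairs = list(expected.index)
print(pairs)
```

Only the three pairs that occur in df are returned, each with its own mean of x.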
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None