
DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968

Closed

@DM-Berger

Description

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Documentation problem

The docs currently read for this function:

Add a column to indicate NaNs, if False NaNs are ignored.

However, when no NaN values are present, a useless constant NaN indicator column is still added:

>>> df = pd.DataFrame(pd.Series([0, 1], dtype="int64"))
>>> df
   0
0  0
1  1
>>> pd.get_dummies(df, columns=[0], dummy_na=True, drop_first=True)
   0_1.0  0_nan
0  False  False
1   True  False
>>> pd.get_dummies(df, columns=[0], dummy_na=True)
   0_0.0  0_1.0  0_nan
0   True  False  False
1  False   True  False
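
For contrast, the indicator only becomes informative when NaN values actually occur. A quick check (output shown with recent pandas, where dummies are boolean; df_na is just an illustrative frame):

>>> import numpy as np
>>> df_na = pd.DataFrame({"x": [0.0, 1.0, np.nan]})
>>> pd.get_dummies(df_na, columns=["x"], dummy_na=True)
   x_0.0  x_1.0  x_nan
0   True  False  False
1  False   True  False
2  False  False   True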

This is arguably quite unexpected behaviour, as constant columns carry no information except in some very rare cases and with specific custom models.

I.e. for almost any kind of model this column will be ignored, but it will still clutter e.g. .feature_importances_ attributes, and it can needlessly increase compute time for algorithms that scale significantly with the number of features but lack a way to ignore constant inputs. For data with many binary features, and pipelines or models that e.g. convert to floating-point dtypes, all these useless extra constant features can also significantly increase memory requirements (especially in multiprocessing contexts).
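
As a hypothetical post-hoc workaround (not a pandas API; the helper name and the "_nan" suffix matching, which assumes the default prefix_sep="_", are my own), one can strip NaN-indicator columns that ended up entirely False:

import pandas as pd

def drop_all_false_nan_dummies(dummies: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: drop "_nan" indicator columns that are constant False.
    is_nan_col = dummies.columns.astype(str).str.endswith("_nan")
    all_false = ~dummies.any(axis=0).to_numpy()
    return dummies.loc[:, ~(is_nan_col & all_false)]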

I imagine the intended design decision is that if you do e.g. df1, df2 = train_test_split(df), and it turns out that df1 has no NaN values for some feature "f" but df2 does, then, at least with the current implementation, the user can be assured that the following does not raise an AssertionError:

dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
assert dummies1.columns.tolist() == dummies2.columns.tolist()
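
To make that concrete, a minimal sketch (df1 and df2 stand in for the two splits; only df2 contains a NaN):

>>> import numpy as np
>>> df1 = pd.DataFrame({"f": [0.0, 1.0]})
>>> df2 = pd.DataFrame({"f": [0.0, 1.0, np.nan]})
>>> dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
>>> dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
>>> dummies1.columns.tolist() == dummies2.columns.tolist()
True
>>> dummies1.columns.tolist()
['f_0.0', 'f_1.0', 'f_nan']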

But that is still a fairly unusual use-case, as dummification should generally happen right at the beginning, on the full data.

In my opinion, the default behaviour should be to add a NaN indicator column only if NaN values are actually present. I would actually consider the current behaviour an implementation or design bug for the vast majority of use-cases. But at a bare minimum, this undesirable and unexpected behaviour should be documented, with some reasoning.
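
For illustration, the proposed default could be approximated today with a wrapper along these lines (the function name and signature are hypothetical, not a pandas API):

import pandas as pd

def get_dummies_na_if_present(df: pd.DataFrame, columns: list, **kwargs) -> pd.DataFrame:
    # Hypothetical sketch: request a NaN indicator only for columns that
    # actually contain NaNs; other columns are passed through unchanged.
    parts = [df.drop(columns=columns)]
    for col in columns:
        has_na = bool(df[col].isna().any())
        parts.append(pd.get_dummies(df[[col]], columns=[col], dummy_na=has_na, **kwargs))
    return pd.concat(parts, axis=1)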

Suggested fix for documentation

Just change the docs to:

Add a column to indicate NaNs. If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here]. If False, NaNs are ignored [actually "ignored" is extremely unclear too, as per this issue].
