Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
Documentation problem
The docs for the dummy_na parameter of this function currently read:
Add a column to indicate NaNs, if False NaNs are ignored.
However, when no NaN values are present, a useless constant NaN indicator column is still added:
>>> df = pd.DataFrame(pd.Series([0, 1], dtype="int64"))
>>> df
   0
0  0
1  1
>>> pd.get_dummies(df, columns=[0], dummy_na=True, drop_first=True)
   0_1.0  0_nan
0  False  False
1   True  False
>>> pd.get_dummies(df, columns=[0], dummy_na=True)
   0_0.0  0_1.0  0_nan
0   True  False  False
1  False   True  False
This is arguably quite unexpected behaviour, as constant columns do not contain information except in some very rare cases and for specific custom models.
That is, almost any kind of model will ignore this column, but it will still clutter e.g. .feature_importances variables, and it may needlessly increase compute times for algorithms that scale significantly with the number of features but have no method for ignoring constant input features. For data with a lot of binary features, and pipelines or models that also convert to floating-point dtypes, all these useless extra constant features could also significantly increase memory requirements (especially in multiprocessing contexts).
I imagine the intended design decision is that if you do e.g. df1, df2 = train_test_split(df), and it ends up such that df1 doesn't have any NaN values for some feature "f", but df2 does, then at least with the current implementation, the user can be assured that the following does not raise an AssertionError:
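# The second positional argument of get_dummies is `prefix`, so pass `columns` by keyword.
dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)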
assert dummies1.columns.tolist() == dummies2.columns.tolist()
But still, that is a pretty strange use-case, as dummification should generally happen right at the beginning on the full data.
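To make the presumed rationale concrete, here is a small sketch. The data is made up, and it assumes both splits happen to contain the same non-NaN categories (otherwise the columns differ regardless of dummy_na). It also shows one possible alternative for getting matching columns, namely aligning explicitly with reindex, which would not depend on the always-present NaN column:

```python
import numpy as np
import pandas as pd

# Made-up splits: df1 has no NaN in "f", df2 does.
df1 = pd.DataFrame({"f": ["a", "b", "a"]})
df2 = pd.DataFrame({"f": ["a", "b", np.nan]})

# Current behaviour: dummy_na=True adds the NaN column to both frames,
# so the column lists match even though df1 contains no NaN.
dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
same_columns = dummies1.columns.tolist() == dummies2.columns.tolist()

# Alternative sketch: encode df1 without the NaN column, then align it
# to df2's columns, filling any missing indicator columns with False.
d1 = pd.get_dummies(df1, columns=["f"])
d1_aligned = d1.reindex(columns=dummies2.columns, fill_value=False)

print(same_columns)
print(d1_aligned.columns.tolist())
```

The reindex variant makes the alignment step explicit at the point where it is needed, instead of adding the indicator column everywhere up front.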
In my opinion the default behaviour should be to only add a NaN indicator column if NaN values are actually present. I would actually consider this an implementation or design bug for something like 99% of use-cases. But at bare minimum, this undesirable and unexpected behaviour should be documented with some reasoning.
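Until then, the behaviour I would expect can be approximated in user code. The following is only a sketch: get_dummies_na_if_present is a hypothetical helper name of my own, not a pandas function, and it simply calls pd.get_dummies twice, once with dummy_na=True for the columns that contain NaN and once with dummy_na=False for the rest.

```python
import numpy as np
import pandas as pd


def get_dummies_na_if_present(df, columns, **kwargs):
    """Hypothetical helper (not part of pandas): one-hot encode `columns`,
    adding a NaN indicator only for columns that actually contain NaN.

    Extra keyword arguments are forwarded to pd.get_dummies; do not pass
    `dummy_na` or `columns` through them.
    """
    with_na = [c for c in columns if df[c].isna().any()]
    without_na = [c for c in columns if c not in with_na]

    out = df
    if with_na:
        # Columns not listed here are passed through unchanged.
        out = pd.get_dummies(out, columns=with_na, dummy_na=True, **kwargs)
    if without_na:
        out = pd.get_dummies(out, columns=without_na, dummy_na=False, **kwargs)
    return out


# Made-up example: "a" has no NaN, "b" does.
df = pd.DataFrame({"a": [0, 1], "b": [1.0, np.nan]})
result = get_dummies_na_if_present(df, columns=["a", "b"])
print(result.columns.tolist())
```

Note that the dummy columns come out grouped by call (NaN-containing columns first), so the column order differs from a single pd.get_dummies call.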
Suggested fix for documentation
Just change the docs to:
Add a column to indicate NaNs. If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here]. If False, NaNs are ignored [actually "ignored" is extremely unclear too, as per this issue].
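For reference, here is what "ignored" appears to mean in practice, as far as I can tell from running it on a made-up Series: with dummy_na=False, a row containing NaN is kept, but it gets False in every dummy column.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# dummy_na=False: no NaN column; the NaN row is all False.
ignored = pd.get_dummies(s, dummy_na=False)

# dummy_na=True: an extra column marks the NaN row.
flagged = pd.get_dummies(s, dummy_na=True)

print(ignored)
print(flagged)
```

So "ignored" here does not mean the rows are dropped, which may be worth spelling out in the docs as well.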