
DOC: get_dummies behaviour with dummy_na = True is counter-intuitive / incorrect when no NaN values present #59968

Closed

@DM-Berger

Description

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

Documentation problem

The docs currently read for this function:

Add a column to indicate NaNs, if False NaNs are ignored.

However, when no NaN values are present, a useless constant NaN indicator column is still added:

>>> df = pd.DataFrame(pd.Series([0, 1], dtype="int64"))
>>> df
   0
0  0
1  1
>>> pd.get_dummies(df, columns=[0], dummy_na=True, drop_first=True)
   0_1.0  0_nan
0  False  False
1   True  False
>>> pd.get_dummies(df, columns=[0], dummy_na=True)
   0_0.0  0_1.0  0_nan
0   True  False  False
1  False   True  False
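
For contrast, the indicator only becomes informative when NaN values actually occur. A quick check (output shown with recent pandas, where dummies are boolean; df_na is just an illustrative frame):

>>> import numpy as np
>>> df_na = pd.DataFrame({"x": [0.0, 1.0, np.nan]})
>>> pd.get_dummies(df_na, columns=["x"], dummy_na=True)
   x_0.0  x_1.0  x_nan
0   True  False  False
1  False   True  False
2  False  False   True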

This is arguably quite unexpected behaviour, as constant columns carry no information except in some very rare cases and with specific custom models.

I.e. for almost any kind of model this column will be ignored, but it will still clutter e.g. .feature_importances_ attributes, and it can needlessly increase compute time for algorithms that scale significantly with the number of features but lack a way to ignore constant inputs. For data with many binary features, and pipelines or models that e.g. convert to floating-point dtypes, all these useless extra constant features can also significantly increase memory requirements (especially in multiprocessing contexts).
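
As a hypothetical post-hoc workaround (not a pandas API; the helper name and the "_nan" suffix matching, which assumes the default prefix_sep="_", are my own), one can strip NaN-indicator columns that ended up entirely False:

import pandas as pd

def drop_all_false_nan_dummies(dummies: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: drop "_nan" indicator columns that are constant False.
    is_nan_col = dummies.columns.astype(str).str.endswith("_nan")
    all_false = ~dummies.any(axis=0).to_numpy()
    return dummies.loc[:, ~(is_nan_col & all_false)]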

I imagine the intended design decision is that if you do e.g. df1, df2 = train_test_split(df), and it turns out that df1 has no NaN values for some feature "f" but df2 does, then, at least with the current implementation, the user can be assured that the following does not raise an AssertionError:

dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
assert dummies1.columns.tolist() == dummies2.columns.tolist()
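
To make that concrete, a minimal sketch (df1 and df2 stand in for the two splits; only df2 contains a NaN):

>>> import numpy as np
>>> df1 = pd.DataFrame({"f": [0.0, 1.0]})
>>> df2 = pd.DataFrame({"f": [0.0, 1.0, np.nan]})
>>> dummies1 = pd.get_dummies(df1, columns=["f"], dummy_na=True)
>>> dummies2 = pd.get_dummies(df2, columns=["f"], dummy_na=True)
>>> dummies1.columns.tolist() == dummies2.columns.tolist()
True
>>> dummies1.columns.tolist()
['f_0.0', 'f_1.0', 'f_nan']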

But that is still a fairly unusual use-case, as dummification should generally happen right at the beginning, on the full data.

In my opinion, the default behaviour should be to add a NaN indicator column only if NaN values are actually present. I would actually consider the current behaviour an implementation or design bug for the vast majority of use-cases. But at a bare minimum, this undesirable and unexpected behaviour should be documented, with some reasoning.
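
For illustration, the proposed default could be approximated today with a wrapper along these lines (the function name and signature are hypothetical, not a pandas API):

import pandas as pd

def get_dummies_na_if_present(df: pd.DataFrame, columns: list, **kwargs) -> pd.DataFrame:
    # Hypothetical sketch: request a NaN indicator only for columns that
    # actually contain NaNs; other columns are passed through unchanged.
    parts = [df.drop(columns=columns)]
    for col in columns:
        has_na = bool(df[col].isna().any())
        parts.append(pd.get_dummies(df[[col]], columns=[col], dummy_na=has_na, **kwargs))
    return pd.concat(parts, axis=1)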

Suggested fix for documentation

Just change the docs to:

Add a column to indicate NaNs. If True, a NaN indicator column will be added even if no NaN values are present [insert reasoning here]. If False, NaNs are ignored [actually "ignored" is extremely unclear too, as per this issue].
