DEPR: Some dropna behaviors in DataFrame.pivot_table #53521

rhshadrach · 2023-06-04T16:14:40Z

Currently dropna is used in four places within DataFrame.pivot_table:

1. It takes the cartesian product of all index/column levels when there are multiple levels; this was the original use
2. It is passed through to groupby
3. After the groupby aggregation, any rows that are all null are dropped
4. When computing the margins, rows in the original data where the keys and values are all null are dropped
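As a rough illustration of the first (cartesian product) use, here is a minimal sketch on hypothetical toy data; the column names `a`, `b`, and `v` are assumptions made for the example and do not come from the issue. It only shows the behavior as it stands today with multiple index levels, not any proposed API.

```python
import pandas as pd

# Hypothetical toy data: only two of the four possible (a, b) pairs occur.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "v": [10, 20]})

# dropna=True (the default): only the observed (a, b) pairs appear.
observed_only = df.pivot_table(index=["a", "b"], values="v", aggfunc="sum")

# dropna=False: the index is expanded to the product of its levels, so the
# unobserved pairs (1, "y") and (2, "x") appear with missing values.
full_product = df.pivot_table(
    index=["a", "b"], values="v", aggfunc="sum", dropna=False
)

print(observed_only)
print(full_product)
```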

Behaviors (1), (2), and (4) were all implemented for crosstab, which is essentially a call to pivot_table.

The API docs for crosstab document the dropna argument as:

Do not include columns whose entries are all NaN.

The only other documentation in the API and User Guide mentions using dropna=False to include rows/columns for categorical data with missing categorical values.

I think this is too much for a single Boolean argument to handle. I propose the following:

a. Add cartesian_product=[True|False] to pivot_table and crosstab
b. Add observed=[True|False] to crosstab for use with categoricals
c. Deprecate behavior (1) (with dropna), (3), and (4) above. The user may do each of these by dropping null values from the input data if they so desire.
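To make the "drop null values from the input data" alternative concrete, here is a small sketch on hypothetical data (the column names `k` and `v` are assumptions). It filters out rows with a null key before pivoting, so the result no longer depends on what dropna does with null keys.

```python
import pandas as pd

# Hypothetical data with one missing key.
df = pd.DataFrame({"k": ["a", None, "b", "a"], "v": [1, 2, 3, 4]})

# Drop rows whose key is null up front instead of relying on dropna.
cleaned = df.dropna(subset=["k"])

# With no null keys left, dropna=False cannot introduce a null group.
result = cleaned.pivot_table(index="k", values="v", aggfunc="sum", dropna=False)
print(result)
```

The same pattern extends to the values: passing the value columns to `subset` removes rows with null values before aggregation.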

We can implement (c) without affecting the behavior of crosstab by changing the data there to be a mixture of null/non-null values depending on the input and using the aggregation count instead of len.


rhshadrach · 2023-09-24T12:33:40Z

Instead of adding cartesian_product, we could make observed=False take the cartesian product of both the index and the column, regardless of whether they are categorical. In my opinion, this would only be okay to do if #55261 gets implemented.

Doing this would mean more than just passing observed=False through to groupby - that would only handle the index. We would still need to take the cartesian product of the columns in the case they are a MultiIndex as is done with dropna=False today.

If we're going this route, then I think we should also adhere to groupby semantics with unobserved groupings for various ops. For example:

import pandas as pd

df = pd.DataFrame(
    {
        'idx1': 1,
        'idx2': [2, 3],
        'col1': 4,
        'col2': [5, 6],
        'val1': [7, 8],
        'val2': [9, 10],
    }
)
df.pivot_table(
    index=['idx1', 'idx2'],
    columns=['col1', 'col2'],
    values=['val1', 'val2'],
    aggfunc='sum',
    dropna=False,
)

currently results in NaN values for the combinations that do not occur in the data; with observed=False those cells should instead be 0, because the specified aggfunc is sum.
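For reference, the groupby semantics being referred to can be seen today with a categorical key, which is the only case where groupby currently honors observed=False (#55261 proposes extending it). This is a minimal sketch on hypothetical data; the names `key` and `val` are assumptions.

```python
import pandas as pd

# Category "c" is declared but never observed in the data.
key = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"key": key, "val": [1, 2]})

grouped = df.groupby("key", observed=False)["val"]

# sum over an empty group is 0, while mean over an empty group is NaN.
totals = grouped.sum()
means = grouped.mean()

print(totals)
print(means)
```

The fill for an unobserved group depends on the operation, which is why a single NaN fill in pivot_table would not match groupby.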

rhshadrach · 2025-03-22T11:48:08Z

Looking at this again - I think we should also deprecate (2). This can be done by the user as it's just a matter of dropping NA values from the input data.

tehunter · 2025-04-23T11:50:59Z

+1

As a user, I've learned I have to avoid .pivot_table, especially with margins. I've had issues in the past where the margins didn't aggregate up properly so there was disagreement with the "base" cells. I never had time to thoroughly debug and find where the issue was coming from, but after reading this, I suspect it might be related to the treatment of NAs.

rhshadrach added the following labels on Jun 4, 2023:

Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Deprecate (Functionality to remove in pandas)
Needs Discussion (Requires discussion from core team before further action)

rhshadrach mentioned this issue Jun 6, 2023

PDEP-11: Change default of dropna to False #53094

Open

rhshadrach mentioned this issue Sep 24, 2023

ENH: Respect observed=False in groupby for non-categoricals #55261

Open

rhshadrach mentioned this issue Jan 13, 2024

BUG: pivot_table downcasting dtypes even if not necessary #47971

Closed

3 tasks

rhshadrach mentioned this issue Mar 5, 2024

DataFrame.unstack() drops columns if dataframe is empty #21255

Open

rhshadrach mentioned this issue Jan 25, 2025

BUG: pd.crosstab(dropna=True, values=) drops rows and columns where any aggregation result is null. #60767

Open

3 tasks

snitish mentioned this issue Mar 13, 2025

BUG: pivot_table drops rows and columns despite values being non-NaN #61113

Closed

3 tasks
