ENH: Add dropna argument to DataFrame.value_counts() #41334

Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -224,6 +224,7 @@ Other enhancements
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
- Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
- Add keyword ``dropna`` to :meth:`DataFrame.value_counts` to allow counting rows that include ``NA`` values (:issue:`41325`)
-

.. ---------------------------------------------------------------------------
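To make the whatsnew entry above concrete, here is a minimal sketch of the new keyword in use. The DataFrame is hypothetical and mirrors the docstring example added later in this PR; the normalize combination is shown only to illustrate how the keywords compose.

import pandas as pd

# Hypothetical data: two of the four rows are missing a middle name.
df = pd.DataFrame(
    {
        "first_name": ["John", "Anne", "John", "Beth"],
        "middle_name": ["Smith", pd.NA, pd.NA, "Louise"],
    }
)

# Default (dropna=True): rows containing NA are left out of the counts.
print(df.value_counts())

# New in 1.3.0: dropna=False also counts the rows with NA values; combined
# with normalize=True, proportions are computed over all counted rows.
print(df.value_counts(dropna=False, normalize=True))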
32 changes: 31 additions & 1 deletion pandas/core/frame.py
@@ -6380,6 +6380,7 @@ def value_counts(
normalize: bool = False,
sort: bool = True,
ascending: bool = False,
dropna: bool = True,
):
"""
Return a Series containing counts of unique rows in the DataFrame.
@@ -6396,6 +6397,10 @@
Sort by frequencies.
ascending : bool, default False
Sort in ascending order.
dropna : bool, default True
Don't include counts of rows that contain NA values.

.. versionadded:: 1.3.0

Returns
-------
@@ -6451,11 +6456,36 @@
2 2 0.25
6 0 0.25
dtype: float64

With `dropna` set to `False` we can also count rows with NA values.

>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
... 'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
first_name middle_name
0 John Smith
1 Anne <NA>
2 John <NA>
3 Beth Louise

>>> df.value_counts()
first_name middle_name
Beth Louise 1
John Smith 1
dtype: int64

>>> df.value_counts(dropna=False)
first_name middle_name
Anne NaN 1
Beth Louise 1
John Smith 1
NaN 1
dtype: int64
"""
if subset is None:
subset = self.columns.tolist()

counts = self.groupby(subset).grouper.size()
counts = self.groupby(subset, dropna=dropna).grouper.size()

if sort:
counts = counts.sort_values(ascending=ascending)
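The implementation is a one-line change: the new keyword is forwarded straight to the underlying groupby. Ignoring the sorting and normalization that follow, the counting step behaves roughly like this sketch (the DataFrame is hypothetical):

import pandas as pd

df = pd.DataFrame(
    {
        "first_name": ["John", "Anne", "John", "Beth"],
        "middle_name": ["Smith", pd.NA, pd.NA, "Louise"],
    }
)

# Roughly what DataFrame.value_counts(dropna=False) does: group by every
# column (the default subset) and take the group sizes, telling the groupby
# itself to keep NA keys instead of dropping them.
counts = df.groupby(list(df.columns), dropna=False).size()
print(counts)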
44 changes: 44 additions & 0 deletions pandas/tests/frame/methods/test_value_counts.py
@@ -100,3 +100,47 @@ def test_data_frame_value_counts_empty_normalize():
expected = pd.Series([], dtype=np.float64)

tm.assert_series_equal(result, expected)


def test_data_frame_value_counts_dropna_true(nulls_fixture):
Member: Is it possible to parameterize this test with dropna instead of 2 separate tests?

Contributor Author: I am not sure how to do that while using the nulls_fixture. Any suggestion is welcome!

Member: You can simply add another parametrization:

@pytest.mark.parametrize("x", [1, 2])
def test_data_frame_value_counts_dropna_true(nulls_fixture, x):
    pass

But I'm not sure whether this would just make it more complicated here.

Contributor Author (@connesy, May 10, 2021): The issue is that the input and expected dataframes are quite different in the two tests, so they would need to be defined as part of the parametrization. But both are also defined using the values from the nulls_fixture, so I would somehow need to use the fixture values in the parametrization, and I'm not sure how I can do both.

Member: Ah, now I get it. An if/else is the only way here, but I think that would make it more complicated than it's worth.

# GH 41334
df = pd.DataFrame(
{
"first_name": ["John", "Anne", "John", "Beth"],
"middle_name": ["Smith", nulls_fixture, nulls_fixture, "Louise"],
},
)
result = df.value_counts()
expected = pd.Series(
data=[1, 1],
index=pd.MultiIndex.from_arrays(
[("Beth", "John"), ("Louise", "Smith")], names=["first_name", "middle_name"]
),
)

tm.assert_series_equal(result, expected)


def test_data_frame_value_counts_dropna_false(nulls_fixture):
# GH 41334
df = pd.DataFrame(
{
"first_name": ["John", "Anne", "John", "Beth"],
"middle_name": ["Smith", nulls_fixture, nulls_fixture, "Louise"],
},
)

result = df.value_counts(dropna=False)
expected = pd.Series(
data=[1, 1, 1, 1],
index=pd.MultiIndex(
levels=[
pd.Index(["Anne", "Beth", "John"]),
pd.Index(["Louise", "Smith", nulls_fixture]),
],
codes=[[0, 1, 2, 2], [2, 0, 1, 2]],
names=["first_name", "middle_name"],
),
)

tm.assert_series_equal(result, expected)
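As a follow-up to the review discussion above, here is a rough sketch of what a single parametrized test using the if/else approach could look like. This is illustrative only and not part of the PR: it assumes pandas' nulls_fixture from the test suite's conftest, and the expected values are copied verbatim from the two tests above.

import pandas as pd
import pandas._testing as tm
import pytest


@pytest.mark.parametrize("dropna", [True, False])
def test_data_frame_value_counts_dropna(nulls_fixture, dropna):
    # GH 41334 -- illustrative merge of the two tests above.
    df = pd.DataFrame(
        {
            "first_name": ["John", "Anne", "John", "Beth"],
            "middle_name": ["Smith", nulls_fixture, nulls_fixture, "Louise"],
        },
    )

    result = df.value_counts(dropna=dropna)

    if dropna:
        # Rows containing the null value are dropped before counting.
        expected = pd.Series(
            data=[1, 1],
            index=pd.MultiIndex.from_arrays(
                [("Beth", "John"), ("Louise", "Smith")],
                names=["first_name", "middle_name"],
            ),
        )
    else:
        # All four rows are counted; the null value appears in the index.
        expected = pd.Series(
            data=[1, 1, 1, 1],
            index=pd.MultiIndex(
                levels=[
                    pd.Index(["Anne", "Beth", "John"]),
                    pd.Index(["Louise", "Smith", nulls_fixture]),
                ],
                codes=[[0, 1, 2, 2], [2, 0, 1, 2]],
                names=["first_name", "middle_name"],
            ),
        )

    tm.assert_series_equal(result, expected)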