Series.value_counts: Preserve original ordering #24302

tomspur · 2018-12-16T03:26:07Z

Ensure that value_counts returns its index in the same order as the input object,
regardless of whether the counts are sorted in ascending or descending order.

closes ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering #12679
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
About 10 tests of the testsuite are currently failing and I hope to find a fix for them soon as well. Wanted to put up this PR now already to get feedback on the current implementation as well.
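
To make the goal concrete, here is a minimal pure-Python sketch of the ordering this PR aims for with sort=False: counts reported in order of first appearance. It is only an illustration; `counts_in_first_seen_order` is a hypothetical helper, not a pandas function, and it says nothing about how value_counts is implemented internally.

```python
from collections import Counter


def counts_in_first_seen_order(values):
    """Count each value, keeping keys in order of first appearance."""
    # Counter is a dict subclass, so on Python 3.7+ it remembers insertion order.
    return list(Counter(values).items())


print(counts_in_first_seen_order(list("bacaef")))
# [('b', 1), ('a', 2), ('c', 1), ('e', 1), ('f', 1)]
```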

pep8speaks · 2018-12-16T03:26:14Z

Hello @tomspur! Thanks for updating the PR.

In the file pandas/tests/test_algos.py, following are the PEP8 issues :

Line 966:80: E501 line too long (82 > 79 characters)

Comment last updated on December 23, 2018 at 20:34 Hours UTC

jreback · 2018-12-16T03:59:43Z

pandas/tests/series/test_value_counts.py

+    tm.assert_series_equal(Series(vc.index), s)
+
+
+def test_original_ordering_value_counts2():


we already have tests for value_counts, don’t add a new file just change the tests

Thanks! I'll move them to that location as well.

jreback · 2018-12-16T04:00:24Z

pandas/core/series.py

@@ -2660,16 +2661,6 @@ def sort_values(self, axis=0, ascending=True, inplace=False,
            raise ValueError("This Series is a view of some other array, to "
                             "sort in-place you must create a copy")

-        def _try_kind_sort(arr):


why are you changing this?

I couldn't get it to work with the sorting kept separate from the ascending vs. descending handling further down in the code, so I changed it to pass both on to nargsort, which does them at once.

jreback · 2018-12-16T04:00:56Z

pandas/core/algorithms.py

+            # Use same index as the original values
+            if result.index.isna().sum() > 0:
+                fill_value = result[result.index.isna()].values[0]
+                result = result.reindex(unique(values), fill_value=fill_value)


huh? this is way beyond scope

The problem is that _value_counts_arraylike replaces all generic nan values (e.g. None) with np.NaN and reindexing has a mismatch between the original None and the new np.NaN in the index. The fill_value replaces the new missing value with the original value again.

Alternatively, it is also possible to use a generic values = values.fillna(np.NaN) to replace the Nones with np.NaNs if there are nans in the index. I'll push a follow up commit about this in a bit. Would that be better than the above?

codecov · 2018-12-16T23:40:40Z

Codecov Report

Merging #24302 into master will decrease coverage by 49.27%.
The diff coverage is 18.18%.

@@             Coverage Diff             @@
##           master   #24302       +/-   ##
===========================================
- Coverage   92.28%      43%   -49.28%     
===========================================
  Files         162      162               
  Lines       51827    51829        +2     
===========================================
- Hits        47827    22289    -25538     
- Misses       4000    29540    +25540

Flag	Coverage Δ
#multiple	`?`
#single	`43% <18.18%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`49.84% <0%> (-45.27%)`	⬇️
pandas/core/series.py	`49.32% <100%> (-44.38%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-98.65%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
... and 122 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update df3b045...f562ded.

codecov · 2018-12-16T23:40:41Z

Codecov Report

Merging #24302 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24302      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51947    51953       +6     
==========================================
+ Hits        47949    47955       +6     
  Misses       3998     3998

Flag	Coverage Δ
#multiple	`90.71% <100%> (ø)`	⬆️
#single	`42.98% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`95.15% <100%> (+0.04%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1cd077a...355f872.

TomAugspurger · 2018-12-17T18:04:11Z

pandas/core/algorithms.py

@@ -706,6 +706,19 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
                keys = Index(keys)
            result = Series(counts, index=keys, name=name)

+            # Use same index as the original values
+            if result.index.isna().sum() > 0:


I don't understand this section. Are there cases where values.fillna(np.nan) does anything?

What basically happens a few lines further down is a reindex of the value_counts result:

    vals = [1, None, 2, 2]
    pd.Series(vals).value_counts(dropna=False).reindex(pd.Series(pd.unique(vals)))

value_counts transforms the None into NaN in the index, so the reindex finds no match for the original None and its count becomes NaN (and this value count is dropped from the returned reindexed Series).

The fillna transforms the None on the reindex side as well, so that the resulting Series still contains the value count:

    pd.Series(vals).value_counts(dropna=False).reindex(pd.Series(pd.unique(vals)).fillna(np.NaN))

I hope this makes sense, because I am quite confused at times as well. If you have another solution to fix this edge case, please let me know...
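
The mismatch described above can be pictured with a pure-Python analogy. This is a sketch under stated assumptions, not pandas internals: a plain dict stands in for the counted Series, `None` in the result stands in for a missing count, and `is_missing` and `reindex_counts` are hypothetical helpers.

```python
import math

NAN = float("nan")


def is_missing(x):
    """True for None and for float NaN."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def reindex_counts(counts, new_keys, normalize_missing=False):
    """Look up each key in counts; unmatched keys yield None."""
    result = []
    for key in new_keys:
        if normalize_missing and is_missing(key):
            # Treat every missing marker (None or NaN) as the same key.
            match = next((v for k, v in counts.items() if is_missing(k)), None)
        else:
            match = counts.get(key)
        result.append(match)
    return result


# Counts keyed by NaN, as if None had already been converted while counting.
counts = {1: 1, NAN: 1, 2: 2}
# Unique values that still carry the original None.
uniques = [1, None, 2]

print(reindex_counts(counts, uniques))                          # [1, None, 2]
print(reindex_counts(counts, uniques, normalize_missing=True))  # [1, 1, 2]
```

Normalizing the missing marker on the lookup side plays the role that fillna(np.NaN) plays in the snippet above.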

@tomspur I am not sure what you are doing here at all. see the OP for the solution, which is a simple re-index. this PR is way out of scope.

@jreback
The re-index seems to work for sort=False and non-categorical indices and I could add that in this PR.

I wanted to further fix this for sort=True, when several values have the same count, e.g. this from the tests that were added in this PR:

In [2]: s = Series(list('bacaef'))

In [3]: s.value_counts(sort=True)
Out[3]:
a    2
c    1
e    1
b    1
f    1
dtype: int64

And I wanted to have them in the order abcef as well in that case.

I'll have a look then at sort=False for now as you suggested and leave sort=True for another PR
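
For reference, the tie-handling wished for in the sort=True case amounts to a stable sort on the counts. The sketch below shows this in pure Python; `sorted_counts` is a hypothetical helper and not how pandas sorts, it only illustrates the expected order.

```python
from collections import Counter


def sorted_counts(values, ascending=False):
    """Sort counts, leaving tied values in order of first appearance."""
    first_seen = list(Counter(values).items())
    # sorted() is stable, and reverse=True still keeps equal keys in their
    # original relative order.
    return sorted(first_seen, key=lambda kv: kv[1], reverse=not ascending)


print(sorted_counts(list("bacaef")))
# [('a', 2), ('b', 1), ('c', 1), ('e', 1), ('f', 1)]
```

With ascending=True the same helper gives b, c, e, f, a: the ties keep their first-seen order in both directions.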

jreback

see comments

jreback · 2018-12-23T15:54:55Z

doc/source/whatsnew/v0.24.0.rst

@@ -1634,6 +1634,7 @@ Other

 - Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before Pandas. (:issue:`24113`)
 - Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
+- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`)


move to api breaking changes

Done and pushed

jreback · 2018-12-23T15:55:33Z

pandas/core/algorithms.py

@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

    if sort:
        result = result.sort_values(ascending=ascending)
+    elif bins is None:
+        uniq = unique(values)
+        if not isinstance(result.index, CategoricalIndex):


why is this check needed? (or maybe it just needs to be for an ordered categorical)

The issue was the test TestCategoricalSeriesAnalytics.test_value_counts:

cats = Categorical(list('abcccb'), categories=list('cabd'))
s = Series(cats, name='xxx')
res = s.value_counts(sort=False)

which returns 0 for the d category as well, which is not in unique(values). Is there another way to get access to that initial categorical from the index?
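
The conflict can be sketched in pure Python. This is an analogy only: `categorical_counts` and `unique_in_order` are hypothetical helpers that mimic counting over declared categories and collecting observed values, not the pandas code paths.

```python
def categorical_counts(values, categories):
    """Count values over a declared category list, including unused categories."""
    counts = {category: 0 for category in categories}
    for value in values:
        counts[value] += 1
    return counts


def unique_in_order(values):
    """Distinct values in order of first appearance."""
    return list(dict.fromkeys(values))


values = list("abcccb")
categories = list("cabd")

counts = categorical_counts(values, categories)
print(counts)                   # {'c': 3, 'a': 1, 'b': 2, 'd': 0}
print(unique_in_order(values))  # ['a', 'b', 'c']

# Reordering the counts by the observed unique values silently drops 'd'.
reordered = {key: counts[key] for key in unique_in_order(values)}
print(reordered)                # {'a': 1, 'b': 2, 'c': 3}
```

Under these assumptions, reordering by the observed values loses the zero count for d, which is why the categorical case is skipped in the diff above.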

jreback · 2018-12-23T15:56:19Z

pandas/tests/test_algos.py

@@ -962,6 +962,52 @@ def test_value_counts_uint64(self):
        if not compat.is_platform_32bit():
            tm.assert_series_equal(result, expected)

+    def test_value_counts_nonsorted_single_occurance(self):


parametrize on sort

Done and pushed

jreback · 2018-12-23T15:56:33Z

pandas/tests/test_algos.py

+        vc = s.value_counts(sort=False, ascending=True)
+        tm.assert_series_equal(Series(vc.index), s)
+
+    @pytest.mark.xfail(reason="sort=True does not guarantee the same order")


why is this xfail?

I could not get it working for the sort=True case and left the tests only for possible future fixing... Would you prefer deleting them and adding them when sort=True works as well?

jreback · 2018-12-23T15:57:03Z

pandas/tests/test_algos.py

+        vc = s.value_counts(sort=True, ascending=True)
+        tm.assert_series_equal(Series(vc.index), s)
+
+    def test_value_counts_nonsorted_double_occurance(self):


parametrize these.

The double_occurance tests have different expected results and I wouldn't parametrize it due to that. Or would you do this as well in that case?

I c, ok then, parameterize over ascending though

jreback · 2018-12-23T17:50:43Z

pandas/core/algorithms.py

@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

    if sort:
        result = result.sort_values(ascending=ascending)
+    elif bins is None:
+        uniq = unique(values)


so do uniq = unique(values) for the non-EA case (above),
and do uniq = Series(values)._values.unique() for the EA case. Though this means computing it twice; maybe have to work on that.

I added it above, although it is not computed twice now. Did you mean it like this?

jreback · 2018-12-23T19:09:51Z

pandas/core/algorithms.py


            if not isinstance(keys, Index):
                keys = Index(keys)
            result = Series(counts, index=keys, name=name)

    if sort:
        result = result.sort_values(ascending=ascending)
+    elif bins is None:
+        if not isinstance(result.index, ABCCategoricalIndex):


this should not be necessary

Unfortunately it is, for cases where the categories contain more elements than the values: #24302 (comment)

it is not necessary otherwise the impl is incorrect

categoricals uniq already handles this

It doesn't seem so as unique does not contain d in this example:

In [31]: s = pd.Series(pd.Categorical(list('baabc'), categories=list('abcd')))

In [32]: s.value_counts()
Out[32]:
b    2
a    2
c    1
d    0
dtype: int64

In [33]: pd.unique(s)
Out[33]:
[b, a, c]
Categories (3, object): [b, a, c]

https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L2267 also mentions that unused categories are not returned.

It would be consistent to not return it in the value_counts in the above example as well, but would change the current behaviour... What do you think?

jreback · 2018-12-23T19:12:47Z

pandas/tests/test_algos.py

+        vc = s.value_counts(sort=True, ascending=True)
+        tm.assert_series_equal(Series(vc.index), s)
+
+    def test_value_counts_nonsorted_double_occurance(self):


I c, ok then, parameterize over ascending though

WillAyd

Can you merge master?

WillAyd · 2019-02-27T23:29:23Z

doc/source/whatsnew/v0.24.0.rst

@@ -385,6 +385,7 @@ Backwards incompatible API changes
 - :func:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)
 - The column order of the resultant :class:`DataFrame` from :meth:`MultiIndex.to_frame` is now guaranteed to match the :attr:`MultiIndex.names` order. (:issue:`22420`)
 - :func:`pd.offsets.generate_range` argument ``time_rule`` has been removed; use ``offset`` instead (:issue:`24157`)
+- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`)


Move to 0.25

WillAyd · 2019-03-19T03:43:11Z

Closing as stale. Ping if you'd like to continue

jreback requested changes Dec 16, 2018

TomAugspurger reviewed Dec 17, 2018

jreback requested changes Dec 18, 2018

tomspur force-pushed the series_ordering branch 3 times, most recently from a119a45 to 0dc7a21 Compare December 21, 2018 12:20

gfyoung added the Enhancement, Reshaping, and Algos labels Dec 22, 2018

gfyoung reviewed Dec 22, 2018

doc/source/whatsnew/v0.24.0.rst (outdated)

gfyoung reviewed Dec 22, 2018

pandas/tests/test_algos.py (outdated)

tomspur added 3 commits December 23, 2018 14:35

Series.value_counts: Preserve original ordering when using sort=False

0270839

Ensure that value_counts returns the same ordering of the indices than the input object when sorting the values no matter if it is ascending or descending. This fixes pandas-dev#12679.

Only reindex value_counts if bins is None

c857119

Modularize value_counts tests and xfail sort=True ones

e966aa7

tomspur force-pushed the series_ordering branch from 0dc7a21 to e966aa7 Compare December 23, 2018 13:35

jreback requested changes Dec 23, 2018

tomspur added 2 commits December 23, 2018 19:15

Move whatsnew entry to correct place

b827df6

Use ABCCategoricalIndex instead of CategoricalIndex

453d3df

jreback requested changes Dec 23, 2018

tomspur added 2 commits December 23, 2018 20:23

Calculate unique for the EA vs non-EA case

405ac0e

Parametrize value_counts test

355f872

tomspur force-pushed the series_ordering branch from 1dded41 to 355f872 Compare December 23, 2018 20:34

WillAyd requested changes Feb 27, 2019

WillAyd closed this Mar 19, 2019

has2k1 mentioned this pull request Oct 1, 2019

BUG: Fix Series.sort_values descending & mergesort #28698

Closed

5 tasks

realead mentioned this pull request Jan 1, 2021

ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering #12679

Closed
