Skip to content

Series.value_counts: Preserve original ordering #24302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

tomspur
Copy link

@tomspur tomspur commented Dec 16, 2018

Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

@pep8speaks
Copy link

pep8speaks commented Dec 16, 2018

Hello @tomspur! Thanks for updating the PR.

Line 966:80: E501 line too long (82 > 79 characters)

Comment last updated on December 23, 2018 at 20:34 Hours UTC

tm.assert_series_equal(Series(vc.index), s)


def test_original_ordering_value_counts2():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have tests for value_counts, don’t add a new file just change the tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll move them to that location as well.

@@ -2660,16 +2661,6 @@ def sort_values(self, axis=0, ascending=True, inplace=False,
raise ValueError("This Series is a view of some other array, to "
"sort in-place you must create a copy")

def _try_kind_sort(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why you changing this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't get it working to separate the sorting and ascending vs non-ascending ordering further below in the code and changed it to pass it on to nargsort that does both at once

# Use same index as the original values
if result.index.isna().sum() > 0:
fill_value = result[result.index.isna()].values[0]
result = result.reindex(unique(values), fill_value=fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? this is way beyond scope

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that _value_counts_arraylike replaces all generic nan values (e.g. None) with np.NaN and reindexing has a mismatch between the original None and the new np.NaN in the index. The fill_value replaces the new missing value with the original value again.

Alternatively, it is also possible to use a generic values = values.fillna(np.NaN) to replace the Nones with np.NaNs if there are nans in the index. I'll push a follow up commit about this in a bit. Would that be better than the above?

@codecov
Copy link

codecov bot commented Dec 16, 2018

Codecov Report

Merging #24302 into master will decrease coverage by 49.27%.
The diff coverage is 18.18%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24302       +/-   ##
===========================================
- Coverage   92.28%      43%   -49.28%     
===========================================
  Files         162      162               
  Lines       51827    51829        +2     
===========================================
- Hits        47827    22289    -25538     
- Misses       4000    29540    +25540
Flag Coverage Δ
#multiple ?
#single 43% <18.18%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 49.84% <0%> (-45.27%) ⬇️
pandas/core/series.py 49.32% <100%> (-44.38%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
... and 122 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df3b045...f562ded. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 16, 2018

Codecov Report

Merging #24302 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24302      +/-   ##
==========================================
+ Coverage    92.3%    92.3%   +<.01%     
==========================================
  Files         163      163              
  Lines       51947    51953       +6     
==========================================
+ Hits        47949    47955       +6     
  Misses       3998     3998
Flag Coverage Δ
#multiple 90.71% <100%> (ø) ⬆️
#single 42.98% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 95.15% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cd077a...355f872. Read the comment docs.

@@ -706,6 +706,19 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
keys = Index(keys)
result = Series(counts, index=keys, name=name)

# Use same index as the original values
if result.index.isna().sum() > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this section. Are there cases where values.fillna(np.nan) does anything?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What basically happens a few lines further is a reindex of a value_counts:

vals = [1, None, 2, 2]
vals.value_counts(dropna=False).reindex(pd.Series(pd.unique([1, None, 2, 2])))

The value_counts transforms the None into NaN in the index, so that the count will be NaN as well (and this value count is removed from the returning reindexed Series).

The fillna is used to transform the None from the reindex as well, so that this is executed instead, so that the resulting Series still contains the value count:

pd.Series(vals).value_counts(dropna=False).reindex(pd.Series(pd.unique(vals)).fillna(np.NaN)

I hope this makes sense, because I am quite confused as well from times. If you have another solution to fix this edge case, please let me know...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomspur I am not sure what you are doing here at all. see the OP for the solution, which is a simple re-index. this PR is way out of scope.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
The re-index seems to work for sort=False and non-categorical indices and I could add that in this PR.

I wanted to further fix this for sort=True, when several values have the same count, e.g. this from the tests that were added in this PR:

In [2]: Series(list('bacaef'))
In [3]: s.value_counts(sort=True)                                                                                                                                                                                                             
Out[3]: 
a    2
c    1
e    1
b    1
f    1
dtype: int64

And I wanted to have them in the order abcef as well in that case.

I'll have a look then at sort=False for now as you suggested and leave sort=True for another PR

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@tomspur tomspur force-pushed the series_ordering branch 3 times, most recently from a119a45 to 0dc7a21 Compare December 21, 2018 12:20
@gfyoung gfyoung added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Dec 22, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
@@ -1634,6 +1634,7 @@ Other

- Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before Pandas. (:issue:`24113`)
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to api breaking changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed

@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

if sort:
result = result.sort_values(ascending=ascending)
elif bins is None:
uniq = unique(values)
if not isinstance(result.index, CategoricalIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this check needed? (or maybe it just needs to be for an ordered categorical)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue was the test TestCategoricalSeriesAnalytics.test_value_counts:

cats = Categorical(list('abcccb'), categories=list('cabd'))
s = Series(cats, name='xxx')
res = s.value_counts(sort=False)

which returns 0 for the dcategory as well, which is not in the unique(values). Is there another possibility to get access to that initial categorical from the index?

@@ -962,6 +962,52 @@ def test_value_counts_uint64(self):
if not compat.is_platform_32bit():
tm.assert_series_equal(result, expected)

def test_value_counts_nonsorted_single_occurance(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paramterize on sort

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed

vc = s.value_counts(sort=False, ascending=True)
tm.assert_series_equal(Series(vc.index), s)

@pytest.mark.xfail(reason="sort=True does not guarantee the same order")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this xfail?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not get it working for the sort=True case and left the tests only for possible future fixing... Would you prefer deleting them and adding them when sort=True works as well?

vc = s.value_counts(sort=True, ascending=True)
tm.assert_series_equal(Series(vc.index), s)

def test_value_counts_nonsorted_double_occurance(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parametrize these.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The double_occurance tests have different expected results and I wouldn't parametrize it due to that. Or would you do this as well in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I c, ok then, parameterize over ascending though

@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

if sort:
result = result.sort_values(ascending=ascending)
elif bins is None:
uniq = unique(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so do
uniq = uniques(values) for the non-EA case (above)
and do uniq = Series(values)._values.unique() for the EA case. though this means computing it twice. maybe have to work on that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it above, although it is not computed twice now. Did you mean it like this?


if not isinstance(keys, Index):
keys = Index(keys)
result = Series(counts, index=keys, name=name)

if sort:
result = result.sort_values(ascending=ascending)
elif bins is None:
if not isinstance(result.index, ABCCategoricalIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be necessary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It unfortunately is for cases, where the categories contain more elements than the values: #24302 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not necessary otherwise the impl is incorrect

categoricals uniq already handles this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem so as unique does not contain d in this example:


In [31]: s = pd.Series(pd.Categorical(list('baabc'), categories=list('abcd')))                                                                                                                                                                

In [32]: s.value_counts()                                                                                                                                                                                                                     
Out[32]: 
b    2
a    2
c    1
d    0
dtype: int64

In [33]: pd.unique(s)                                                                                                                                                                                                                         
Out[33]: 
[b, a, c]
Categories (3, object): [b, a, c]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L2267 also mentions that unused categories are not returned.

It would be consistent to not return it in the value_counts in the above example as well, but would change the current behaviour... What do you think?

vc = s.value_counts(sort=True, ascending=True)
tm.assert_series_equal(Series(vc.index), s)

def test_value_counts_nonsorted_double_occurance(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I c, ok then, parameterize over ascending though

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master?

@@ -385,6 +385,7 @@ Backwards incompatible API changes
- :func:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`)
- The column order of the resultant :class:`DataFrame` from :meth:`MultiIndex.to_frame` is now guaranteed to match the :attr:`MultiIndex.names` order. (:issue:`22420`)
- :func:`pd.offsets.generate_range` argument ``time_rule`` has been removed; use ``offset`` instead (:issue:`24157`)
- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to 0.25

@WillAyd
Copy link
Member

WillAyd commented Mar 19, 2019

Closing as stale. Ping if you'd like to continue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering
6 participants