-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Series.value_counts: Preserve original ordering #24302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @tomspur! Thanks for updating the PR.
Comment last updated on December 23, 2018 at 20:34 Hours UTC |
tm.assert_series_equal(Series(vc.index), s) | ||
|
||
|
||
def test_original_ordering_value_counts2(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have tests for value_counts, don’t add a new file just change the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll move them to that location as well.
pandas/core/series.py
Outdated
@@ -2660,16 +2661,6 @@ def sort_values(self, axis=0, ascending=True, inplace=False, | |||
raise ValueError("This Series is a view of some other array, to " | |||
"sort in-place you must create a copy") | |||
|
|||
def _try_kind_sort(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why you changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't get it working to separate the sorting and ascending vs non-ascending ordering further below in the code and changed it to pass it on to nargsort
that does both at once
pandas/core/algorithms.py
Outdated
# Use same index as the original values | ||
if result.index.isna().sum() > 0: | ||
fill_value = result[result.index.isna()].values[0] | ||
result = result.reindex(unique(values), fill_value=fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? this is way beyond scope
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that _value_counts_arraylike
replaces all generic nan values (e.g. None
) with np.NaN
and reindexing has a mismatch between the original None
and the new np.NaN
in the index. The fill_value replaces the new missing value with the original value again.
Alternatively, it is also possible to use a generic values = values.fillna(np.NaN)
to replace the None
s with np.NaN
s if there are nans in the index. I'll push a follow up commit about this in a bit. Would that be better than the above?
Codecov Report
@@ Coverage Diff @@
## master #24302 +/- ##
===========================================
- Coverage 92.28% 43% -49.28%
===========================================
Files 162 162
Lines 51827 51829 +2
===========================================
- Hits 47827 22289 -25538
- Misses 4000 29540 +25540
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24302 +/- ##
==========================================
+ Coverage 92.3% 92.3% +<.01%
==========================================
Files 163 163
Lines 51947 51953 +6
==========================================
+ Hits 47949 47955 +6
Misses 3998 3998
Continue to review full report at Codecov.
|
pandas/core/algorithms.py
Outdated
@@ -706,6 +706,19 @@ def value_counts(values, sort=True, ascending=False, normalize=False, | |||
keys = Index(keys) | |||
result = Series(counts, index=keys, name=name) | |||
|
|||
# Use same index as the original values | |||
if result.index.isna().sum() > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this section. Are there cases where values.fillna(np.nan)
does anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What basically happens a few lines further is a reindex of a value_counts:
vals = [1, None, 2, 2]
vals.value_counts(dropna=False).reindex(pd.Series(pd.unique([1, None, 2, 2])))
The value_counts transforms the None into NaN in the index, so that the count will be NaN as well (and this value count is removed from the returning reindexed Series).
The fillna is used to transform the None from the reindex as well, so that this is executed instead, so that the resulting Series still contains the value count:
pd.Series(vals).value_counts(dropna=False).reindex(pd.Series(pd.unique(vals)).fillna(np.NaN)
I hope this makes sense, because I am quite confused as well from times. If you have another solution to fix this edge case, please let me know...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomspur I am not sure what you are doing here at all. see the OP for the solution, which is a simple re-index. this PR is way out of scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
The re-index seems to work for sort=False
and non-categorical indices and I could add that in this PR.
I wanted to further fix this for sort=True
, when several values have the same count, e.g. this from the tests that were added in this PR:
In [2]: Series(list('bacaef'))
In [3]: s.value_counts(sort=True)
Out[3]:
a 2
c 1
e 1
b 1
f 1
dtype: int64
And I wanted to have them in the order abcef
as well in that case.
I'll have a look then at sort=False
for now as you suggested and leave sort=True
for another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
a119a45
to
0dc7a21
Compare
Ensure that value_counts returns the same ordering of the indices than the input object when sorting the values no matter if it is ascending or descending. This fixes pandas-dev#12679.
0dc7a21
to
e966aa7
Compare
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -1634,6 +1634,7 @@ Other | |||
|
|||
- Bug where C variables were declared with external linkage causing import errors if certain other C libraries were imported before Pandas. (:issue:`24113`) | |||
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`) | |||
- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to api breaking changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed
pandas/core/algorithms.py
Outdated
@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False, | |||
|
|||
if sort: | |||
result = result.sort_values(ascending=ascending) | |||
elif bins is None: | |||
uniq = unique(values) | |||
if not isinstance(result.index, CategoricalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check needed? (or maybe it just needs to be for an ordered categorical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue was the test TestCategoricalSeriesAnalytics.test_value_counts
:
cats = Categorical(list('abcccb'), categories=list('cabd'))
s = Series(cats, name='xxx')
res = s.value_counts(sort=False)
which returns 0
for the d
category as well, which is not in the unique(values)
. Is there another possibility to get access to that initial categorical from the index?
pandas/tests/test_algos.py
Outdated
@@ -962,6 +962,52 @@ def test_value_counts_uint64(self): | |||
if not compat.is_platform_32bit(): | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_value_counts_nonsorted_single_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paramterize on sort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed
pandas/tests/test_algos.py
Outdated
vc = s.value_counts(sort=False, ascending=True) | ||
tm.assert_series_equal(Series(vc.index), s) | ||
|
||
@pytest.mark.xfail(reason="sort=True does not guarantee the same order") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this xfail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not get it working for the sort=True
case and left the tests only for possible future fixing... Would you prefer deleting them and adding them when sort=True
works as well?
pandas/tests/test_algos.py
Outdated
vc = s.value_counts(sort=True, ascending=True) | ||
tm.assert_series_equal(Series(vc.index), s) | ||
|
||
def test_value_counts_nonsorted_double_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parametrize these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double_occurance
tests have different expected results and I wouldn't parametrize it due to that. Or would you do this as well in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, ok then, parameterize over ascending though
pandas/core/algorithms.py
Outdated
@@ -708,6 +708,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False, | |||
|
|||
if sort: | |||
result = result.sort_values(ascending=ascending) | |||
elif bins is None: | |||
uniq = unique(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so do
uniq = uniques(values)
for the non-EA case (above)
and do uniq = Series(values)._values.unique()
for the EA case. though this means computing it twice. maybe have to work on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it above, although it is not computed twice now. Did you mean it like this?
|
||
if not isinstance(keys, Index): | ||
keys = Index(keys) | ||
result = Series(counts, index=keys, name=name) | ||
|
||
if sort: | ||
result = result.sort_values(ascending=ascending) | ||
elif bins is None: | ||
if not isinstance(result.index, ABCCategoricalIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not be necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It unfortunately is for cases, where the categories contain more elements than the values: #24302 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not necessary otherwise the impl is incorrect
categoricals uniq already handles this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem so as unique
does not contain d
in this example:
In [31]: s = pd.Series(pd.Categorical(list('baabc'), categories=list('abcd')))
In [32]: s.value_counts()
Out[32]:
b 2
a 2
c 1
d 0
dtype: int64
In [33]: pd.unique(s)
Out[33]:
[b, a, c]
Categories (3, object): [b, a, c]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L2267 also mentions that unused categories are not returned.
It would be consistent to not return it in the value_counts
in the above example as well, but would change the current behaviour... What do you think?
pandas/tests/test_algos.py
Outdated
vc = s.value_counts(sort=True, ascending=True) | ||
tm.assert_series_equal(Series(vc.index), s) | ||
|
||
def test_value_counts_nonsorted_double_occurance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, ok then, parameterize over ascending though
1dded41
to
355f872
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge master?
@@ -385,6 +385,7 @@ Backwards incompatible API changes | |||
- :func:`read_csv` will now raise a ``ValueError`` if a column with missing values is declared as having dtype ``bool`` (:issue:`20591`) | |||
- The column order of the resultant :class:`DataFrame` from :meth:`MultiIndex.to_frame` is now guaranteed to match the :attr:`MultiIndex.names` order. (:issue:`22420`) | |||
- :func:`pd.offsets.generate_range` argument ``time_rule`` has been removed; use ``offset`` instead (:issue:`24157`) | |||
- :meth:`Series.value_counts` returns the counts in the same ordering as the original series when using ``sort=False`` (:issue:`12679`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to 0.25
Closing as stale. Ping if you'd like to continue |
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.
git diff upstream/master -u -- "*.py" | flake8 --diff