-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CLN: Make internal numpy sort
and argsort
use kind="stable"
#53829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sort
and argsort
use kind="stable"
sort
and argsort
use kind="stable"
sort
and argsort
use kind="stable"
Can you run the ASV benchmarks for groupby and algos to see if there are any regressions? I'm a little hesitant to blanket apply this without knowing moire about its impacts. Doubly so because the stable sort algorithm within NumPy can vary by dtype (?) |
@WillAyd I've run the benchmarks for |
Due to the aforementioned error, I cannot use Benchmarking Excluding nan and null
Only those with
|
Cool thanks for providing those. That's a decent amount of performance regression. I know this was discussed on the @pandas-dev/pandas-core call so ultimately up to the group |
A lot looks like noise, no? |
Hmm not sure. I don't know what stable sort implementations are actually being deferred to but I would think NumPy chose a quicksort as the default for CPU and memory efficiency |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
I'm no expert here, but isn't quicksort unstable? It's true that this was discussed on the call, but to be honest I hadn't appreciated the perf implications, and feel a bit uneasy about them. Can't really tell how much noise there is in the results though - @rhshadrach just for my understanding, how do you determine that? |
quicksort is not stable but mergesort is |
Wikipedia has a pretty nice summary of the different algorithms: https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms If you wanted to go really deep Knuth's the Art of Computer Programming Volume 3 is a great resource |
Best way is to do multiple runs to start narrowing it down. I'd consider < 5% suspect. |
@Charlie-XIAO can you merge main again; I'd like to run some more profiling |
@rhshadrach Sure, merged now. |
I ran the following two commands:
For some reason I no longer get a summary at the end of the output for the first command. Output
|
I think the benchmarks above can be explained by timsort's good performance on already sorted (or almost sorted) data:
However, I haven't checked whether these tests have sorted inputs. |
Oh wow that is an interesting. I'm guessing for non-sorted quicksort is still faster? In [9]: data = np.random.rand(size)
In [10]: %timeit np.sort(data, kind="quicksort")
750 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit np.sort(data, kind="stable")
878 ms ± 2.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) I wonder if there is a way to identify where we expect one sorting method to generally be better than the other |
I did a bit of searching, but not finding anyone doing this for timsort/quicksort specifically. There are other hybrid algorithms like fluxsort that are in this spirit, but doing things which are must more advanced than say taking a linear pass to decide between timsort and quicksort. |
I meant moreso in our codebase; if we think something is mostly sorted when hitting a certain codepath then the timsort would probably be better, if it is ever reasonable to make that assumption |
Sounds like we need more discussion in the issue before moving forward with a PR so closing this for now |
kind="stable"
? #53558In this PR I tried to find out all places where pandas internally uses
np.sort()
andnp.argsort()
and thatkind="stable"
may be necessary. In case I missed some of them, the following is a log ofgit grep
for.sort(
and.argsort(
. Those marked with[y]
are what I have changed and those marked with[n]
are what I think does not need to be changed (e.g. list sort; sorting unique elements; kind can be specified). Those with no labels are tests and I made no changes to them, except that I have changed allkind="mergesort"
tokind="stable"
based on the "explicit better than implicit" principle.git grep
logBesides, I manually reverted the changes in #53548 and they (at least) passed on my local machine. Yet I cannot guarantee that I have made all necessary changes or that all the changes I have made are proper, so I would be glad if maintainers can help with the inspection.
Note 1 (in
pandas/core/methods/selectn.py::SelectNSeries::compute
): It seems that whenn
is less than the length innlargest
andnsmallest
, pandas is already using a stable argsort but otherwise it callsDataFrame.sort_values
without specifyingkind="stable"
. In the current documentation, it says thatnlargest(n)
should be equivalent tosort_values(ascending=False).head(n)
but this does not seem to be correct withnumpy>=1.25
with AVX instructions. Alternatively it can be equivalent tosort_values(ascending=False, kind="stable").head(n)
usingkind="stable"
internally. Should we make this change or did I miss anything?Note 2 (in
pandas/core/groupby/groupby.py::GroupBy::_value_counts
): Here I specifiedkind="stable"
in bothsort_values
andsort_index
to obtain a stable result. The expected result inpandas/tests/groupby/test_value_counts.py::test_categorical_single_grouper_observed_false
is will be changed consequently.