Skip to content

Add DataFrame.sort and Column.sort #234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 24, 2023

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Aug 22, 2023

Feedback I received when showing this to people is that

col.get_rows(col.sorted_indices())

isn't ergonomic

It's also a performance footgun

In [1]: df = pl.DataFrame({'a': np.random.randint(0, 4, size=100_000_000)})

In [2]: %timeit df[df['a'].arg_sort()]
2.56 s ± 80.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.sort('a')
700 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@kkraus14
Copy link
Collaborator

That performance difference for Polars is quite surprising. Does df.__getitem__ do a lot of extra bounds checking or something else that could be bypassed in the case where you know the indices being passed to it are generated with guarantees?

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds very reasonable to add, LGTM.

It would probably be useful to edit the sorted_indices docstrings, since it contains:

        If you need to sort the DataFrame, you can simply do::

            df.get_rows(df.sorted_indices(keys))

instead, that can now refer to the sort method

@MarcoGorelli
Copy link
Contributor Author

That performance difference for Polars is quite surprising. Does df.__getitem__ do a lot of extra bounds checking or something else that could be bypassed in the case where you know the indices being passed to it are generated with guarantees?

there are no guarantees in this case

anyway, you can observe the same in numpy:

In [14]: arr = np.random.randint(0, 4, size=100_000_000)

In [15]: %time arr[np.argsort(arr)]
CPU times: user 767 ms, sys: 2.13 s, total: 2.89 s
Wall time: 2.9 s
Out[15]: array([0, 0, 0, ..., 3, 3, 3])

In [16]: %time np.sort(arr)
CPU times: user 253 ms, sys: 187 ms, total: 440 ms
Wall time: 439 ms
Out[16]: array([0, 0, 0, ..., 3, 3, 3])

@MarcoGorelli MarcoGorelli merged commit 77bc66b into data-apis:main Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants