Add DataFrame.sort and Column.sort #234

MarcoGorelli · 2023-08-22T17:13:29Z

Feedback I received when showing this to people is that

col.get_rows(col.sorted_indices())

isn't ergonomic

It's also a performance footgun

In [1]: df = pl.DataFrame({'a': np.random.randint(0, 4, size=100_000_000)})

In [2]: %timeit df[df['a'].arg_sort()]
2.56 s ± 80.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.sort('a')
700 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
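To make the ergonomics point concrete, here is a minimal sketch of the two spellings side by side. `ToyColumn` is a hypothetical stand-in written for illustration only, not the standard's actual class; it only borrows the method names `sorted_indices`, `get_rows`, and `sort` from this PR.

```python
from typing import List


class ToyColumn:
    """Hypothetical stand-in for a column; not the standard's actual class."""

    def __init__(self, values: List[int]) -> None:
        self.values = list(values)

    def sorted_indices(self) -> List[int]:
        # Positions that would put the values in ascending order (stable).
        return sorted(range(len(self.values)), key=self.values.__getitem__)

    def get_rows(self, indices: List[int]) -> "ToyColumn":
        # Gather the values at the given positions into a new column.
        return ToyColumn([self.values[i] for i in indices])

    def sort(self) -> "ToyColumn":
        # Direct sort: no intermediate index list and no gather step.
        return ToyColumn(sorted(self.values))


col = ToyColumn([3, 1, 2, 1, 0])
two_step = col.get_rows(col.sorted_indices())  # the existing spelling
one_step = col.sort()                          # the spelling this PR adds
print(two_step.values == one_step.values)      # prints True
```

Both spellings give the same values; the direct `sort` just skips materialising the index list and gathering through it.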

kkraus14 · 2023-08-23T22:16:53Z

That performance difference for Polars is quite surprising. Does df.__getitem__ do a lot of extra bounds checking or something else that could be bypassed in the case where you know the indices being passed to it are generated with guarantees?

rgommers

This sounds very reasonable to add, LGTM.

It would probably be useful to edit the sorted_indices docstrings, since they contain:

        If you need to sort the DataFrame, you can simply do::

            df.get_rows(df.sorted_indices(keys))

Instead, that can now refer to the sort method.

MarcoGorelli · 2023-08-24T08:27:50Z

> That performance difference for Polars is quite surprising. Does df.__getitem__ do a lot of extra bounds checking or something else that could be bypassed in the case where you know the indices being passed to it are generated with guarantees?

there are no guarantees in this case

anyway, you can observe the same in numpy:

In [14]: arr = np.random.randint(0, 4, size=100_000_000)

In [15]: %time arr[np.argsort(arr)]
CPU times: user 767 ms, sys: 2.13 s, total: 2.89 s
Wall time: 2.9 s
Out[15]: array([0, 0, 0, ..., 3, 3, 3])

In [16]: %time np.sort(arr)
CPU times: user 253 ms, sys: 187 ms, total: 440 ms
Wall time: 439 ms
Out[16]: array([0, 0, 0, ..., 3, 3, 3])

add sort

c11ea66

MarcoGorelli requested review from kkraus14 and rgommers August 23, 2023 10:02

kkraus14 approved these changes Aug 23, 2023

rgommers added the API design label Aug 24, 2023

rgommers approved these changes Aug 24, 2023

update sorted_indices docs

542f749

MarcoGorelli merged commit 77bc66b into data-apis:main Aug 24, 2023
