add `Column.unique_indices` #151

MarcoGorelli · 2023-04-21T17:53:32Z

closes #135

(related: I'll make a PR next week to unify keys and labels, there are some inconsistencies)

kkraus14 · 2023-04-25T19:58:49Z

Should we note how null values are or aren't handled here? (and possibly NaN values?)

rgommers · 2023-04-26T13:54:57Z

> Should we note how null values are or aren't handled here? (and possibly NaN values?)

For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent. I'll also note that:

- numpy.unique has an equal_nan=True keyword
- pandas.Series doesn't document what it's doing, but considers them equal

pandas.DataFrame doesn't have a unique method - do we need it here at the dataframe level?

For null values, should they simply be discarded? They're missing, and I'd expect unique to return a set of the non-missing data in the column.

MarcoGorelli · 2023-04-26T15:20:16Z

> pandas.DataFrame doesn't have a unique method - do we need it here at the dataframe level?

TBH we can probably just keep it out for now

I'd keep null values, as a user I'd prefer to know if there are null values

MarcoGorelli · 2023-04-26T15:29:14Z

> For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent

let's bring this up on the call

rgommers · 2023-04-27T10:20:00Z

> > For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent
>
> let's bring this up on the call

Actually, this is also covered in #128 (comment). Let's follow that (say "it's implementation-specific, NaN position isn't guaranteed"). I feel like we've been over this a lot, and it keeps coming up on various PRs. So I suggest we move this along and ignore my proposal to make it match the array standard.

Probably deserves a separate design topics page on NaN and null handling that can be linked to, and we use as a reference whenever this comes up? I can write that.

jorisvandenbossche · 2023-04-27T13:04:21Z

> I'd keep null values, as a user I'd prefer to know if there are null values

Another option is to have a keyword for this (like reductions also have one, and in pandas groupby() and value_counts() also have a keyword for this).

A case where you might not want the null values: if you use unique to mimic iterating over groupby groups (which we decided to not support), getting the null wouldn't work in a subsequent loop subsetting the dataframe (df.get_rows_by_mask(df.get_column_by_name(col) == unique_val))
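To make the case above concrete, here is a minimal sketch in plain Python. It does not use the dataframe API: the column is modeled as a list, `None` stands in for null, and `eq_with_nulls` and `rows_matching` are hypothetical helpers. It assumes that comparing anything with null yields null, and that a null mask entry selects no row.

```python
# Hypothetical stand-ins: a column is a plain list and None marks a missing value.

def eq_with_nulls(value, other):
    """Equality under the assumption that any comparison with null yields null."""
    if value is None or other is None:
        return None
    return value == other


def rows_matching(column, unique_val):
    """Positions whose comparison is definitely True (a null result selects nothing)."""
    return [i for i, v in enumerate(column) if eq_with_nulls(v, unique_val) is True]


column = ["a", None, "b", "a", None]

# Unique values as they would look if nulls were kept in the result.
unique_with_nulls = ["a", "b", None]

# Mimic iterating over groups by subsetting once per unique value.
groups = {u: rows_matching(column, u) for u in unique_with_nulls}
```

Under these assumptions the null "group" comes back empty even though the column holds two nulls, which is the failure mode described in the comment.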

kkraus14 · 2023-04-27T16:56:09Z

Should we consider making this unique_indices similar to what we did for sorted_indices? Is there ever a case where someone wants to know where they should go for the unique values as opposed to getting the output values directly?

It's cheap to go from the indices to the values, but expensive to go from the values to the indices.
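The asymmetry can be sketched in plain Python, with a list standing in for a column. `take` and `first_index_of` are hypothetical helpers for illustration, not part of the proposed API.

```python
def take(column, indices):
    """Indices -> values: one direct lookup per index."""
    return [column[i] for i in indices]


def first_index_of(column, value):
    """Values -> indices: has to scan the column until a match is found."""
    for i, v in enumerate(column):
        if v == value:
            return i
    raise ValueError(f"{value!r} not found in column")


column = [3, 1, 3, 2, 1]

# Suppose these are indices pointing at one occurrence of each unique value.
unique_idx = [0, 1, 3]

# Cheap direction: a single pass over the indices.
unique_values = take(column, unique_idx)

# Expensive direction: one scan of the column per value.
recovered_idx = [first_index_of(column, v) for v in unique_values]
```

Returning indices keeps both results within reach, whereas returning values forces a search whenever positions are needed later.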

rgommers · 2023-04-27T20:22:31Z

Some points we discussed today:

- We want to return a single nan only, not multiple. This is what existing dataframe libraries do. The array API standard's choice of multiple nans was driven by performance (it's faster on GPU); array libraries are more flop-constrained than dataframes, so for dataframes the extra computation should be fine. And it'd be less implementation work.
- We do indeed want to return indices with unique_indices, not values with unique.
  - There are use cases for getting indices from col.unique() and then using them to retrieve rows from a dataframe; the non-determinism is not always a problem (e.g., the subset keyword of pd.DataFrame.value_counts has this behavior).
- We're fine with no ordering being guaranteed, and with the function not necessarily being deterministic in case of duplicate values. No guarantees beyond "returns an index to each unique value in the column".
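A rough sketch of how these points could fit together, in plain Python with a list as the column and `None` as null. This is not the spec's implementation. The `skip_nulls` name and its default of true come from the commit list further down; returning a single index for null when `skip_nulls=False` is an assumption made here. The sketch happens to return first occurrences in order of appearance, but callers should not rely on either.

```python
import math

# Private sentinels so that every NaN (and every null) maps to a single key.
_NAN_KEY = object()
_NULL_KEY = object()


def unique_indices(column, skip_nulls=True):
    """Return one index per unique value; ordering is not part of the contract."""
    seen = {}
    for i, value in enumerate(column):
        if value is None:
            if skip_nulls:
                continue
            key = _NULL_KEY  # assumption: all nulls count as one unique value
        elif isinstance(value, float) and math.isnan(value):
            key = _NAN_KEY  # all NaNs collapse into a single entry
        else:
            key = value
        seen.setdefault(key, i)  # keep whichever index was recorded first
    return list(seen.values())


column = [1.0, float("nan"), None, 1.0, float("nan"), 2.0]

idx_default = unique_indices(column)
idx_with_nulls = unique_indices(column, skip_nulls=False)
```

Because no ordering is guaranteed, code consuming the result should compare it as a set or sort it first instead of depending on positions.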

spec/API_specification/dataframe_api/column_object.py

rgommers

LGTM too

spec/API_specification/dataframe_api/column_object.py

rgommers · 2023-05-04T21:47:51Z

Looks good now, so in it went. Thanks Marco and Keith!

add unique

69fe58e

rgommers added the API design label Apr 26, 2023

MarcoGorelli added 3 commits May 2, 2023 10:43

Merge remote-tracking branch 'upstream/main' into unique

44cee89

fixup

0530e32

punt on DataFrame.unique for now

d7aaa33

kkraus14 reviewed May 2, 2023

clarify that there are really absolutely no ordering guarantees whatsoever

160c264

kkraus14 approved these changes May 3, 2023

rgommers changed the title ~~add DataFrame.unique and Column.unique~~ add Column.unique_indices May 3, 2023

rgommers approved these changes May 3, 2023

kkraus14 reviewed May 3, 2023

MarcoGorelli added 2 commits May 4, 2023 10:21

fixup example

ab74976

make skip_nulls default true

c0cefa2

kkraus14 reviewed May 4, 2023

Merge remote-tracking branch 'upstream/main' into unique

3804f9c

kkraus14 approved these changes May 4, 2023

rgommers merged commit 1f68bdd into data-apis:main May 4, 2023

rgommers mentioned this pull request Jul 4, 2023

Add DataFrame.unique_indices #194

Merged
