Skip to content

add Column.unique_indices #151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 4, 2023
Merged

add Column.unique_indices #151

merged 8 commits into from
May 4, 2023

Conversation

MarcoGorelli
Copy link
Contributor

@MarcoGorelli MarcoGorelli commented Apr 21, 2023

closes #135

(related: I'll make a PR next week to unify keys and labels, there's some inconsistencies)

@kkraus14
Copy link
Collaborator

Should we note how null values are or aren't handled here? (and possibly NaN values?)

@rgommers
Copy link
Member

Should we note how null values are or aren't handled here? (and possibly NaN values?)

For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent. I'll also note that:

pandas.DataFrame doesn't have a unique method - do we need it here at the dataframe level?

For null values, should they simply be discarded? They're missing, and I'd expect unique to return a set of the non-missing data in the column.

@MarcoGorelli
Copy link
Contributor Author

pandas.DataFrame doesn't have a unique method - do we need it here at the dataframe level?

TBH we can probably just keep it out for now

I'd keep null values, as a user I'd prefer to know if there are null values

@MarcoGorelli
Copy link
Contributor Author

For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent

let's bring this up on the call

@rgommers
Copy link
Member

For nan, how about copying the array API's unique semantics: https://data-apis.org/array-api/draft/API_specification/generated/array_api.unique_values.html? Keeps things consistent

let's bring this up on the call

Actually, this is also covered in #128 (comment). Let's follow that (say "it's implementation-specific, NaN position isn't guaranteed"). I feel like we've been over this a lot, and it keeps on coming up on various PRs. So let's move this along I suggest, and ignore my proposal to make it match the array standard.

Probably deserves a separate design topics page on NaN and null handling that can be linked to, and we use as a reference whenever this comes up? I can write that.

@jorisvandenbossche
Copy link
Member

I'd keep null values, as a user I'd prefer to know if there are null values

Another option is to have a keyword for this (like reductions also have one, and in pandas groupby() and value_counts() also have a keyword for this).

A case where you might not want the null values: if you use unique to mimic iterating over groupby groups (which we decided to not support), getting the null wouldn't work in a subsequent loop subsetting the dataframe (df.get_rows_by_mask(df.get_column_by_name(col) == unique_val))

@kkraus14
Copy link
Collaborator

Should we consider making this unique_indices similar to what we did for sorted_indices? Is there ever a case where someone wants to know where they should go for the unique values as opposed to getting the output values directly?

It's cheap to go from the indices to the values, but expensive to go from the values to the indices.

@rgommers
Copy link
Member

Some points we discussed today:

  • We want to return a single nan only, not multiple. This is what existing dataframe libraries do. The array API standard choice for multiple nans was driven by performance (it's faster on GPU); array libraries are more flop-constrained than dataframes; for dataframes the extra computation should be fine. And it'd be less implementation work.
  • We want to indeed return indices with unique_indices, not values with unique.
    • There are use cases for getting indices from col.unique() and then using that to retrieve rows from a dataframe; the non-determinism is not always a problem (e.g., the subset keyword of pd.DataFrame.value_counts) has this behavior).
  • We're fine with no ordering is guaranteed, and the function is not necessarily even deterministic in case of duplicate value. No guarantees beyond "returns an index to each unique value in the column".

@rgommers rgommers changed the title add DataFrame.unique and Column.unique add Column.unique_indices May 3, 2023
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too

@rgommers rgommers merged commit 1f68bdd into data-apis:main May 4, 2023
@rgommers
Copy link
Member

rgommers commented May 4, 2023

Looks good now, so in it went. Thanks Marco and Keith!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Column.unique
4 participants