Specify NaN behaviour in unique()
#249
Comments
IMO this depends on the definition of "unique". If it means "value equality", it should return all occurrences of NaN, since `nan != nan`. If I'm not mistaken, in this context "unique" means value equality, since we are dealing with values and not objects (`is` comparison).
Are there clear use cases for returning multiple NaNs?
How do PyTorch and TensorFlow (and any others) handle this? Do they also have discussions about this behavior?
For PyTorch I'm not aware of a discussion about this. As for the behavior, it seems to use the value-equality approach I described in my comment above:

```python
>>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
>>> torch.unique(t)
tensor([nan, nan, 1.])
```
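As an aside not from the thread: if I recall correctly, recent NumPy versions (1.24+) expose both behaviours through an `equal_nan` keyword on `np.unique`, with the collapsing behaviour as the default. A minimal sketch:

```python
import numpy as np

a = np.array([1.0, float("nan"), 1.0, float("nan")])

# Default (equal_nan=True): NaNs are treated as equal and collapsed to one.
collapsed = np.unique(a)                   # [ 1. nan]

# equal_nan=False: every NaN is kept, matching torch.unique above.
kept = np.unique(a, equal_nan=False)       # [ 1. nan nan]
```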
```python
>>> float('nan') is float('nan')
False
```

(Personally I favor the "return one NaN" behavior.)
@Zac-HD my bad, thanks for the heads-up. I've fixed my comment above.
TensorFlow and CuPy also return multiple NaNs.
It'd be very annoying to implement on GPUs if we were to return a unique NaN. In any case, …
Hmm, that's a good point. Although it raises the question of what users do then; they usually need to handle the NaNs anyway by skipping or combining them. Your argument means adding an …
It could even be that a library which knows how it will return NaNs in …
As an aside, we discovered this behavior on the call today.

```python
In [1]: import numpy as np

In [2]: {float('nan'), float('nan')}
Out[2]: {nan, nan}

In [3]: {np.nan, np.nan}
Out[3]: {nan}
```
From a suggestion in the call, what if users filter out NaNs first:

```python
a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)
```
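Building on that suggestion, a minimal NumPy sketch (the names `a` and `r` are illustrative): filter NaNs before calling `unique`, so the library's NaN semantics never matter, then optionally re-append exactly one NaN.

```python
import numpy as np

a = np.array([1.0, np.nan, 1.0, 2.0, np.nan])

# Drop NaNs before computing uniques; NaN semantics of unique() never apply.
a_notnan = a[~np.isnan(a)]
r = np.unique(a_notnan)          # [1., 2.]

# If the caller still wants NaN represented, append a single one.
if np.isnan(a).any():
    r = np.append(r, np.nan)     # [1., 2., nan]
```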
That is the same trap I fell into: since `nan = float("nan")` binds a single NaN object, and set lookups check object identity before equality, the repeated reference collapses to one element:

```python
In [1]: nan = float("nan")

In [2]: {nan, nan}
Out[2]: {nan}
```
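The identity-versus-equality distinction explains both transcripts above. A small demonstration of standard CPython semantics (not taken from the thread):

```python
import math

nan = float("nan")

# NaN is unequal to itself under IEEE 754 rules...
assert nan != nan

# ...but set lookups short-circuit on object identity before testing
# equality, so the *same* NaN object deduplicates:
assert len({nan, nan}) == 1
assert len({math.nan, math.nan}) == 1   # math.nan is one module-level object

# Two calls to float("nan") build distinct objects, so both are kept:
assert len({float("nan"), float("nan")}) == 2
```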
Yeah, I was thinking that might be the case. Thanks for sharing Philip 🙂
A limitation with this example specifically is that boolean indexing might not be supported by an Array API library (it's optional behaviour in the spec, see #84). You could manually zip an array …
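If boolean-mask indexing is unavailable, one portable workaround is to convert the mask into integer indices first. A sketch using NumPy to stand in for an Array API namespace (it assumes `nonzero` and `take` are available):

```python
import numpy as np

a = np.array([1.0, np.nan, 2.0, np.nan])

# Integer indices of the non-NaN entries, instead of a boolean mask:
idx = np.nonzero(~np.isnan(a))[0]
a_notnan = np.take(a, idx)

r = np.unique(a_notnan)          # [1., 2.]
```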
In either the proposed …, I assume that we get the following if NaNs are unique to each other:

```python
>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan, nan, nan, nan, nan])
>>> counts
Array([1, 1, 1, 1, 1])
```

And likewise if NaNs are not unique to each other:

```python
>>> x = xp.full(5, xp.nan)
>>> values, _, _, counts = xp.unique_all(x)
>>> values
Array([nan])
>>> counts
Array([5])
```

As you can see, the first scenario is a bit awkward for finding NaN counts, as you'd still need to iterate over … Also to note, counting behaviour does not help us get a consistent count of unique values.
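Whichever convention a library picks, a caller can recover a single total NaN count by summing the counts over a NaN mask of the values. A NumPy sketch that gives the same answer under either behaviour:

```python
import numpy as np

a = np.array([1.0, np.nan, np.nan, np.nan])
values, counts = np.unique(a, return_counts=True)

# values may hold one NaN or several depending on the library/version;
# summing counts over isnan(values) gives the same total either way.
nan_total = int(counts[np.isnan(values)].sum())   # 3
non_nan = values[~np.isnan(values)]               # [1.]
```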
I think we have to make counts and values consistent with one another, regardless of which NaN approach we take. It would be nonsensical to return something like …
Comparison among environments:
Downstream Libraries
Proposal
In short, the issue is the handling of multiple NaNs.
Say we have an array with multiple NaN values:
Should NaNs be treated as unique to one another?
Or should they be treated as the same?
I have no dog in the race, but you might want to refer to some discussion at the bottom of numpy/numpy#19301 and in the NumPy mailing list that relates to a recent accidental "regression" in how `np.unique()` deals with NaNs.

In either case, simply specifying the behaviour would prevent the need for wrapper methods to standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work for both scenarios, which @asmeurer noted was not ideal and probably a failure of the Array API spec.
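Such a wrapper might look like the following, a hypothetical sketch (not the actual Hypothesis utility): it collapses however many NaNs the underlying `np.unique` returns down to exactly one.

```python
import numpy as np

def unique_collapse_nans(a):
    """Unique values of `a`, with any NaNs collapsed to one trailing NaN."""
    v = np.unique(np.asarray(a, dtype=float))
    nan_mask = np.isnan(v)
    if nan_mask.any():
        # Keep the non-NaN uniques and append a single NaN, whatever
        # the underlying library returned.
        v = np.append(v[~nan_mask], np.nan)
    return v
```

This normalisation works identically whether the library treats NaNs as unique to each other or not.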
And if deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.