-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Disallow NaN in StringArray constructor #30980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/_libs/lib.pyx
Outdated
|
||
def __cinit__(self, Py_ssize_t n, dtype dtype=np.dtype(np.object_), | ||
bint skipna=False): | ||
bint skipna=False, | ||
bint na_only=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, if people don't like this, I can make a new validator that's specific to StringArray's checking.
class StringArrayValidator(StringArray):
cdef bint is_valid_null(self, object value) except -1:
return value is C_NA
Do people have a preference?
Actually, we don't use |
pandas/core/arrays/numpy_.py
Outdated
return cls(result) | ||
|
||
@staticmethod | ||
def _from_sequence_finalize(values, copy): | ||
return values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the different name needed, and can't it just be _from_sequence
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lets StringArray customize a bit of the logic in the middle of _from_sequence
. Without this StringArray._from_sequence raised on the super()._from_sequence
, since the init would raise. This lets us still reuse PandasArray._from_sequence.
Happy to have suggestions for a better name. I didn't put any thought into this one. _finalize
isn't the best since it's not happening after _from_sequence
, it's in the middle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the call chain is roughly
StringArray._from_sequnce
-> PandasArray._from_sequnce (via the super)
-> StringArray._from_sequnce_finalize
-> StringArray._from_sequence (after the super finishes)
hopefully that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't notice that it still called super in the middle, understand now.
But, I would personally just not call super and implement all in StringArray._from_sequence
. Yes, this gives a little bit of duplication (but really not a lot IMO), but I think that is better than the additional method name and harder to understand call chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to _coerce_from_sequence_values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've inlined things. Apologies for the force push, I'm not sure what happened.
|
||
|
||
def test_from_sequnce_no_mutate(): | ||
a = np.array(["a", pd.NA], dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use np.nan or None here? Otherwise the original won't be different from the potentially mutated one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, yes. That's what I intended.
b34b600
to
bbe2196
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
thanks @TomAugspurger |
) Co-authored-by: Tom Augspurger <[email protected]>
Closes #30966
There were a few ways we could have done this. I opted for this way since it still only requires a single pass over the data to validate the values. Other ways like checking fornp.isnan
after checkingis_string_array
would have required a second pass.I changed the implementation in subsequent commits. The basic idea is the same, don't allow NaN in the array passed to StringArray, so that we only make a single pass over the data. We do this by changing
StringValidator.is_valid_null
to only allow NA. This required a small change toPandasArray._from_sequence
, since previously we relied on creating a temporarily invalid StringArray before doing an inplace__setitem__
to replace NaNs with NA.cc @tsvikas.