Skip to content

API: Disallow NaN in StringArray constructor #30980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 14, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jan 13, 2020

Closes #30966

There were a few ways we could have done this. I opted for this way since it still only requires a single pass over the data to validate the values. Other ways like checking for np.isnan after checking is_string_array would have required a second pass.

I changed the implementation in subsequent commits. The basic idea is the same, don't allow NaN in the array passed to StringArray, so that we only make a single pass over the data. We do this by changing StringValidator.is_valid_null to only allow NA. This required a small change to PandasArray._from_sequence, since previously we relied on creating a temporarily invalid StringArray before doing an inplace __setitem__ to replace NaNs with NA.

cc @tsvikas.

@TomAugspurger TomAugspurger added this to the 1.0.0 milestone Jan 13, 2020
@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data labels Jan 13, 2020

def __cinit__(self, Py_ssize_t n, dtype dtype=np.dtype(np.object_),
bint skipna=False):
bint skipna=False,
bint na_only=False):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, if people don't like this, I can make a new validator that's specific to StringArray's checking.

class StringArrayValidator(StringArray):
    cdef bint is_valid_null(self, object value) except -1:
            return value is C_NA

Do people have a preference?

@TomAugspurger
Copy link
Contributor Author

Actually, we don't use is_string_array in all that many places, and I don't think it's public, so perhaps I can just make the change to exclude NaN / None. Will explore.

return cls(result)

@staticmethod
def _from_sequence_finalize(values, copy):
return values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the different name needed, and can't it just be _from_sequence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets StringArray customize a bit of the logic in the middle of _from_sequence. Without this StringArray._from_sequence raised on the super()._from_sequence, since the init would raise. This lets us still reuse PandasArray._from_sequence.

Happy to have suggestions for a better name. I didn't put any thought into this one. _finalize isn't the best since it's not happening after _from_sequence, it's in the middle.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the call chain is roughly

StringArray._from_sequnce
  -> PandasArray._from_sequnce (via the super)
  -> StringArray._from_sequnce_finalize
-> StringArray._from_sequence  (after the super finishes)

hopefully that makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't notice that it still called super in the middle, understand now.

But, I would personally just not call super and implement all in StringArray._from_sequence. Yes, this gives a little bit of duplication (but really not a lot IMO), but I think that is better than the additional method name and harder to understand call chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to _coerce_from_sequence_values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've inlined things. Apologies for the force push, I'm not sure what happened.



def test_from_sequnce_no_mutate():
a = np.array(["a", pd.NA], dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.nan or None here? Otherwise the original won't be different from the potentially mutated one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, yes. That's what I intended.

@jorisvandenbossche jorisvandenbossche changed the title Disallow NaN in StringArray constructor API: Disallow NaN in StringArray constructor Jan 14, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@jreback jreback merged commit 3471270 into pandas-dev:master Jan 14, 2020
@jreback
Copy link
Contributor

jreback commented Jan 14, 2020

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str accessor functions returns float(NaN) instead of pd.NA
4 participants