API: Disallow NaN in StringArray constructor #30980

TomAugspurger · 2020-01-13T19:21:36Z

There were a few ways we could have done this. I opted for this one since it still requires only a single pass over the data to validate the values. Other approaches, such as checking for np.isnan after checking is_string_array, would have required a second pass.

I changed the implementation in subsequent commits. The basic idea is the same: don't allow NaN in the array passed to StringArray, and still make only a single pass over the data. We do this by changing StringValidator.is_valid_null to only allow NA. This required a small change to PandasArray._from_sequence, since previously we relied on creating a temporarily invalid StringArray and then doing an in-place __setitem__ to replace NaNs with NA.
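To illustrate the single-pass argument, here is a rough pure-Python sketch (not the Cython code in lib.pyx). The `NA` sentinel and both function names are hypothetical stand-ins; the real check lives in StringValidator.

```python
import math


class _NAType:
    """Hypothetical stand-in for the pandas NA singleton."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def _is_nan(value):
    # True only for float NaN; strings and other objects are never NaN here.
    return isinstance(value, float) and math.isnan(value)


def validate_two_pass(values):
    # Pass 1: lenient check that accepts strings and any null-like value.
    lenient_ok = all(
        isinstance(v, str) or v is NA or v is None or _is_nan(v) for v in values
    )
    # Pass 2: walk the data again to reject NaN / None.
    no_nan_like = not any(v is None or _is_nan(v) for v in values)
    return lenient_ok and no_nan_like


def validate_single_pass(values):
    # One loop: a value is valid only if it is a string or the NA sentinel.
    for v in values:
        if not (isinstance(v, str) or v is NA):
            return False
    return True
```

Both functions give the same answer; the second just folds the NaN rejection into the null check so the data is only walked once.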

cc @tsvikas.

TomAugspurger · 2020-01-13T19:24:24Z

pandas/_libs/lib.pyx


    def __cinit__(self, Py_ssize_t n, dtype dtype=np.dtype(np.object_),
-                  bint skipna=False):
+                  bint skipna=False,
+                  bint na_only=False):


Alternatively, if people don't like this, I can make a new validator that's specific to StringArray's checking.

    cdef class StringArrayValidator(StringValidator):
        cdef bint is_valid_null(self, object value) except -1:
            return value is C_NA

Do people have a preference?
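For comparison, the two options (an na_only flag on the existing validator versus a dedicated subclass) might look roughly like this in plain Python. This is only a sketch: the class names with the `Sketch` suffix, the `validate` method, and the `NA` sentinel are made up for illustration and are not the Cython API.

```python
import math

NA = object()  # hypothetical stand-in for the pandas NA singleton


def _is_nan_or_none(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


class StringValidatorSketch:
    """Option 1: one validator class with an na_only flag."""

    def __init__(self, na_only=False):
        self.na_only = na_only

    def is_valid_null(self, value):
        if self.na_only:
            return value is NA
        return value is NA or _is_nan_or_none(value)

    def validate(self, values):
        # Single pass: every value must be a string or a valid null.
        return all(isinstance(v, str) or self.is_valid_null(v) for v in values)


class StringArrayValidatorSketch(StringValidatorSketch):
    """Option 2: a subclass that overrides only the null check."""

    def is_valid_null(self, value):
        return value is NA
```

Either way the strict behaviour is the same; the difference is whether callers opt in with a keyword or by picking a different validator class.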

TomAugspurger · 2020-01-13T19:27:15Z

Actually, we don't use is_string_array in all that many places, and I don't think it's public, so perhaps I can just make the change to exclude NaN / None. Will explore.


jorisvandenbossche · 2020-01-13T19:52:57Z

pandas/core/arrays/numpy_.py

        return cls(result)

+    @staticmethod
+    def _from_sequence_finalize(values, copy):
+        return values


Why is the different name needed, and can't it just be _from_sequence?

This lets StringArray customize a bit of the logic in the middle of _from_sequence. Without this, StringArray._from_sequence raised inside the super()._from_sequence call, since __init__ would raise. This lets us still reuse PandasArray._from_sequence.

Happy to have suggestions for a better name. I didn't put any thought into this one. _finalize isn't the best since it's not happening after _from_sequence, it's in the middle.

So the call chain is roughly

StringArray._from_sequence -> PandasArray._from_sequence (via the super) -> StringArray._from_sequence_finalize -> StringArray._from_sequence (after the super finishes)

hopefully that makes sense.
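A stripped-down sketch of that call chain, using plain lists instead of ndarrays. The `Sketch` class names and the `NA` sentinel are hypothetical; only the method names `_from_sequence` and `_from_sequence_finalize` come from the diff above, and the `copy` argument is accepted but unused here because the sketch always copies.

```python
import math

NA = object()  # hypothetical stand-in for the pandas NA singleton


def _is_nan_or_none(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


class PandasArraySketch:
    def __init__(self, values):
        self._values = values

    @classmethod
    def _from_sequence(cls, scalars, copy=False):
        values = list(scalars)  # always a new list in this sketch
        # Hook called in the middle, before the (possibly strict) constructor.
        values = cls._from_sequence_finalize(values, copy)
        return cls(values)

    @staticmethod
    def _from_sequence_finalize(values, copy):
        return values


class StringArraySketch(PandasArraySketch):
    def __init__(self, values):
        # Strict constructor: NaN / None are rejected, only str and NA pass.
        for v in values:
            if not (isinstance(v, str) or v is NA):
                raise ValueError("StringArray requires strings or NA")
        super().__init__(values)

    @staticmethod
    def _from_sequence_finalize(values, copy):
        # Convert NaN-likes to NA so the strict constructor accepts them.
        return [NA if _is_nan_or_none(v) else v for v in values]

    @classmethod
    def _from_sequence(cls, scalars, copy=False):
        result = super()._from_sequence(scalars, copy=copy)
        # Anything StringArray-specific after the super call would go here.
        return result
```

Because `cls` is still the subclass inside the base classmethod, `cls._from_sequence_finalize` dispatches to the StringArray version, which is what makes the chain bounce between the two classes.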

Ah, I didn't notice that it still called super in the middle, understand now.

But I would personally just not call super and implement everything in StringArray._from_sequence. Yes, this gives a little bit of duplication (but really not a lot IMO), but I think that is better than the additional method name and a harder-to-understand call chain.

Renamed to _coerce_from_sequence_values

OK, I've inlined things. Apologies for the force push, I'm not sure what happened.
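The inlined shape, with no super call and no hook, could look something like the following. Again this is a list-based sketch under assumptions, not the code in the PR: `StringArraySketch` and `NA` are hypothetical, and the sketch always copies rather than honouring a copy flag.

```python
import math

NA = object()  # hypothetical stand-in for the pandas NA singleton


class StringArraySketch:
    def __init__(self, values):
        # Strict constructor: a single pass that allows only str and NA.
        for v in values:
            if not (isinstance(v, str) or v is NA):
                raise ValueError("StringArray requires strings or NA")
        self._values = values

    @classmethod
    def _from_sequence(cls, scalars):
        # Work on a fresh list so the caller's input is never modified.
        values = list(scalars)
        for i, v in enumerate(values):
            if v is None or (isinstance(v, float) and math.isnan(v)):
                values[i] = NA
        # Values are already valid here, so the strict constructor succeeds.
        return cls(values)
```

The NaN-to-NA conversion now happens before construction, so there is never a temporarily invalid array, and the whole flow is readable in one method.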

jorisvandenbossche · 2020-01-13T19:54:32Z

pandas/tests/arrays/string_/test_string.py

+
+
+def test_from_sequnce_no_mutate():
+    a = np.array(["a", pd.NA], dtype=object)


use np.nan or None here? Otherwise the original won't be different from the potentially mutated one?

Whoops, yes. That's what I intended.
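To spell out why the input needs a NaN or None: if the input already holds only NA, there is nothing to replace, so a mutating implementation would pass the test unnoticed. A small list-based sketch (all names here are hypothetical, and `NA` is a stand-in sentinel):

```python
NA = object()  # hypothetical stand-in for the pandas NA singleton


def from_sequence_copying(values):
    # Correct behaviour: convert None to NA on a copy.
    result = list(values)
    for i, v in enumerate(result):
        if v is None:
            result[i] = NA
    return result


def from_sequence_mutating(values):
    # Buggy behaviour: convert None to NA in the caller's list.
    for i, v in enumerate(values):
        if v is None:
            values[i] = NA
    return values


def input_unchanged(func, values):
    # Snapshot the input, run the conversion, and compare afterwards.
    snapshot = list(values)
    func(values)
    return values == snapshot
```

With `["a", NA]` as input, `input_unchanged` reports True even for the mutating version; with `["a", None]` it catches the mutation.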

Closes pandas-dev#30966


jorisvandenbossche

Looks good to me

jreback · 2020-01-14T12:39:51Z

thanks @TomAugspurger


TomAugspurger added this to the 1.0.0 milestone Jan 13, 2020

TomAugspurger added the ExtensionArray, Missing-data, and Strings labels Jan 13, 2020


WillAyd requested changes Jan 13, 2020


jorisvandenbossche reviewed Jan 13, 2020


TomAugspurger added 10 commits January 13, 2020 14:16

Disallow NaN in StringArray constructor

6f3e367

Closes pandas-dev#30966

change it

1e62f26

update test

5e720ce

update test

17b6f10

test NaT

0e2468a

fixup

21e6e59

fixup

f680db9

inline from_sequence

fad5b7b

fixup docstring

4c2416d

Merge remote-tracking branch 'upstream/master' into 30966-str-validate

bbe2196

TomAugspurger force-pushed the 30966-str-validate branch from b34b600 to bbe2196 Compare January 13, 2020 22:42

jreback requested changes Jan 14, 2020


TomAugspurger added 2 commits January 14, 2020 06:05

Merge remote-tracking branch 'upstream/master' into 30966-str-validate

d6da25f

fixups

08e049d

jorisvandenbossche changed the title ~~Disallow NaN in StringArray constructor~~ API: Disallow NaN in StringArray constructor Jan 14, 2020

jorisvandenbossche approved these changes Jan 14, 2020


jreback approved these changes Jan 14, 2020


jreback merged commit 3471270 into pandas-dev:master Jan 14, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 14, 2020

Backport PR pandas-dev#30980: API: Disallow NaN in StringArray constructor

29b7350

meeseeksmachine mentioned this pull request Jan 14, 2020

Backport PR #30980 on branch 1.0.x (API: Disallow NaN in StringArray constructor) #31000

Merged

simonjayhawkins pushed a commit that referenced this pull request Jan 14, 2020

Backport PR #30980: API: Disallow NaN in StringArray constructor (#31000)

8140466

Co-authored-by: Tom Augspurger <[email protected]>

lithomas1 mentioned this pull request May 10, 2021

API: allow nan-likes in StringArray constructor #41412

Closed

4 tasks
