Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
Changes from 5 commits
63a7fc5
0eee625
607b95e
bca157d
79eb3b4
c063298
ab96aa4
bae8d65
31f1c33
cbd0820
864c166
d3ad7b0
028dc2c
1750bcb
7f4baf7
fdf1454
fe6fce6
70325d4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit a custom check (and we don't do anything similarly for other types), but given I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in case of EAs), I would prefer keeping this in at least on the short term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to return self._na_value here to make things explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method (which has now been refactored to _str_map_nan_semantics) is slightly different in StringArray vs ArrowStringArray and im trying to sort out whether the differences are intentional or just cosmetic. could use some help from the author
the Arrow version handles this doing the check before map_infer_mask and changing the dtype passed there (also doesn't check for na_value_is_na)
the Arrow version sets na_value = np.nan/False on the analogue to L837/839 (again without a na_value_is_na check)
the Arrow version doesn't have the L831
convert = convert and not np.all(mask)
; AFAICT no existing tests rely on that lineThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, my claim in 3 about it not mattering was incorrect. it matters for test_contains_nan and test_empty_str_methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although an author who wrote this code almost 4 months ago ;)
Will take a closer look at it later today, but one quick find is that there were changes to the arrow version after I started this PR, so I might not have taken those into account in this version, eg #58483
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ive convinced myself that the arrow version doesnt need the na_value_is_na check bc it is always True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and that 'convert' is never used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this raise an error or not be possible in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some str methods are weird (i.e. what's In the comment here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And not only weird, there are some methods that genuinely return an object dtype (of course because of lack of a better proper dtype, but right not with the default dtype this is object dtype). For example
ser.str.split()
returns list elements.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. The list-returning functions are more good use cases for PDEP-13 #58455