Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
Changes from 5 commits
4cb60e6
d242f2d
d39ab2c
2367810
9166d3b
8760705
d5b3fec
2c657df
647a6c2
0596fd7
c5a19c5
99680c9
69a6cc1
bd147ba
830275f
214e524
c9ba03c
7425536
68ac391
5cfa97a
74dbf96
3985943
3bda421
0c108a4
523e24c
279624c
80d231e
c5ced5a
459812c
d707b6b
71ccf24
daaac06
46626d1
3677bfa
42d382f
4fb1a0d
5d4eac1
15efb2e
b53cfe0
b7db53f
3399f08
e365f01
71d1e6c
9e23c35
c69a611
64b3206
d83a4ff
ef38660
aef1162
6247a5b
a6d066c
8adb08d
3ad0638
56714c9
6a1cc2b
1761a84
3e26baa
6b470b1
2ec6de0
a0b7a70
d9dcd20
4a37470
1d59c7a
e57c850
51f1b1d
fc95c06
ef02a43
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the docstring above for this new keyword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in d9dcd20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to delay looking up this option and allow the StringDtype to be initialized with
storage=None
. That could eg allow doings.astype("string")
preserving the original dtype if the series already is of string dtype (regardless of the default).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this.
since the dtype change is user facing this change worthwhile, there is still more to do as many tests use
dtype="string"
but does potentially make this PR harder to review. I will carry on and do the others if doing this makes sense.Have added
string[pyarrow]
to this fixture. This gives some additional failures for outstanding parts of ArrowStringArray. It maybe that to keep this PR more cleanly scoped, that we defer adding that for a follow-upThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I don't think this comment is resolved, so "unresolved" it in the UI)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure can you open an issue explaining the api you'd like. I have updated _from_seqenece to accept the string "string", but if you are writing code if you do a comparison as a conditional, you expect the storage type to be fixed. and not then do something different. If the only places where this could be usefull is astype, we could put the logic there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
global default -> python (at least currently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. Not sure what the behavior should be if we delay looking up the storage. that could complicate things so will leave for a follow-on after better understanding of what we want, and when the lookup would be final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue about this is #36126. Tom had a PR #36136 with some discussion about it. Would need to look back, but I think the conclusion was indeed that StringDtype can just make this a normal method for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am think that we should add a
__str__
that doesn't include the storage. This is used in eg the output ofdf.dtypes
, and IMO for most users the fact that it's using "python" objects is an implementation detail, and should not be that visible (you can always check the actual dtype object to see what kind of storage it's using)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this would be fine, e.g.
string
is de-factostring[python]
, though this means we can essentially never change it andstring[pyarrow]
for non-default storage types. maybestring[object]
is more instructive?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, what about the global setting. the pyarrow backed string array should not be a second class citizen. At the moment if the global setting is changed , all tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What
"string"
means depends on the global setting, which is by default (and not de-facto)"string[python]"
.To be clear, I didn't want to suggest that "string" would always mean "string[python]". I just think that in the str representation (eg used in
df.dtypes
), we shouldn't show that detail.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't change to always ArrowStringArray, but still depend on what
self.storage
is (even when the data is coming from arrow, we still want to create python objects-backed string array if that's the default, I think)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We indeed don't need to lazily import pyarrow anymore for the sake of the dtype object. But if we want to expose the ArrowStringArray in
pandas.core.arrays
like all other arrays, we are still going to need to keep this behind an ImportError ..(not sure if we want to expose ArrowStringArray, but until that's decided I would maybe leave those imports as is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no ArrowStringDtype