Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ENH: Add lazy copy to astype #50802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add lazy copy to astype #50802
Changes from 20 commits
ed2e322
1967a2a
45ea989
a2e1e53
8f29fda
55fc3bc
a1847e8
75bdf1d
528b0fb
f29a0be
35b9fa4
664c31b
49a2029
e1dc971
894aaa8
273b5aa
31557be
c0f5003
031aaa5
174e2d3
876d293
ff37e58
5492c10
620181c
4f6f954
95a8296
7640c36
469383d
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think this is not always guaranteed to be a view? (but of course safer to return True than incorrectly assume it is always a copy)
is_string_dtype
also returns true for generic "object", and converting object to string is not necessarily no-copyThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this branch, I see the following wrong behaviour in case you have mixed objects:
Because of not taking a copy, the
ensure_string_array
actually mutates the original values in place.For the case of "object -> some extension dtype" casting, we should probably always do
copy=True
, because I don't think we can rely on_from_sequence
to do the correct thing (although for this specific case I also don't think thatStringArray._from_sequence(arr, copy=False)
should mutatearr
in place. I would expect to only not make a copy if it's not needed, i.e. if it can just view the original data and no modification needs to be done)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #51073 to address the inplace modification.
You are correct, it is not guaranteed to be a no-copy op, but it could be and I couldn't figure out a more precise check. We can still optimise in a follow-up to get this stricter, for now I was aiming to get as many cases as possible without making it overly complex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining the True and to note something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our own extension types fall in this category? (eg casting object column of Period objects to period dtype, although this specific example is actually never a view)
Rather for a follow-up, would it be worth to have some more custom logic here for our own EAs? Both IntervalDtype and PeriodDtype have kind of
O
, but I think neither of them can ever cast from object dtype without making a copy.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep would prefer to do this as a follow up with specific tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to note for a follow-up, I assume now with multiple resolutions being supported for datetime64, we should maybe check those, since if you have different units, this is never a view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am wondering: what is the purpose of this check? Because if you cast datetime64[ns]->datetime64[ns], that's already covered by the equal dtype case. Mixing datetime and timedelta is something we disallow explicitly (numpy allows it). So this is for DatetimeTZDtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep tz aware stuff here. Removing this check causes test failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a bunch of additional tests and added a unit check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add back the part of the comment that you had before that this is for example for nullable dtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this will currently only work for eg Int64->Int32 (both nullable), and not catch the mixed of numpy/nullable dtype (eg int64 -> Int64 (view), or int64 -> Int32 (copy))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I missed this case. Will adjust accordingly, since this should work out of the box imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we move the astype tests to a dedicated file instead of in the middle of the other methods? My hunch is that we might need to add some more astype tests (if we specialize more for our own dtypes), and test_methods.py is already getting long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't test this consistently for all methods here, but astype seems a sufficiently complicated case (not just based on a copy(deep=False) under the hood) that it's probably good to be complete.
(same for the ones below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also explicitly check that
df2._mgr._has_no_reference(0)
? Because the shares_memory and iloc setitem test doesn't ensure we didn't incorrectly trigger CoW unnecessarilyThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is a very good check, would have made sure that I did not miss the numpy dtype thing below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so this wasn't yet covered when updating the Series(series) constructor (#49524). Could you add an explicit test for copy/view behaviour with a case like above to
copy_view/test_constructors.py
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just parametrizing the first test there should be sufficient to cover this case:
We should still add a test that ensure that if the cast requires a copy, we do not track references (to avoid a unnecessary copy later on), but that can be done later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copy case is more for your pr I guess? Adjusted the test accordingly