-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: share astype code in MaskedArray #38490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Yes, this is certainly one of the methods on the list of "to share" for masked arrays
@@ -229,6 +231,30 @@ def to_numpy( | |||
data = self._data.astype(dtype, copy=copy) | |||
return data | |||
|
|||
def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike: | |||
dtype = pandas_dtype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we know this method is only called from the subclasses, this line is not needed (each of the subclass methods already does it as well. The type annotation also indicates we already have a dtype object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type annotation also indicates we already have a dtype object)
heads up "Dtype" includes string; "DtypeObj" means a dtype object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. Can you then update the annotation to DtypeObj? Or actually, to ExtensionDtype? Or will mypy complain about that because the methods in the subclasses have a less strict annotation?
BTW, I would still remove this line
def astype(self, dtype: Dtype, copy: bool = True) -> ArrayLike: | ||
dtype = pandas_dtype(dtype) | ||
|
||
if is_dtype_equal(dtype, self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here could in theory actually use if self.dtype == dtype
since we know we have an extension dtype, and those should never raise when being compared (avoiding yet another set of checks, not sure if it would ever be significant, though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im fine with it either way. this branch was broken off from one that was trying to push some astype boilerplate into a decorator so used the is_dtype_equal version
data = self._data.astype(dtype.numpy_dtype, copy=copy) | ||
# mask is copied depending on whether the data was copied, and | ||
# not directly depending on the `copy` keyword | ||
mask = self._mask if data is self._data else self._mask.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this code myself in the past, but I am not actually sure now this is needed. AFAIK the self._data.astype
(numpy's method) will always return a copy unless dtype.numpy_dtype
is the exact same type as the data. But that case should already be covered by the if is_dtype_equal(dtype, self.dtype)
above.
(now since this is copying existing code, fine to leave this for another issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, and i think you're right about the numpy behavior
pandas/core/arrays/masked.py
Outdated
cls = dtype.construct_array_type() | ||
return cls._from_sequence(self, dtype=dtype, copy=copy) | ||
|
||
return super().astype(dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is in theory not reachable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not with the existing subclasses. this still seemed like the idiomatic thing to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally remove it then, if it's not reachable at the moment.
Then you can also remove the if isinstance(dtype, ExtensionDtype):
from the previous block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not wild about this method relying on the current behavior of the existing subclasses. better to conform to the standard patterns, even if it means a couple of extra checks that turn out to be unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also call it _astype
instead of calling super()
, to make it clearer it is a shared helper method, and not an actual (full) base implementation
(and then the method could also be properly typed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgrm merge if ok @jorisvandenbossche
also pls link to the issue & check the box |
updated per request + green |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
cc @jorisvandenbossche i dont know if this is the optimal way to do this, but it seems like there is a lot of share-ability here.