-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Change in astype leads to no longer casting to numpy string dtypes #39945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I am not familiar with the experimental data manager. What could be the reason that the tests fail there but not on azure? Additionally only the DataFrame tests seem to fail, not the series tests |
they only are run if you specify add |
Thx, this explains the behavior and the not raising in series methods. @jorisvandenbossche You added a TODO about the numpy arrays in there. Did you have any concrete plans handling this case? |
@@ -1268,7 +1268,7 @@ def soft_convert_objects( | |||
values = lib.maybe_convert_objects( | |||
values, convert_datetime=datetime, convert_timedelta=timedelta | |||
) | |||
except (OutOfBoundsDatetime, ValueError): | |||
except OutOfBoundsDatetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
# if not isinstance(applied, ExtensionArray): | ||
if not isinstance(applied, ExtensionArray): | ||
if issubclass(applied.dtype.type, (str, bytes)): | ||
applied = np.array(applied, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we ought to have a sanitize_str_dtypes or something akin to sanitize_to_nanoseconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a _sanitize_str_dtypes
, but not fully sure it does exactly what we want here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now done as part of maybe_coerce_values, though that may be more heavy-weight than we want here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will also need some more tests. Eg currently we also preserve the bytes dtype on setitem (df["new_column"] == np.array(..)
), and not only for astype
.
In addition, we need to explicitly test the content of the resulting series after the astype operation (are the values converted to bytes?)
# if not isinstance(applied, ExtensionArray): | ||
if not isinstance(applied, ExtensionArray): | ||
if issubclass(applied.dtype.type, (str, bytes)): | ||
applied = np.array(applied, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a _sanitize_str_dtypes
, but not fully sure it does exactly what we want here
# if not isinstance(applied, ExtensionArray): | ||
if not isinstance(applied, ExtensionArray): | ||
if issubclass(applied.dtype.type, (str, bytes)): | ||
applied = np.array(applied, dtype=object) | ||
# # TODO not all EA operations return new EAs (eg astype) | ||
# applied = array(applied) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this commented out code was from the initial implementation at which point I only supported storing EAs in the ArrayManager, and not numpy.ndarrays. But now both are supported, so you can remove this code.
# if issubclass(dtype, (str, bytes)): | ||
# dtype = "object" | ||
y = self.apply("astype", dtype=dtype, copy=copy) # , errors=errors) | ||
return y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the commented-out code be removed?
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@phofl can you merge master to resolve conflicts and address comments |
@phofl closing as stale. reopen when ready. |
Yes will reopen when fixed, not much time right now unfortunately |
@jreback as discussed we no longer cast to numpy string dtypes after this