Skip to content

Change in astype leads to no longer casting to numpy string dtypes #39945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 20, 2021

@jreback as discussed we no longer cast to numpy string dtypes after this

@phofl phofl added Dtype Conversions Unexpected or buggy dtype conversions Strings String extension data type and string data labels Feb 20, 2021
@phofl
Copy link
Member Author

phofl commented Feb 20, 2021

I am not familiar with the experimental data manager. What could be the reason that the tests fail there but not on azure? Additionally only the DataFrame tests seem to fail, not the series tests

@jbrockmendel
Copy link
Member

What could be the reason that the tests fail there but not on azure?

they only are run if you specify add --array-manger to the pytest command, so only get run in that one build

@phofl
Copy link
Member Author

phofl commented Feb 21, 2021

Thx, this explains the behavior and the not raising in series methods.

@jorisvandenbossche You added a TODO about the numpy arrays in there. Did you have any concrete plans handling this case?

@@ -1268,7 +1268,7 @@ def soft_convert_objects(
values = lib.maybe_convert_objects(
values, convert_datetime=datetime, convert_timedelta=timedelta
)
except (OutOfBoundsDatetime, ValueError):
except OutOfBoundsDatetime:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

# if not isinstance(applied, ExtensionArray):
if not isinstance(applied, ExtensionArray):
if issubclass(applied.dtype.type, (str, bytes)):
applied = np.array(applied, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we ought to have a sanitize_str_dtypes or something akin to sanitize_to_nanoseconds

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a _sanitize_str_dtypes, but not fully sure it does exactly what we want here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done as part of maybe_coerce_values, though that may be more heavy-weight than we want here

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will also need some more tests. Eg currently we also preserve the bytes dtype on setitem (df["new_column"] == np.array(..)), and not only for astype.
In addition, we need to explicitly test the content of the resulting series after the astype operation (are the values converted to bytes?)

# if not isinstance(applied, ExtensionArray):
if not isinstance(applied, ExtensionArray):
if issubclass(applied.dtype.type, (str, bytes)):
applied = np.array(applied, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a _sanitize_str_dtypes, but not fully sure it does exactly what we want here

# if not isinstance(applied, ExtensionArray):
if not isinstance(applied, ExtensionArray):
if issubclass(applied.dtype.type, (str, bytes)):
applied = np.array(applied, dtype=object)
# # TODO not all EA operations return new EAs (eg astype)
# applied = array(applied)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this commented out code was from the initial implementation at which point I only supported storing EAs in the ArrayManager, and not numpy.ndarrays. But now both are supported, so you can remove this code.

# if issubclass(dtype, (str, bytes)):
# dtype = "object"
y = self.apply("astype", dtype=dtype, copy=copy) # , errors=errors)
return y
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the commented-out code be removed?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 19, 2021
@simonjayhawkins
Copy link
Member

@phofl can you merge master to resolve conflicts and address comments

@simonjayhawkins
Copy link
Member

@phofl closing as stale. reopen when ready.

@phofl
Copy link
Member Author

phofl commented Jun 16, 2021

Yes will reopen when fixed, not much time right now unfortunately

@phofl phofl deleted the 39566 branch April 27, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Stale Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Never end up with numpy string dtypes
4 participants