Change in astype leads to no longer casting to numpy string dtypes #39945

phofl · 2021-02-20T22:40:20Z

closes BUG: Never end up with numpy string dtypes #39566
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

@jreback as discussed we no longer cast to numpy string dtypes after this

phofl · 2021-02-20T23:36:24Z

I am not familiar with the experimental data manager. What could be the reason that the tests fail there but not on azure? Additionally only the DataFrame tests seem to fail, not the series tests

jbrockmendel · 2021-02-21T00:34:44Z

> What could be the reason that the tests fail there but not on azure?

they are only run if you add --array-manager to the pytest command, so they only get run in that one build

phofl · 2021-02-21T11:06:46Z

Thx, this explains the behavior and the not raising in series methods.

@jorisvandenbossche You added a TODO about the numpy arrays in there. Did you have any concrete plans for handling this case?

jbrockmendel · 2021-02-23T21:23:12Z

pandas/core/dtypes/cast.py

@@ -1268,7 +1268,7 @@ def soft_convert_objects(
            values = lib.maybe_convert_objects(
                values, convert_datetime=datetime, convert_timedelta=timedelta
            )
-        except (OutOfBoundsDatetime, ValueError):
+        except OutOfBoundsDatetime:


jbrockmendel · 2021-02-23T21:23:56Z

pandas/core/internals/array_manager.py

-            # if not isinstance(applied, ExtensionArray):
+            if not isinstance(applied, ExtensionArray):
+                if issubclass(applied.dtype.type, (str, bytes)):
+                    applied = np.array(applied, dtype=object)


I think we ought to have a sanitize_str_dtypes or something akin to sanitize_to_nanoseconds

There is a _sanitize_str_dtypes, but not fully sure it does exactly what we want here

This is now done as part of maybe_coerce_values, though that may be more heavy-weight than we want here
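For illustration, the check in the diff above can be pulled out into a small standalone function. This is only a sketch using plain numpy: the name `to_object_if_str_dtype` is hypothetical and is not the existing `_sanitize_str_dtypes` or `maybe_coerce_values` from pandas.

```python
import numpy as np


def to_object_if_str_dtype(arr: np.ndarray) -> np.ndarray:
    """Hypothetical helper (not part of pandas).

    Mirrors the check from the diff: numpy fixed-width unicode ("U") and
    bytes ("S") arrays are converted to object dtype, everything else is
    returned unchanged.
    """
    # np.str_ subclasses str and np.bytes_ subclasses bytes, so this
    # catches both kinds of numpy string dtype.
    if issubclass(arr.dtype.type, (str, bytes)):
        return np.array(arr, dtype=object)
    return arr


unicode_arr = np.array(["a", "bc"])    # fixed-width unicode dtype
bytes_arr = np.array([b"a", b"bc"])    # fixed-width bytes dtype
int_arr = np.array([1, 2])             # not a string dtype

converted_unicode = to_object_if_str_dtype(unicode_arr)
converted_bytes = to_object_if_str_dtype(bytes_arr)
unchanged = to_object_if_str_dtype(int_arr)
```

Whether such a check belongs in the array manager or in a shared sanitizing step is exactly the open question in this thread; the sketch only shows what the condition matches.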

jorisvandenbossche

I think this will also need some more tests. E.g. currently we also preserve the bytes dtype on setitem (df["new_column"] = np.array(..)), and not only for astype.
In addition, we need to explicitly test the content of the resulting series after the astype operation (are the values converted to bytes?)
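A content check of the kind asked for here could look roughly like the following. This is a numpy-only sketch, not the tests from this PR: it makes no claim about what `Series.astype` returns and only shows which element types end up in the array once a bytes-dtype array is stored as object.

```python
import numpy as np

values = np.array(["a", "bc"])

# Cast to a fixed-width numpy bytes dtype, then store as object dtype,
# as the array-manager change in this PR does for str/bytes results.
as_bytes = values.astype("S2")
as_object = np.array(as_bytes, dtype=object)

# Check the dtype and the contents, not just the dtype.
element_types = {type(v) for v in as_object}
contents = as_object.tolist()
```

Checking `element_types` and `contents` separately distinguishes "the dtype is object" from "the values really were converted to bytes", which is the distinction raised in the comment above.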

jorisvandenbossche · 2021-02-24T09:27:28Z

pandas/core/internals/array_manager.py

-            # if not isinstance(applied, ExtensionArray):
+            if not isinstance(applied, ExtensionArray):
+                if issubclass(applied.dtype.type, (str, bytes)):
+                    applied = np.array(applied, dtype=object)


There is a _sanitize_str_dtypes, but not fully sure it does exactly what we want here

jorisvandenbossche · 2021-02-24T09:28:41Z

pandas/core/internals/array_manager.py

-            # if not isinstance(applied, ExtensionArray):
+            if not isinstance(applied, ExtensionArray):
+                if issubclass(applied.dtype.type, (str, bytes)):
+                    applied = np.array(applied, dtype=object)
            #     # TODO not all EA operations return new EAs (eg astype)
            #     applied = array(applied)


I think this commented out code was from the initial implementation at which point I only supported storing EAs in the ArrayManager, and not numpy.ndarrays. But now both are supported, so you can remove this code.

jbrockmendel · 2021-03-19T15:23:32Z

pandas/core/internals/array_manager.py

+        # if issubclass(dtype, (str, bytes)):
+        #     dtype = "object"
+        y = self.apply("astype", dtype=dtype, copy=copy)  # , errors=errors)
+        return y


can the commented-out code be removed?

github-actions · 2021-04-19T00:05:47Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

simonjayhawkins · 2021-05-10T14:07:24Z

@phofl can you merge master to resolve conflicts and address comments

simonjayhawkins · 2021-06-16T13:18:06Z

@phofl closing as stale. reopen when ready.

phofl · 2021-06-16T13:43:15Z

Yes will reopen when fixed, not much time right now unfortunately

Change in astype leads to no longer casting to numpy string dtypes

2e5fb9f

phofl added the Dtype Conversions (Unexpected or buggy dtype conversions) and Strings (String extension data type and string data) labels Feb 20, 2021

Catch bytes arrays in array manager

ad91995

jbrockmendel reviewed Feb 23, 2021

jorisvandenbossche reviewed Feb 24, 2021

jbrockmendel reviewed Mar 19, 2021

github-actions bot added the Stale label Apr 19, 2021

simonjayhawkins closed this Jun 16, 2021

phofl deleted the 39566 branch April 27, 2023 19:52
