TYP: _ensure_data and infer_dtype_from_array #44292

simonjayhawkins · 2021-11-02T21:06:55Z

No description provided.

jbrockmendel · 2021-11-02T22:06:44Z

pandas/core/algorithms.py

    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes
+    - categorical[bool] without nulls -> uint8
+    - categorical[bool] with nulls -> ValueError: cannot convert float NaN to integer


is this tested/intentional?

Looks like this was changed in #41256, although further investigation is required to determine whether this is a latent bug or a regression. For now I've just updated the docstring to document the actual behavior.
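A minimal sketch of the two docstring lines under discussion, assuming the coercion is the `np.asarray(values).astype("uint8", copy=False)` call that appears in the 1.3.4 traceback further down. The helper name `coerce_bool_categorical` is hypothetical and only mirrors that single cast, not the real `_ensure_data`.

```python
import numpy as np
import pandas as pd


def coerce_bool_categorical(cat: pd.Categorical) -> np.ndarray:
    # Hypothetical helper: mirrors only the uint8 cast, not the real _ensure_data.
    return np.asarray(cat).astype("uint8", copy=False)


no_nulls = pd.Categorical([True, False, True], categories=[True, False])
with_nulls = pd.Categorical([True, False, None], categories=[True, False])

print(coerce_bool_categorical(no_nulls).dtype)  # uint8

try:
    coerce_bool_categorical(with_nulls)
except ValueError as err:
    # The missing entry materialises as float NaN, which has no integer form.
    print(f"ValueError: {err}")
```

The second case is the one the new docstring line describes: the null is not an error in the categorical itself, only in the attempt to cast its materialised values to an integer dtype.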

Categorical is fast-pathed in mode, so it does not pass through _ensure_data. That is why the regression fix in #42131 only needed the except TypeError.

In duplicated and drop_duplicates, the categorical EA is passed through _ensure_data and so raises ValueError, which is not caught by the fix in #42131.

So that will need to change, but this is a regression from 1.2.5, so it will need to be done separately so that it can be backported.
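The gap between the two exception types can be reproduced with NumPy alone. This is a sketch under assumptions: `FakeNA` is a hypothetical stand-in for `pd.NA`, the object-dtype fallback is illustrative, and `cast_catching_both` is one possible widening, not the change that was made for #44351.

```python
import numpy as np


class FakeNA:
    """Hypothetical stand-in for pd.NA: an object with no integer conversion."""


def cast_catching_type_error(values: np.ndarray) -> np.ndarray:
    # Shape of the existing handler: only TypeError triggers the fallback.
    try:
        return values.astype("uint8", copy=False)
    except TypeError:
        return values.astype(object, copy=False)


def cast_catching_both(values: np.ndarray) -> np.ndarray:
    # One possible widening; an assumption, not the actual fix.
    try:
        return values.astype("uint8", copy=False)
    except (TypeError, ValueError):
        return values.astype(object, copy=False)


with_fake_na = np.array([True, False, FakeNA()], dtype=object)
with_nan = np.array([True, False, np.nan], dtype=object)

print(cast_catching_type_error(with_fake_na).dtype)  # object

try:
    cast_catching_type_error(with_nan)
except ValueError as err:
    print(f"escaped the handler: {err}")

print(cast_catching_both(with_nan).dtype)  # object
```

An NA-like object fails integer conversion with TypeError, while a float NaN fails with ValueError, so a handler written for the first case lets the second one through.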

code sample based on test_drop_duplicates_categorical_bool

    import pandas as pd

    print(pd.__version__)
    tc = pd.Series(
        pd.Categorical(
            [True, False, True, False, pd.NA], categories=[True, False], ordered=True
        )
    )
    print(tc.duplicated())

    1.2.5
    0    False
    1    False
    2     True
    3     True
    4    False
    dtype: bool

    1.3.4
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    /tmp/ipykernel_47357/1277064552.py in <module>
          7     )
          8 )
    ----> 9 print(tc.duplicated())

    ~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/series.py in duplicated(self, keep)
       2215         dtype: bool
       2216         """
    -> 2217         res = self._duplicated(keep=keep)
       2218         result = self._constructor(res, index=self.index)
       2219         return result.__finalize__(self, method="duplicated")

    ~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/base.py in _duplicated(self, keep)
       1230         self, keep: Literal["first", "last", False] = "first"
       1231     ) -> np.ndarray:
    -> 1232         return duplicated(self._values, keep=keep)

    ~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in duplicated(values, keep)
        925     duplicated : ndarray[bool]
        926     """
    --> 927     values, _ = _ensure_data(values)
        928     return htable.duplicated(values, keep=keep)
        929

    ~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in _ensure_data(values)
        139             # i.e. all-bool Categorical, BooleanArray
        140             try:
    --> 141                 return np.asarray(values).astype("uint8", copy=False), values.dtype
        142             except TypeError:
        143                 # GH#42107 we have pd.NAs present

    ValueError: cannot convert float NaN to integer

opened #44351 and will convert this to draft till fixed.

jbrockmendel · 2021-11-02T22:07:11Z

pandas/core/algorithms.py

@@ -112,16 +112,19 @@
 # --------------- #
 def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
-    routine to ensure that our data is of the correct
-    input dtype for lower-level routines
+    Ensure values is of the correct input dtype for lower-level routines.

    This will coerce:
    - ints -> int64
    - uint -> uint64


i think the ints and uints are unchanged

i didn't yet check those. will look tomorrow.

jbrockmendel · 2021-11-02T22:07:51Z

pandas/core/algorithms.py

-        # ndarray[Any, Any]], Union[Any, ExtensionDtype]]", expected
-        # "Tuple[ndarray[Any, Any], Union[dtype[Any], ExtensionDtype]]")
-        return values  # type: ignore[return-value]
+        assert isinstance(values, np.ndarray)  # for mypy


could we potentially get here with PandasArray[complex]?

Yes. This is coded to return the values as-is, whereas it would need to either extract the underlying numpy array or, if the EA is not ndarray-backed, coerce to a numpy array. This is how it's done above for is_float_dtype.

It used to be done this way before #42197. Those changes are in released pandas so I guess there are no 3rd party EA devs with issues.

The ignore was added in that PR and is not a false positive. We can either revert those changes or as I have done here, use an assert to fail fast.

or we could leave the ignore for now and add a TODO: This is NOT a false positive

i was thinking return np.asarray(values)

yep, can also fix here.

PandasArray[complex] can't be used to test this, as the numpy array is extracted from a PandasArray. So I guess we will need to set up a dummy EA of complex dtype to test.

But it also appears that we don't have tests where integer and floating EAs pass through _ensure_data. This needs further investigation, as we either need tests or can remove the code.

so we always call extract_array(foo, extract_numpy=True) before getting here? if so, then a cast/ignore/assert seems benign.
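The two options discussed for the complex branch can be compared in isolation. This is a sketch only: `ComplexArrayLike` is a hypothetical non-ndarray container standing in for a complex-dtype EA, and the two `complex_branch_*` functions are illustrative names, not code from the PR.

```python
import numpy as np


class ComplexArrayLike:
    """Hypothetical non-ndarray container standing in for a complex-dtype EA."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype="complex128")

    def __array__(self, dtype=None, copy=None):
        # NumPy's array protocol; `copy` is accepted for NumPy 2 compatibility.
        if dtype is None:
            return self._data
        return self._data.astype(dtype)


def complex_branch_assert(values):
    assert isinstance(values, np.ndarray)  # fails fast on anything else
    return values


def complex_branch_asarray(values):
    return np.asarray(values)  # no-op for ndarrays, coerces array-likes


arr = np.array([1 + 2j, 3 - 1j])
print(complex_branch_asarray(arr) is arr)  # True

wrapped = ComplexArrayLike([1 + 2j, 3 - 1j])
print(type(complex_branch_asarray(wrapped)).__name__)  # ndarray

try:
    complex_branch_assert(wrapped)
except AssertionError:
    print("assert variant rejects the non-ndarray input")
```

If every caller already extracts the ndarray first, both variants behave identically and the assert is purely a typing aid. The `np.asarray` variant only matters for inputs that arrive without that extraction.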

github-actions · 2021-12-09T00:03:20Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2022-01-16T19:13:16Z

status here?

jreback · 2022-03-06T23:18:51Z

@simonjayhawkins status of this?

mroeschke · 2022-04-24T03:13:35Z

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge main and continue.

simonjayhawkins added 3 commits November 2, 2021 20:22

overload infer_dtype_from_array

5804098

is_complex_dtype in _ensure_data

2caa77d

update _ensure_data docstring

4f9a72d

simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Nov 2, 2021

jbrockmendel reviewed Nov 2, 2021


jreback added this to the 1.4 milestone Nov 5, 2021

simonjayhawkins mentioned this pull request Nov 8, 2021

REGR: Series.duplicated with category dtype and nulls raises ValueError #44351

Closed

3 tasks

simonjayhawkins marked this pull request as draft November 8, 2021 16:08

github-actions bot added the Stale label Dec 9, 2021

jreback removed this from the 1.4 milestone Dec 24, 2021

mroeschke closed this Apr 24, 2022
