TYP: _ensure_data and infer_dtype_from_array #44292

Closed

simonjayhawkins
Member

No description provided.

@simonjayhawkins simonjayhawkins added the Typing type annotations, mypy/pyright type checking label Nov 2, 2021
- datetimelike -> i8
- datetime64tz -> i8 (in local tz)
- categorical -> codes
- categorical[bool] without nulls -> uint8
- categorical[bool] with nulls -> ValueError: cannot convert float NaN to integer
Member

is this tested/intentional?

Member Author

Looks like this was changed in #41256, although further investigation is required to determine whether this is a latent bug or a regression. For now I have just updated the docstring to document the actual behavior.
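
For reference, a few of the coercions listed in the docstring can be reproduced with plain NumPy. This is only a sketch of the target representations, not the pandas implementation; `codes_for` is a hypothetical helper standing in for categorical codes.

```python
import numpy as np

# datetimelike -> i8: reinterpret the datetime64[ns] buffer as int64 nanoseconds
stamps = np.array(["1970-01-01T00:00:00.000000001"], dtype="datetime64[ns]")
as_i8 = stamps.view("i8")

# bool without nulls -> uint8: True/False become 1/0
flags = np.array([True, False, True])
as_uint8 = flags.astype("uint8")


def codes_for(values, categories):
    """Hypothetical stand-in for categorical codes.

    Each value maps to its position in ``categories``; missing values map to -1.
    """
    lookup = {cat: i for i, cat in enumerate(categories)}
    return np.array([lookup.get(v, -1) for v in values], dtype="int8")


codes = codes_for(["b", "a", None, "b"], categories=["a", "b"])

print(as_i8.dtype, as_uint8.dtype, codes.dtype)
```

The integer views are what the hashtable routines can consume directly, which is presumably why the docstring spells out each mapping.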

Member Author

Categorical is fast-pathed in mode, so it does not pass through _ensure_data. That is why the regression fix in #42131 only needed the except TypeError.

In duplicated and drop_duplicates, the categorical EA is passed through _ensure_data and so raises a ValueError, which is not caught by the fix in #42131.

So that will need to change, but this is a regression from 1.2.5, so it will need to be done separately so that it can be backported.

code sample based on test_drop_duplicates_categorical_bool

import pandas as pd

print(pd.__version__)
tc = pd.Series(
    pd.Categorical(
        [True, False, True, False, pd.NA], categories=[True, False], ordered=True
    )
)
print(tc.duplicated())
1.2.5
0    False
1    False
2     True
3     True
4    False
dtype: bool
1.3.4
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_47357/1277064552.py in <module>
      7     )
      8 )
----> 9 print(tc.duplicated())

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/series.py in duplicated(self, keep)
   2215         dtype: bool
   2216         """
-> 2217         res = self._duplicated(keep=keep)
   2218         result = self._constructor(res, index=self.index)
   2219         return result.__finalize__(self, method="duplicated")

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/base.py in _duplicated(self, keep)
   1230         self, keep: Literal["first", "last", False] = "first"
   1231     ) -> np.ndarray:
-> 1232         return duplicated(self._values, keep=keep)

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in duplicated(values, keep)
    925     duplicated : ndarray[bool]
    926     """
--> 927     values, _ = _ensure_data(values)
    928     return htable.duplicated(values, keep=keep)
    929 

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in _ensure_data(values)
    139             # i.e. all-bool Categorical, BooleanArray
    140             try:
--> 141                 return np.asarray(values).astype("uint8", copy=False), values.dtype
    142             except TypeError:
    143                 # GH#42107 we have pd.NAs present

ValueError: cannot convert float NaN to integer
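
The gap described above, where a handler for `TypeError` does not catch the `ValueError` raised for NaN, can be sketched with NumPy alone. Both function names are hypothetical, the object-dtype fallback is an assumption made for illustration, and the second variant is not necessarily the fix adopted in pandas.

```python
import numpy as np


def cast_catching_typeerror(values):
    """Same shape as the branch in the traceback: only TypeError is handled."""
    try:
        return np.asarray(values).astype("uint8", copy=False)
    except TypeError:
        # Assumed fallback for illustration: keep the values as object dtype.
        return np.asarray(values, dtype=object)


def cast_catching_both(values):
    """Hypothetical variant that also handles the float-NaN case."""
    try:
        return np.asarray(values).astype("uint8", copy=False)
    except (TypeError, ValueError):
        return np.asarray(values, dtype=object)


# Object array holding bools plus a float NaN, similar to what a bool
# categorical with a missing value can materialise as.
with_nan = np.array([True, False, np.nan], dtype=object)

try:
    cast_catching_typeerror(with_nan)
    raised = False
except ValueError:
    # int(float("nan")) raises ValueError, so the TypeError handler is bypassed.
    raised = True

fallback = cast_catching_both(with_nan)
print(raised, fallback.dtype)
```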

Member Author

opened #44351 and will convert this to draft till fixed.

@@ -112,16 +112,19 @@
# --------------- #
def _ensure_data(values: ArrayLike) -> np.ndarray:
"""
-routine to ensure that our data is of the correct
-input dtype for lower-level routines
+Ensure values is of the correct input dtype for lower-level routines.

This will coerce:
- ints -> int64
- uint -> uint64
Member

i think the ints and uints are unchanged

Member Author

i didn't yet check those. will look tomorrow.

# ndarray[Any, Any]], Union[Any, ExtensionDtype]]", expected
# "Tuple[ndarray[Any, Any], Union[dtype[Any], ExtensionDtype]]")
return values # type: ignore[return-value]
assert isinstance(values, np.ndarray) # for mypy
Member

could we potentially get here with PandasArray[complex]?

Member Author

Yes. This is coded to return the values as-is, whereas it would need to either extract the underlying numpy array or, if not ndarray-backed, coerce to a numpy array. This is how it is done above for is_float_dtype.

It used to be done this way before #42197. Those changes are in released pandas so I guess there are no 3rd party EA devs with issues.

The ignore was added in that PR and is not a false positive. We can either revert those changes or as I have done here, use an assert to fail fast.

Member Author

or we could leave the ignore for now and add a TODO: This is NOT a false positive

Member

i was thinking return np.asarray(values)

Member Author

yep, can also fix here.
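
A minimal sketch of the two options discussed in this thread, coercing with `np.asarray` and then asserting so that anything unexpected fails fast. `ComplexArrayLike` and `ensure_ndarray` are hypothetical names; the class is a plain array-like exposing `__array__`, not a real pandas ExtensionArray.

```python
import numpy as np


class ComplexArrayLike:
    """Hypothetical array-like with complex data (not a pandas ExtensionArray)."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when the object is passed to np.asarray.
        return np.array(self._data, dtype=dtype)


def ensure_ndarray(values):
    """Hypothetical helper: coerce to ndarray instead of returning values as-is."""
    result = np.asarray(values)
    # The assert fails fast at runtime and narrows the type for mypy.
    assert isinstance(result, np.ndarray)
    return result


values = ComplexArrayLike([1 + 2j, 3 - 1j])
out = ensure_ndarray(values)
print(type(out).__name__, out.dtype)
```

Returning `values` unchanged here would hand a non-ndarray to the caller, which is the case the `type: ignore` was hiding.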

Member Author

PandasArray[complex] can't be used to test this, as the numpy array is extracted from a PandasArray. So I guess we will need to set up a dummy EA of complex dtype to test.

But it also appears that we don't have tests where integer and floating EAs pass through _ensure_data. This needs further investigation, as we either need tests or can remove the code.

Member

so we always call extract_array(foo, extract_numpy=True) before getting here? if so, then a cast/ignore/assert seems benign.

@github-actions
Contributor

github-actions bot commented Dec 9, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 9, 2021
@jreback jreback removed this from the 1.4 milestone Dec 24, 2021
@jreback
Contributor

jreback commented Jan 16, 2022

status here?

@jreback
Contributor

jreback commented Mar 6, 2022

@simonjayhawkins status of this?

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge main and continue.

@mroeschke mroeschke closed this Apr 24, 2022