TYP: _ensure_data and infer_dtype_from_array #44292
    - datetimelike -> i8
    - datetime64tz -> i8 (in local tz)
    - categorical -> codes
    - categorical[bool] without nulls -> uint8
    - categorical[bool] with nulls -> ValueError: cannot convert float NaN to integer
is this tested/intentional?
looks like this was changed in #41256, although further investigation is required on whether this is a latent bug/regression. Just updated the docstring for now to document the actual behavior.
categorical is fast-pathed in `mode`, so it does not pass through `_ensure_data`. So the regression fix in #42131 only required the `except TypeError` to fix.
In `duplicated` and `drop_duplicates` the categorical EA is passed through `_ensure_data` and so raises ValueError, which is not caught by the fix in #42131.
So will need to change that, but this is a regression from 1.2.5, so it will need to be done separately so that it can be backported.
code sample based on `test_drop_duplicates_categorical_bool`:
import pandas as pd

print(pd.__version__)
tc = pd.Series(
    pd.Categorical(
        [True, False, True, False, pd.NA], categories=[True, False], ordered=True
    )
)
print(tc.duplicated())
1.2.5
0 False
1 False
2 True
3 True
4 False
dtype: bool
1.3.4
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_47357/1277064552.py in <module>
7 )
8 )
----> 9 print(tc.duplicated())
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/series.py in duplicated(self, keep)
2215 dtype: bool
2216 """
-> 2217 res = self._duplicated(keep=keep)
2218 result = self._constructor(res, index=self.index)
2219 return result.__finalize__(self, method="duplicated")
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/base.py in _duplicated(self, keep)
1230 self, keep: Literal["first", "last", False] = "first"
1231 ) -> np.ndarray:
-> 1232 return duplicated(self._values, keep=keep)
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in duplicated(values, keep)
925 duplicated : ndarray[bool]
926 """
--> 927 values, _ = _ensure_data(values)
928 return htable.duplicated(values, keep=keep)
929
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in _ensure_data(values)
139 # i.e. all-bool Categorical, BooleanArray
140 try:
--> 141 return np.asarray(values).astype("uint8", copy=False), values.dtype
142 except TypeError:
143 # GH#42107 we have pd.NAs present
ValueError: cannot convert float NaN to integer
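The traceback suggests the missing value reaches the integer cast as a float NaN, which raises ValueError rather than the TypeError that the handler expects for pd.NA. A minimal pure-Python sketch of that mismatch follows. It does not call pandas or numpy; `NASentinel` and `coerce_to_uint8` are hypothetical stand-ins, and `int(...)` only approximates the element-wise cast.

```python
class NASentinel:
    """Hypothetical stand-in for pd.NA: it defines no __int__ or __index__."""


def coerce_to_uint8(values):
    # Mirrors the shape of the try/except shown in the traceback:
    # only TypeError is handled, so any other exception propagates.
    try:
        return [int(v) for v in values]
    except TypeError:
        return None  # placeholder for the real fallback path


# All-bool input converts cleanly.
print(coerce_to_uint8([True, False]))

# An NA-like object makes int() raise TypeError, which the handler catches.
print(coerce_to_uint8([True, NASentinel()]))

# A float NaN makes int() raise ValueError, which escapes the handler.
try:
    coerce_to_uint8([True, float("nan")])
except ValueError as err:
    print(type(err).__name__, err)
```

Under these assumptions, catching ValueError as well (or handling nulls before the cast) would cover the NaN case; which fix is right for pandas is left to the follow-up issue.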
opened #44351 and will convert this to draft till fixed.
@@ -112,16 +112,19 @@
# --------------- #
def _ensure_data(values: ArrayLike) -> np.ndarray:
    """
    routine to ensure that our data is of the correct
    input dtype for lower-level routines
    Ensure values is of the correct input dtype for lower-level routines.

    This will coerce:
    - ints -> int64
    - uint -> uint64
i think the ints and uints are unchanged
i didn't yet check those. will look tomorrow.
# ndarray[Any, Any]], Union[Any, ExtensionDtype]]", expected
# "Tuple[ndarray[Any, Any], Union[dtype[Any], ExtensionDtype]]")
return values  # type: ignore[return-value]
assert isinstance(values, np.ndarray)  # for mypy
could we potentially get here with PandasArray[complex]?
yes. this is coded to return the values, whereas it would need to either extract the underlying numpy array or, if not ndarray-backed, coerce to a numpy array. This is how it's done above for `is_float_dtype`.
It used to be done this way before #42197. Those changes are in released pandas, so I guess there are no 3rd party EA devs with issues.
The ignore was added in that PR and is not a false positive. We can either revert those changes or, as I have done here, use an assert to fail fast.
or we could leave the ignore for now and add a TODO: This is NOT a false positive
i was thinking `return np.asarray(values)`
yep, can also fix here.
`PandasArray[complex]` can't be used to test, as the numpy array is extracted from a PandasArray. So I guess we will need to set up a dummy EA of complex dtype to test.
But it also appears that we don't have tests where integer and floating EAs pass through `_ensure_data`. Need to investigate this further, as we either need tests or can remove code.
so we always call `extract_array(foo, extract_numpy=True)` before getting here? if so, then a cast/ignore/assert seems benign.
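The two options raised in this thread (fail fast with an assert, or coerce as in `return np.asarray(values)`) can be sketched without pandas. This is only an analogy: a plain `list` plays the role of `np.ndarray`, `list(...)` plays the role of `np.asarray(...)`, and `Wrapped`, `ensure_list_assert` and `ensure_list_coerce` are hypothetical names, not pandas APIs.

```python
from typing import Iterator, List, Union


class Wrapped:
    """Hypothetical stand-in for an extension array that was not unwrapped."""

    def __init__(self, data: List[complex]) -> None:
        self._data = data

    def __iter__(self) -> Iterator[complex]:
        return iter(self._data)


def ensure_list_assert(values: Union[List[complex], Wrapped]) -> List[complex]:
    # Option 1: fail fast. The assert also narrows the type for a checker,
    # so no "type: ignore" is needed on the return.
    assert isinstance(values, list)
    return values


def ensure_list_coerce(values: Union[List[complex], Wrapped]) -> List[complex]:
    # Option 2: coerce. Note that list() always copies, unlike np.asarray,
    # which can return an existing ndarray unchanged.
    return list(values)


plain = [1 + 2j, 3 + 4j]
wrapped = Wrapped(plain)

print(ensure_list_assert(plain) is plain)    # already the expected type
print(ensure_list_coerce(wrapped) == plain)  # wrapper is converted
try:
    ensure_list_assert(wrapped)
except AssertionError:
    print("assert rejected the wrapped input")
```

If callers always unwrap first, the assert never fires and is effectively documentation for the type checker; if they do not, coercion keeps the function working while the assert surfaces the gap immediately.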
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
status here?
@simonjayhawkins status of this?
Thanks for the pull request, but it appears to have gone stale. Feel free to reopen when you have time to merge main and continue.
No description provided.