API: Infer extension types in array #29799
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -303,6 +303,38 @@ The following methods now also correctly output values for unobserved categories | |
|
||
df.groupby(["cat_1", "cat_2"], observed=False)["value"].count() | ||
|
||
:meth:`pandas.array` inference changes | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:meth:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`): | ||
|
||
1. String data (including missing values) now returns a :class:`arrays.StringArray`. | ||
2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`. | ||
3. Boolean data (including missing values) now returns the new :class:`arrays.BooleanArray` | ||
|
||
*pandas 0.25.x* | ||
Side issue: we are pretty inconsistent about showing the previous version in the whatsnew. |
||
|
||
.. code-block:: python | ||
|
||
>>> pd.array(["a", None]) | ||
<PandasArray> | ||
['a', None] | ||
Length: 2, dtype: object | ||
|
||
>>> pd.array([1, None]) | ||
<PandasArray> | ||
[1, None] | ||
Length: 2, dtype: object | ||
|
||
|
||
*pandas 1.0.0* | ||
|
||
.. ipython:: python | ||
|
||
pd.array(["a", None]) | ||
pd.array([1, None]) | ||
|
||
As a reminder, you can specify the ``dtype`` to disable all inference. | ||
|
||
By default :meth:`Categorical.min` now returns the minimum instead of np.nan | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
@@ -408,7 +440,6 @@ Other API changes | |
- :meth:`Series.dropna` has dropped its ``**kwargs`` argument in favor of a single ``how`` parameter. | ||
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`) | ||
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`) | ||
- | ||
|
||
|
||
.. _whatsnew_1000.api.documentation: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -94,10 +94,19 @@ def array( | |
:class:`pandas.Period` :class:`pandas.arrays.PeriodArray` | ||
:class:`datetime.datetime` :class:`pandas.arrays.DatetimeArray` | ||
:class:`datetime.timedelta` :class:`pandas.arrays.TimedeltaArray` | ||
:class:`int` :class:`pandas.arrays.IntegerArray` | ||
:class:`str` :class:`pandas.arrays.StringArray` | ||
:class:`bool` :class:`pandas.arrays.BooleanArray` | ||
============================== ===================================== | ||
|
||
For all other cases, NumPy's usual inference rules will be used. | ||
|
||
.. versionchanged:: 1.0.0 | ||
|
||
Pandas infers nullable-integer dtype for integer data, | ||
string dtype for string data, and nullable-boolean dtype | ||
for boolean data. | ||
|
||
copy : bool, default True | ||
Whether to copy the data, even if not necessary. Depending | ||
on the type of `data`, creating the new array may require | ||
|
@@ -154,14 +163,6 @@ def array( | |
['a', 'b'] | ||
Length: 2, dtype: str32 | ||
|
||
Or use the dedicated constructor for the array you're expecting, and | ||
wrap that in a PandasArray | ||
|
||
>>> pd.array(np.array(['a', 'b'], dtype='<U1')) | ||
<PandasArray> | ||
['a', 'b'] | ||
Length: 2, dtype: str32 | ||
|
||
Finally, Pandas has arrays that mostly overlap with NumPy | ||
|
||
* :class:`arrays.DatetimeArray` | ||
|
@@ -184,20 +185,28 @@ def array( | |
|
||
Examples | ||
-------- | ||
If a dtype is not specified, `data` is passed through to | ||
:meth:`numpy.array`, and a :class:`arrays.PandasArray` is returned. | ||
If a dtype is not specified, pandas will infer the best dtype from the values. | ||
See the description of `dtype` for the types pandas infers for. | ||
|
||
>>> pd.array([1, 2]) | ||
<PandasArray> | ||
<IntegerArray> | ||
[1, 2] | ||
Length: 2, dtype: int64 | ||
Length: 2, dtype: Int64 | ||
|
||
Or the NumPy dtype can be specified | ||
>>> pd.array([1, 2, np.nan]) | ||
<IntegerArray> | ||
[1, 2, NaN] | ||
Length: 3, dtype: Int64 | ||
|
||
>>> pd.array([1, 2], dtype=np.dtype("int32")) | ||
<PandasArray> | ||
[1, 2] | ||
Length: 2, dtype: int32 | ||
>>> pd.array(["a", None, "c"]) | ||
<StringArray> | ||
['a', nan, 'c'] | ||
Length: 3, dtype: string | ||
|
||
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")]) | ||
<PeriodArray> | ||
['2000-01-01', '2000-01-01'] | ||
Length: 2, dtype: period[D] | ||
|
||
You can use the string alias for `dtype` | ||
|
||
|
@@ -212,29 +221,24 @@ def array( | |
[a, b, a] | ||
Categories (3, object): [a < b < c] | ||
|
||
Because omitting the `dtype` passes the data through to NumPy, | ||
a mixture of valid integers and NA will return a floating-point | ||
NumPy array. | ||
If pandas does not infer a dedicated extension type a | ||
:class:`arrays.PandasArray` is returned. | ||
Should we mention that this can still change in the future? (e.g. that more types start to get inferred, so basically that you should not rely on the fact of |
||
|
||
>>> pd.array([1, 2, np.nan]) | ||
>>> pd.array([1.1, 2.2]) | ||
<PandasArray> | ||
[1.0, 2.0, nan] | ||
Length: 3, dtype: float64 | ||
|
||
To use pandas' nullable :class:`pandas.arrays.IntegerArray`, specify | ||
the dtype: | ||
[1.1, 2.2] | ||
Length: 2, dtype: float64 | ||
|
||
>>> pd.array([1, 2, np.nan], dtype='Int64') | ||
<IntegerArray> | ||
[1, 2, NaN] | ||
Length: 3, dtype: Int64 | ||
As mentioned in the "Notes" section, new extension types may be added | ||
in the future (by pandas or 3rd party libraries), causing the return | ||
value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype` | ||
as a NumPy dtype if you need to ensure there's no future change in | ||
behavior. | ||
|
||
Pandas will infer an ExtensionArray for some types of data: | ||
|
||
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")]) | ||
<PeriodArray> | ||
['2000-01-01', '2000-01-01'] | ||
Length: 2, dtype: period[D] | ||
>>> pd.array([1, 2], dtype=np.dtype("int32")) | ||
<PandasArray> | ||
[1, 2] | ||
Length: 2, dtype: int32 | ||
|
||
`data` must be 1-dimensional. A ValueError is raised when the input | ||
has the wrong dimensionality. | ||
|
@@ -246,21 +250,26 @@ def array( | |
""" | ||
from pandas.core.arrays import ( | ||
period_array, | ||
BooleanArray, | ||
IntegerArray, | ||
IntervalArray, | ||
PandasArray, | ||
DatetimeArray, | ||
TimedeltaArray, | ||
StringArray, | ||
) | ||
|
||
if lib.is_scalar(data): | ||
msg = "Cannot pass scalar '{}' to 'pandas.array'." | ||
raise ValueError(msg.format(data)) | ||
|
||
data = extract_array(data, extract_numpy=True) | ||
|
||
if dtype is None and isinstance(data, ABCExtensionArray): | ||
if dtype is None and isinstance( | ||
|
||
data, (ABCSeries, ABCIndexClass, ABCExtensionArray) | ||
): | ||
dtype = data.dtype | ||
|
||
data = extract_array(data, extract_numpy=True) | ||
|
||
|
||
# this returns None for not-found dtypes. | ||
if isinstance(dtype, str): | ||
dtype = registry.find(dtype) or dtype | ||
|
@@ -270,7 +279,7 @@ def array( | |
return cls._from_sequence(data, dtype=dtype, copy=copy) | ||
|
||
if dtype is None: | ||
inferred_dtype = lib.infer_dtype(data, skipna=False) | ||
inferred_dtype = lib.infer_dtype(data, skipna=True) | ||
So my issue with this PR is that it is duplicating a lot of logic that is already held here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/lib.pyx#L1948, so now we have 2 places with slightly different ways of doing things. This routine is slightly more 'high-level', but maybe_convert_objects is way more used internally. So how to reconcile these things?

I'm not familiar with
Should we update I do see the similarity in purpose though. They're both for taking potentially untyped things and converting them to a typed array.

Probably not in the short term. Best guess for de-duplication for the discussed functions will look something like:
|
||
if inferred_dtype == "period": | ||
try: | ||
return period_array(data, copy=copy) | ||
|
@@ -298,7 +307,14 @@ def array( | |
# timedelta, timedelta64 | ||
return TimedeltaArray._from_sequence(data, copy=copy) | ||
|
||
# TODO(BooleanArray): handle this type | ||
elif inferred_dtype == "string": | ||
return StringArray._from_sequence(data, copy=copy) | ||
|
||
elif inferred_dtype == "integer": | ||
return IntegerArray._from_sequence(data, copy=copy) | ||
|
||
|
||
elif inferred_dtype == "boolean": | ||
return BooleanArray._from_sequence(data, copy=copy) | ||
|
||
# Pandas overrides NumPy for | ||
# 1. datetime64[ns] | ||
|
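The dispatch added in this hunk can be sketched in plain Python. This is only an illustration under stated assumptions: `infer_kind`, `choose_array`, and `ARRAY_FOR_KIND` are hypothetical names, and the label logic is a simplified stand-in for `lib.infer_dtype(data, skipna=True)`, which recognizes many more kinds than the three handled here.

```python
import math


def infer_kind(values):
    """Return a coarse label for the non-missing values (hypothetical helper)."""
    kinds = set()
    for value in values:
        # Skip missing values, mirroring skipna=True in the diff.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        # bool is a subclass of int, so it has to be checked first.
        if isinstance(value, bool):
            kinds.add("boolean")
        elif isinstance(value, int):
            kinds.add("integer")
        elif isinstance(value, str):
            kinds.add("string")
        else:
            kinds.add("other")
    # Anything that is not exactly one kind is treated as "mixed".
    return kinds.pop() if len(kinds) == 1 else "mixed"


# Labels mapped to the array class names used in the diff (names only, no pandas import).
ARRAY_FOR_KIND = {
    "string": "StringArray",
    "integer": "IntegerArray",
    "boolean": "BooleanArray",
}


def choose_array(values):
    """Pick an array class name, falling back to PandasArray."""
    return ARRAY_FOR_KIND.get(infer_kind(values), "PandasArray")
```

Because missing values are skipped before the label is chosen, `[1, None]` is labelled as integer data rather than as a mixed sequence, which is the effect the switch from `skipna=False` to `skipna=True` is meant to have.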
This is my attempt to explain the inconsistency between pd.array and pd.Series we're introducing. It's not ideal, but I think it's the right behavior for now.

Eventually we want to share code between these (and ideally also the Index constructor), right?

In the issue (#29791), I mentioned for that reason the idea of a keyword to control this preference between old or new dtypes.
(but we can also introduce that at the moment we want to share code if that seems useful then)
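The inconsistency between pd.array and pd.Series discussed in this thread can be checked with a short sketch. It assumes a pandas version that includes this change (1.0.0 or later); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# pd.array infers the nullable integer extension dtype for integer data with a missing value.
arr = pd.array([1, None])
print(arr.dtype)

# pd.Series keeps the NumPy-based inference and falls back to float64 for the same input.
ser = pd.Series([1, None])
print(ser.dtype)

# Passing an explicit NumPy dtype to pd.array turns the inference off.
plain = pd.array([1, 2], dtype=np.dtype("int32"))
print(plain.dtype)
```

Passing `dtype` explicitly is the option the whatsnew entry and the docstring both point to for code that should not depend on what is inferred.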