API: Infer extension types in array #29799
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -234,6 +234,37 @@ The following methods now also correctly output values for unobserved categories | |
|
||
df.groupby(["cat_1", "cat_2"], observed=False)["value"].count() | ||
|
||
:meth:`pandas.array` inference changes | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:meth:`pandas.array` now infers pandas' new extension types in several cases: | ||
|
||
1. Sting data (including missing values) now returns a :class:`arrays.StringArray`. | ||
Review comment: Sting -> String
||
2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`. | ||
|
||
*pandas 0.25.x* | ||
Review comment: Side issue: we are pretty inconsistent about showing the previous version in the whatsnew.
||
|
||
.. code-block:: python | ||
|
||
>>> pd.array(["a", None]) | ||
<PandasArray> | ||
['a', None] | ||
Length: 2, dtype: object | ||
|
||
>>> pd.array([1, None]) | ||
<PandasArray> | ||
[1, None] | ||
Length: 2, dtype: object | ||
|
||
|
||
*pandas 1.0.0* | ||
|
||
.. ipython:: python | ||
|
||
pd.array(["a", None]) | ||
pd.array([1, None]) | ||
|
||
As a reminder, you can specify the ``dtype`` to disable all inference. | ||
|
||
.. _whatsnew_1000.api_breaking.deps: | ||
|
||
|
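The whatsnew entry above can be tried out with a short sketch like the following. It assumes pandas 1.0 or later; the variable names are illustrative only, and the exact repr and string dtype name may differ between pandas versions, so the checks look at dtype classes rather than printed output.

```python
import pandas as pd

# With no dtype given, pd.array infers an extension type from the data.
strings = pd.array(["a", None])
integers = pd.array([1, None])

print(type(strings).__name__, strings.dtype)
print(type(integers).__name__, integers.dtype)

# Missing values are kept as missing rather than forcing an object fallback.
print(pd.isna(strings[1]), pd.isna(integers[1]))

# Passing dtype explicitly disables inference, as the whatsnew note says.
explicit = pd.array([1, None], dtype=object)
print(type(explicit).__name__, explicit.dtype)
```

Passing `dtype=object` here is one way to get back the object-backed result shown for pandas 0.25.x.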
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -94,11 +94,18 @@ def array( | |
:class:`pandas.Period` :class:`pandas.arrays.PeriodArray` | ||
:class:`datetime.datetime` :class:`pandas.arrays.DatetimeArray` | ||
:class:`datetime.timedelta` :class:`pandas.arrays.TimedeltaArray` | ||
:class:`int` :class:`pandas.arrays.IntegerArray` | ||
:class:`str` :class:`pandas.arrays.StringArray` | ||
============================== ===================================== | ||
|
||
For all other cases, NumPy's usual inference rules will be used. | ||
|
||
copy : bool, default True | ||
.. versionchanged:: 1.0.0 | ||
|
||
Pandas infers nullable-integer dtype for integer data and | ||
string dtype for string data. | ||
|
||
copy : bool, default True | ||
Whether to copy the data, even if not necessary. Depending | ||
on the type of `data`, creating the new array may require | ||
copying data, even if ``copy=False``. | ||
|
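As a rough illustration of the scalar-type table in the docstring hunk above, the sketch below builds one small input per row and records which array class `pd.array` returns. The `examples` and `result_types` names are made up for this sketch, and the class returned for string data may vary with the pandas version and string storage backend.

```python
import datetime

import pandas as pd

# One sample input per scalar type listed in the docstring table.
examples = {
    "Period": [pd.Period("2000-01", freq="M")],
    "datetime": [datetime.datetime(2000, 1, 1)],
    "timedelta": [datetime.timedelta(days=1)],
    "int": [1, 2],
    "str": ["a", "b"],
}

# Record the class name of the array inferred for each input.
result_types = {
    name: type(pd.array(values)).__name__ for name, values in examples.items()
}

for name, array_type in result_types.items():
    print(f"{name:>10} -> {array_type}")
```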
@@ -246,21 +253,25 @@ def array( | |
""" | ||
from pandas.core.arrays import ( | ||
period_array, | ||
IntegerArray, | ||
IntervalArray, | ||
PandasArray, | ||
DatetimeArray, | ||
TimedeltaArray, | ||
StringArray, | ||
) | ||
|
||
if lib.is_scalar(data): | ||
msg = "Cannot pass scalar '{}' to 'pandas.array'." | ||
raise ValueError(msg.format(data)) | ||
|
||
data = extract_array(data, extract_numpy=True) | ||
|
||
if dtype is None and isinstance(data, ABCExtensionArray): | ||
if dtype is None and isinstance( | ||
|
||
data, (ABCSeries, ABCIndexClass, ABCExtensionArray) | ||
): | ||
dtype = data.dtype | ||
|
||
data = extract_array(data, extract_numpy=True) | ||
|
||
|
||
# this returns None for not-found dtypes. | ||
if isinstance(dtype, str): | ||
dtype = registry.find(dtype) or dtype | ||
|
@@ -298,6 +309,12 @@ def array( | |
# timedelta, timedelta64 | ||
return TimedeltaArray._from_sequence(data, copy=copy) | ||
|
||
elif inferred_dtype in {"string", "mixed-string"}: | ||
return StringArray._from_sequence(data, copy=copy) | ||
|
||
elif inferred_dtype in {"integer", "mixed-integer"}: | ||
return IntegerArray._from_sequence(data, copy=copy) | ||
|
||
|
||
# TODO(BooleanArray): handle this type | ||
Review comment: Not necessarily for this PR, but is it viable to handle BooleanArray here now?
||
|
||
# Pandas overrides NumPy for | ||
|
Review comment: This is my attempt to explain the inconsistency between pd.array and pd.Series we're introducing. It's not ideal, but I think it's the right behavior for now.

Review comment: Eventually we want to share code between these (and ideally also the Index constructor), right?

Review comment: In the issue (#29791), I mentioned for that reason the idea of a keyword to control this preference between old or new dtypes (but we can also introduce that at the moment we want to share code, if that seems useful then).
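The inconsistency discussed in this thread can be seen with integer data that contains a missing value. This is a minimal sketch assuming pandas 1.0 or later; the variable names are illustrative, and it does not use any new keyword of the kind proposed in #29791.

```python
import pandas as pd

data = [1, None]

# pd.array prefers the nullable extension dtype.
arr = pd.array(data)

# pd.Series keeps NumPy's inference, so the missing value forces float64.
ser = pd.Series(data)

# Requesting the extension dtype explicitly makes the Series match pd.array.
ser_nullable = pd.Series(data, dtype="Int64")

print(arr.dtype, ser.dtype, ser_nullable.dtype)
```

Until the constructors share inference code, passing `dtype` explicitly is the way to get the same result from both.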