API: Infer extension types in array #29799

Merged
merged 16 commits into from
Dec 2, 2019
Changes from 1 commit
34 changes: 25 additions & 9 deletions doc/source/user_guide/integer_na.rst
@@ -25,8 +25,7 @@ numbers.

 Pandas can represent integer data with possibly missing values using
 :class:`arrays.IntegerArray`. This is an :ref:`extension types <extending.extension-types>`
-implemented within pandas. It is not the default dtype for integers, and will not be inferred;
-you must explicitly pass the dtype into :meth:`array` or :class:`Series`:
+implemented within pandas.

 .. ipython:: python

@@ -50,17 +49,34 @@ NumPy array.
 You can also pass the list-like object to the :class:`Series` constructor
 with the dtype.

-.. ipython:: python
+.. warning::

-   s = pd.Series([1, 2, np.nan], dtype="Int64")
-   s
+   Currently :meth:`pandas.array` and :meth:`pandas.Series` use different
Contributor Author

This is my attempt to explain the inconsistency between pd.array and pd.Series we're introducing. It's not ideal, but I think it's the right behavior for now.
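
To make the inconsistency concrete, here is a minimal sketch, assuming pandas 1.0 or later is installed. It only inspects the dtype each constructor infers when no ``dtype`` is passed; the variable names are illustrative and not taken from the PR.

```python
import pandas as pd

# pd.array infers the nullable integer extension dtype,
# whether or not a missing value is present.
arr_with_na = pd.array([1, None])
arr_no_na = pd.array([1, 2])
print(arr_with_na.dtype, arr_no_na.dtype)

# pd.Series keeps the older NumPy-based inference for backwards
# compatibility: a missing value forces a float dtype, otherwise a
# NumPy integer dtype is used.
ser_with_na = pd.Series([1, None])
ser_no_na = pd.Series([1, 2])
print(ser_with_na.dtype, ser_no_na.dtype)
```

The same input therefore yields a nullable integer array from one constructor and a float Series from the other, which is the gap the warning in the docs describes.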

Member

Eventually we want to share code between these (and ideally also the Index constructor), right?

Member

In the issue (#29791), I mentioned for that reason the idea of a keyword to control this preference between old or new dtypes.
(but we can also introduce that at the moment we want to share code if that seems useful then)
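
Until a keyword like the one floated here exists, passing ``dtype`` explicitly is the way to get the same result from both constructors. A small sketch, assuming pandas 1.0 or later; nothing in it is part of this PR.

```python
import numpy as np
import pandas as pd

data = [1, None]

# An explicit dtype bypasses inference, so both constructors agree.
arr = pd.array(data, dtype="Int64")
ser = pd.Series(data, dtype="Int64")
assert arr.dtype == ser.dtype

# Asking for a NumPy dtype keeps pd.array on the NumPy-backed path
# instead of the nullable integer array it would otherwise infer.
np_backed = pd.array([1, 2], dtype="int64")
print(type(np_backed).__name__, np.asarray(np_backed).dtype)
```

The class name printed for the NumPy-backed array depends on the pandas version, so it is not asserted here.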

+   rules for dtype inference. :meth:`pandas.array` will infer a nullable-
+   integer dtype

-By default (if you don't specify ``dtype``), NumPy is used, and you'll end
-up with a ``float64`` dtype Series:
+   .. ipython:: python

-.. ipython:: python
+      pd.array([1, None])
+      pd.array([1, 2])
+
+   For backwards-compatibility, :class:`Series` infers these as either
+   integer or float dtype
+
+   .. ipython:: python
+
+      pd.Series([1, None])
+      pd.Series([1, 2])
+
+   We recommend explicitly providing the dtype to avoid confusion.
+
+   .. ipython:: python
+
+      pd.array([1, None], dtype="Int64")
+      pd.Series([1, None], dtype="Int64")

-   pd.Series([1, 2, np.nan])
+   In the future, we may provide an option for :class:`Series` to infer a
+   nullable-integer dtype.

 Operations involving an integer array will behave similar to NumPy arrays.
 Missing values will be propagated, and the data will be coerced to another
31 changes: 31 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -234,6 +234,37 @@ The following methods now also correctly output values for unobserved categories

     df.groupby(["cat_1", "cat_2"], observed=False)["value"].count()

+:meth:`pandas.array` inference changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`pandas.array` now infers pandas' new extension types in several cases:
+
+1. Sting data (including missing values) now returns a :class:`arrays.StringArray`.
Member

Sting -> String

+2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`.
+
+*pandas 0.25.x*
Contributor

side issue, we are pretty inconsistent on showing the previous version in the whatsnew


+.. code-block:: python
+
+   >>> pd.array(["a", None])
+   <PandasArray>
+   ['a', None]
+   Length: 2, dtype: object
+
+   >>> pd.array([1, None])
+   <PandasArray>
+   [1, None]
+   Length: 2, dtype: object
+
+
+*pandas 1.0.0*
+
+.. ipython:: python
+
+   pd.array(["a", None])
+   pd.array([1, None])
+
+As a reminder, you can specify the ``dtype`` to disable all inference.

 .. _whatsnew_1000.api_breaking.deps:

8 changes: 6 additions & 2 deletions pandas/_libs/lib.pyx
@@ -1113,6 +1113,7 @@ def infer_dtype(value: object, skipna: object=None) -> str:
     Results can include:

     - string
+    - mixed-string
     - unicode
     - bytes
     - floating
@@ -1319,8 +1320,11 @@ def infer_dtype(value: object, skipna: object=None) -> str:
             return 'boolean'

     elif isinstance(val, str):
-        if is_string_array(values, skipna=skipna):
-            return 'string'
+        if is_string_array(values, skipna=True):
+            if isnaobj(values).any():
+                return "mixed-string"
+            else:
+                return "string"

     elif isinstance(val, bytes):
         if is_bytes_array(values, skipna=skipna):
25 changes: 21 additions & 4 deletions pandas/core/construction.py
@@ -94,11 +94,18 @@ def array(
         :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
         :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
         :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
+        :class:`int`                   :class:`pandas.arrays.IntegerArray`
+        :class:`str`                   :class:`pandas.arrays.StringArray`
         ============================== =====================================

         For all other cases, NumPy's usual inference rules will be used.

-    copy : bool, default True
+        .. versionchanged:: 1.0.0
+
+           Pandas infers nullable-integer dtype for integer data and
+           string dtype for string data.
+
+    copy : bool, default True
         Whether to copy the data, even if not necessary. Depending
         on the type of `data`, creating the new array may require
         copying data, even if ``copy=False``.
@@ -246,21 +253,25 @@ def array(
     """
     from pandas.core.arrays import (
         period_array,
+        IntegerArray,
         IntervalArray,
         PandasArray,
         DatetimeArray,
         TimedeltaArray,
+        StringArray,
     )

     if lib.is_scalar(data):
         msg = "Cannot pass scalar '{}' to 'pandas.array'."
         raise ValueError(msg.format(data))

-    data = extract_array(data, extract_numpy=True)
-
-    if dtype is None and isinstance(data, ABCExtensionArray):
+    if dtype is None and isinstance(
+        data, (ABCSeries, ABCIndexClass, ABCExtensionArray)
+    ):
         dtype = data.dtype

+    data = extract_array(data, extract_numpy=True)
+
     # this returns None for not-found dtypes.
     if isinstance(dtype, str):
         dtype = registry.find(dtype) or dtype
@@ -298,6 +309,12 @@ def array(
             # timedelta, timedelta64
             return TimedeltaArray._from_sequence(data, copy=copy)

+        elif inferred_dtype in {"string", "mixed-string"}:
+            return StringArray._from_sequence(data, copy=copy)
+
+        elif inferred_dtype in {"integer", "mixed-integer"}:
+            return IntegerArray._from_sequence(data, copy=copy)
+
         # TODO(BooleanArray): handle this type
Member

not necessarily for this PR, but is it viable to handle BooleanArray here now?
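
For context on this question, a rough sketch of what such a branch could look like, written against public pandas API rather than the internals touched in this hunk. It assumes pandas 1.0 or later, where the ``"boolean"`` and ``"string"`` dtype aliases are available. The function name `array_with_inference` and the mapping are hypothetical and not part of this PR.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical mapping from infer_dtype results to extension dtype aliases.
_INFERRED_TO_DTYPE = {
    "string": "string",
    "integer": "Int64",
    "boolean": "boolean",  # the TODO(BooleanArray) case, as an assumption
}


def array_with_inference(data):
    """Pick an extension dtype from the inferred kind, else defer to pd.array."""
    inferred = infer_dtype(data, skipna=True)
    dtype = _INFERRED_TO_DTYPE.get(inferred)  # None means "let pd.array decide"
    return pd.array(data, dtype=dtype)


print(array_with_inference([True, None]).dtype)
print(array_with_inference([1, None]).dtype)
print(array_with_inference(["a", None]).dtype)
```

Using ``skipna=True`` here means missing values never change the inferred kind, so one lookup per kind is enough.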


     # Pandas overrides NumPy for
25 changes: 19 additions & 6 deletions pandas/tests/arrays/test_array.py
@@ -19,14 +19,14 @@
     "data, dtype, expected",
     [
         # Basic NumPy defaults.
-        ([1, 2], None, PandasArray(np.array([1, 2]))),
+        ([1, 2], None, pd.arrays.IntegerArray._from_sequence([1, 2])),
         ([1, 2], object, PandasArray(np.array([1, 2], dtype=object))),
         (
             [1, 2],
             np.dtype("float32"),
             PandasArray(np.array([1.0, 2.0], dtype=np.dtype("float32"))),
         ),
-        (np.array([1, 2]), None, PandasArray(np.array([1, 2]))),
+        (np.array([1, 2]), None, pd.arrays.IntegerArray._from_sequence([1, 2])),
         # String alias passes through to NumPy
         ([1, 2], "float32", PandasArray(np.array([1, 2], dtype="float32"))),
         # Period alias
@@ -113,6 +113,13 @@
         # IntegerNA
         ([1, None], "Int16", integer_array([1, None], dtype="Int16")),
         (pd.Series([1, 2]), None, PandasArray(np.array([1, 2], dtype=np.int64))),
+        # String
+        (["a", None], "string", pd.arrays.StringArray._from_sequence(["a", None])),
+        (
+            ["a", None],
+            pd.StringDtype(),
+            pd.arrays.StringArray._from_sequence(["a", None]),
+        ),
         # Index
         (pd.Index([1, 2]), None, PandasArray(np.array([1, 2], dtype=np.int64))),
         # Series[EA] returns the EA
@@ -139,15 +146,15 @@ def test_array(data, dtype, expected):
 def test_array_copy():
     a = np.array([1, 2])
     # default is to copy
-    b = pd.array(a)
+    b = pd.array(a, dtype=a.dtype)
     assert np.shares_memory(a, b._ndarray) is False

     # copy=True
-    b = pd.array(a, copy=True)
+    b = pd.array(a, dtype=a.dtype, copy=True)
     assert np.shares_memory(a, b._ndarray) is False

     # copy=False
-    b = pd.array(a, copy=False)
+    b = pd.array(a, dtype=a.dtype, copy=False)
     assert np.shares_memory(a, b._ndarray) is True


@@ -211,6 +218,12 @@ def test_array_copy():
             np.array([1, 2], dtype="m8[us]"),
             pd.arrays.TimedeltaArray(np.array([1000, 2000], dtype="m8[ns]")),
         ),
+        # integer
+        ([1, 2], pd.arrays.IntegerArray._from_sequence([1, 2])),
+        ([1, None], pd.arrays.IntegerArray._from_sequence([1, None])),
+        # string
+        (["a", "b"], pd.arrays.StringArray._from_sequence(["a", "b"])),
+        (["a", None], pd.arrays.StringArray._from_sequence(["a", None])),
     ],
 )
 def test_array_inference(data, expected):
@@ -241,7 +254,7 @@ def test_array_inference_fails(data):
 @pytest.mark.parametrize("data", [np.array([[1, 2], [3, 4]]), [[1, 2], [3, 4]]])
 def test_nd_raises(data):
     with pytest.raises(ValueError, match="PandasArray must be 1-dimensional"):
-        pd.array(data)
+        pd.array(data, dtype="int64")


 def test_scalar_raises():
9 changes: 6 additions & 3 deletions pandas/tests/dtypes/test_inference.py
@@ -732,12 +732,15 @@ def test_string(self):
     def test_unicode(self):
         arr = ["a", np.nan, "c"]
         result = lib.infer_dtype(arr, skipna=False)
-        assert result == "mixed"
+        assert result == "mixed-string"

         arr = ["a", np.nan, "c"]
         result = lib.infer_dtype(arr, skipna=True)
-        expected = "string"
-        assert result == expected
+        assert result == "string"
+
+        arr = ["a", "c"]
+        result = lib.infer_dtype(arr, skipna=False)
+        assert result == "string"

     @pytest.mark.parametrize(
         "dtype, missing, skipna, expected",
Expand Down
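
The three `test_unicode` expectations in this commit can be mirrored by a small pure-Python sketch. This is not the Cython implementation in `lib.pyx`: the helper names `_is_missing` and `infer_string_kind` are hypothetical, only `None` and float NaN are treated as missing, and everything that is not string-like collapses to `"mixed"`.

```python
import math


def _is_missing(value):
    # Simplified stand-in for pandas' isnaobj: only None and float NaN count.
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_string_kind(values, skipna=False):
    """Classify a list as "string", "mixed-string", or "mixed" (hypothetical helper)."""
    if skipna:
        # Assumed to mirror infer_dtype dropping missing values up front.
        values = [v for v in values if not _is_missing(v)]
    present = [v for v in values if not _is_missing(v)]
    if not present or not all(isinstance(v, str) for v in present):
        return "mixed"
    # Only strings remain apart from missing values: report whether any were seen.
    return "mixed-string" if len(present) != len(values) else "string"


print(infer_string_kind(["a", float("nan"), "c"], skipna=False))  # mixed-string
print(infer_string_kind(["a", float("nan"), "c"], skipna=True))   # string
print(infer_string_kind(["a", "c"], skipna=False))                # string
```

Keeping `"mixed-string"` separate from `"string"` is what lets a caller such as `pandas.array` route both results to a string array while other callers can still tell that missing values were present.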