API: Uses pd.NA in IntegerArray #29964

Merged
merged 57 commits into from
Dec 30, 2019
Changes from 51 commits
1eec965
API: Uses pd.NA in IntegerArray
TomAugspurger Dec 2, 2019
f5f61ea
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 2, 2019
c569562
wip
TomAugspurger Dec 2, 2019
a8261a4
wip
TomAugspurger Dec 3, 2019
c8ff04f
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 3, 2019
cddc9df
fixup value counts
TomAugspurger Dec 3, 2019
9488d34
fixed to_numpy
TomAugspurger Dec 3, 2019
0d5aab8
doc
TomAugspurger Dec 3, 2019
fa61a6d
wip
TomAugspurger Dec 3, 2019
de2c6c6
wip
TomAugspurger Dec 3, 2019
60d7663
wip
TomAugspurger Dec 3, 2019
a4c4618
fixup extension
TomAugspurger Dec 3, 2019
0a500be
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
1c716f3
update tests
TomAugspurger Dec 4, 2019
67c8d51
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
22a2bc7
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
34de18e
updates
TomAugspurger Dec 4, 2019
78944d1
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 5, 2019
ffbe299
wip
TomAugspurger Dec 5, 2019
7abf40e
API: Handle pow & rpow special cases
TomAugspurger Dec 5, 2019
36d403d
move
TomAugspurger Dec 6, 2019
f6b4062
Merge remote-tracking branch 'upstream/master' into na-pow
TomAugspurger Dec 6, 2019
945e8cd
revert
TomAugspurger Dec 6, 2019
04546f3
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 6, 2019
a493965
Merge remote-tracking branch 'upstream/master' into na-pow
TomAugspurger Dec 6, 2019
8fc8b3a
fixup
TomAugspurger Dec 6, 2019
a49aa65
handle negative
TomAugspurger Dec 6, 2019
8ad166d
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 6, 2019
dd745c3
Merge branch 'na-pow' into NA-scalar+IntegerArray
TomAugspurger Dec 6, 2019
88fa412
expand test
TomAugspurger Dec 6, 2019
0902eef
wip
TomAugspurger Dec 6, 2019
721a1ea
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 9, 2019
c658307
fixup
TomAugspurger Dec 9, 2019
4f9d775
exceptions
TomAugspurger Dec 9, 2019
1244ef4
wip
TomAugspurger Dec 9, 2019
4a34b45
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 9, 2019
5293d87
fixup
TomAugspurger Dec 9, 2019
39f225a
arrow
TomAugspurger Dec 9, 2019
ea19b2d
update
TomAugspurger Dec 9, 2019
fe2d98e
fixup
TomAugspurger Dec 10, 2019
68fe155
update
TomAugspurger Dec 10, 2019
f27a5c2
fixup
TomAugspurger Dec 10, 2019
b97450b
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 16, 2019
5d62af8
updates
TomAugspurger Dec 16, 2019
2bf57d6
test, repr
TomAugspurger Dec 16, 2019
2f4e1cd
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 17, 2019
021dc7b
fixup
TomAugspurger Dec 17, 2019
197f18b
enable
TomAugspurger Dec 17, 2019
259b779
fixup
TomAugspurger Dec 17, 2019
c0cfef9
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 18, 2019
3183d53
ints
TomAugspurger Dec 18, 2019
4986d84
restore comment
TomAugspurger Dec 18, 2019
76806e9
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 30, 2019
64b4ccc
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 30, 2019
b39dc60
docs
TomAugspurger Dec 30, 2019
800158d
docs
TomAugspurger Dec 30, 2019
e5d6832
fixup
TomAugspurger Dec 30, 2019
25 changes: 24 additions & 1 deletion doc/source/user_guide/integer_na.rst
@@ -15,14 +15,16 @@ Nullable integer data type
IntegerArray is currently experimental. Its API or implementation may
change without warning.


In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.

Construction
------------

Pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension types <extending.extension-types>`
implemented within pandas.
@@ -39,6 +41,12 @@ NumPy's ``'int64'`` dtype:

pd.array([1, 2, np.nan], dtype="Int64")

All NA-like values are replaced with :attr:`pandas.NA`.
Contributor

may want to add a versionchanged tag here (and below)


.. ipython:: python

pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

@@ -78,6 +86,9 @@ with the dtype.
In the future, we may provide an option for :class:`Series` to infer a
nullable-integer dtype.

Operations
----------

Operations involving an integer array will behave similar to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.
@@ -123,3 +134,15 @@ Reduction and groupby operations such as 'sum' work as well.

df.sum()
df.groupby('B').A.sum()

Scalar NA Value
---------------

:class:`arrays.IntegerArray` uses :attr:`pandas.NA` as its scalar
missing value. Slicing a single element that's missing will return
:attr:`pandas.NA`

.. ipython:: python

a = pd.array([1, None], dtype="Int64")
a[1]
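
For reference, the documented behavior above can be checked with a short sketch like the following. It assumes a pandas release that includes this change (1.0 or later); the variable names are illustrative only and are not part of the PR.

```python
import numpy as np
import pandas as pd

# Every NA-like input (np.nan, None, pd.NA) ends up as a masked slot.
arr = pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

# Selecting a single missing element yields the pd.NA singleton,
# while a valid element comes back as an integer.
missing = arr[2]
valid = arr[0]
na_count = int(arr.isna().sum())

print(missing is pd.NA, valid, na_count)
```

Checking identity with `is pd.NA` (or using `pd.isna`) is the safer test here, because `missing == pd.NA` itself evaluates to `pd.NA` rather than to a boolean.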
6 changes: 4 additions & 2 deletions pandas/core/arrays/boolean.py
@@ -712,7 +712,6 @@ def all(self, skipna: bool = True, **kwargs):
@classmethod
def _create_logical_method(cls, op):
def logical_method(self, other):

if isinstance(other, (ABCDataFrame, ABCSeries, ABCIndexClass)):
# Rely on pandas to unbox and dispatch to us.
return NotImplemented
@@ -760,8 +759,11 @@ def logical_method(self, other):
@classmethod
def _create_comparison_method(cls, op):
def cmp_method(self, other):
from pandas.arrays import IntegerArray

if isinstance(other, (ABCDataFrame, ABCSeries, ABCIndexClass)):
if isinstance(
other, (ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray)
):
# Rely on pandas to unbox and dispatch to us.
return NotImplemented

111 changes: 74 additions & 37 deletions pandas/core/arrays/integer.py
@@ -1,10 +1,10 @@
import numbers
from typing import Type
from typing import Any, Tuple, Type
import warnings

import numpy as np

from pandas._libs import lib
from pandas._libs import lib, missing as libmissing
from pandas.compat import set_function_name
from pandas.util._decorators import cache_readonly

@@ -44,7 +44,7 @@ class _IntegerDtype(ExtensionDtype):
name: str
base = None
type: Type
na_value = np.nan
na_value = libmissing.NA

def __repr__(self) -> str:
sign = "U" if self.is_unsigned_integer else ""
@@ -263,6 +263,11 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):

.. versionadded:: 0.24.0

.. versionchanged:: 1.0.0

Now uses :attr:`pandas.NA` as its missing value, rather
than :attr:`numpy.nan`.

.. warning::

IntegerArray is currently experimental, and its API or internal
@@ -358,29 +363,37 @@ def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
def _from_factorized(cls, values, original):
return integer_array(values, dtype=original.dtype)

def _formatter(self, boxed=False):
def fmt(x):
if isna(x):
return "NaN"
return str(x)

return fmt

def __getitem__(self, item):
if is_integer(item):
if self._mask[item]:
return self.dtype.na_value
return self._data[item]
return type(self)(self._data[item], self._mask[item])

def _coerce_to_ndarray(self):
def _coerce_to_ndarray(self, dtype=None, na_value=lib._no_default):
"""
coerce to an ndarray of object dtype
"""
if dtype is None:
dtype = object

if na_value is lib._no_default and is_float_dtype(dtype):
na_value = np.nan
elif na_value is lib._no_default:
na_value = libmissing.NA

if is_integer_dtype(dtype):
# Specifically, a NumPy integer dtype, not a pandas integer dtype,
# since we're coercing to a numpy dtype by definition in this function.
if not self.isna().any():
return self._data.astype(dtype)
else:
raise ValueError(
"cannot convert to integer NumPy array with missing values"
)

# TODO(jreback) make this better
data = self._data.astype(object)
data[self._mask] = self._na_value
data = self._data.astype(dtype)
data[self._mask] = na_value
return data

__array_priority__ = 1000 # higher than ndarray so ops dispatch to us
@@ -390,7 +403,7 @@ def __array__(self, dtype=None):
the array interface, return my values
We return an object array here to preserve our scalar values
"""
return self._coerce_to_ndarray()
return self._coerce_to_ndarray(dtype=dtype)

def __arrow_array__(self, type=None):
"""
@@ -506,7 +519,7 @@ def isna(self):

@property
def _na_value(self):
return np.nan
return self.dtype.na_value

@classmethod
def _concat_same_type(cls, to_concat):
@@ -545,7 +558,7 @@ def astype(self, dtype, copy=True):
return type(self)(result, mask=self._mask, copy=False)

# coerce
data = self._coerce_to_ndarray()
data = self._coerce_to_ndarray(dtype=dtype)
return astype_nansafe(data, dtype, copy=None)

@property
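
As a rough illustration of the coercion rules in the hunks above, seen through the public `astype` method: missing values become `NaN` for a float target, and conversion to a NumPy integer dtype is refused while any value is missing. This is a sketch assuming pandas 1.0 or later; the exact error message is not relied on, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.array([1, None], dtype="Int64")

# Float target: the missing slot is filled with NaN.
as_float = a.astype("float64")

# NumPy integer target: not possible while a value is missing.
try:
    a.astype("int64")
    int_error = None
except ValueError as exc:
    int_error = exc

# Without missing values the integer conversion goes through.
no_missing = pd.array([1, 2], dtype="Int64").astype("int64")

print(as_float, type(int_error).__name__, no_missing)
```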
@@ -600,12 +613,19 @@ def value_counts(self, dropna=True):
# w/o passing the dtype
array = np.append(array, [self._mask.sum()])
index = Index(
np.concatenate([index.values, np.array([np.nan], dtype=object)]),
np.concatenate(
[index.values, np.array([self.dtype.na_value], dtype=object)]
),
dtype=object,
)

return Series(array, index=index)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
# TODO: https://github.com/pandas-dev/pandas/issues/30037
# use masked algorithms, rather than object-dtype / np.nan.
return self._coerce_to_ndarray(na_value=np.nan), np.nan

def _values_for_argsort(self) -> np.ndarray:
"""Return values for sorting.

@@ -629,9 +649,11 @@ def _create_comparison_method(cls, op):

@unpack_zerodim_and_defer(op.__name__)
def cmp_method(self, other):
from pandas.arrays import BooleanArray

mask = None

if isinstance(other, IntegerArray):
if isinstance(other, (BooleanArray, IntegerArray)):
other, mask = other._data, other._mask

elif is_list_like(other):
@@ -643,25 +665,30 @@ def cmp_method(self, other):
if len(self) != len(other):
raise ValueError("Lengths must match to compare")

# numpy will show a DeprecationWarning on invalid elementwise
# comparisons, this will raise in the future
Member

See previous question about this. Is this comment no longer relevant or correct? Or why was it removed?

Contributor Author

I'm not sure, do you know how this is actually hit? If NumPy is going to raise in the future, shouldn't they be seeing that warning?

Member

I think this is about the warning you get with comparisons with objects / non-broadcastable arrays. Eg:

In [29]: np.array([1, 2]) == "b"   
/home/joris/miniconda3/envs/dev/bin/ipython:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[29]: False

In [30]: pd.array([1, 2]) == "b" 
Out[30]: array([False, False])

(it seems IntegerArray already handles this fine; not sure there is an explicit test for that)

Contributor Author

> it seems IntegerArray already handles this fine,

Gotcha. It's silencing the same warning from NumPy, and falling back to invalid_comparison, which returns the expected result. I'll restore the comment.

Contributor Author

Actually... the comment is incorrect. NumPy will perform elementwise comparison in the future, not raise. If they were to raise on that in the future the implementation would be incorrect.

Though I'm still a bit confused, as the NumPy op is returning NotImplemented since we're calling it directly. Will that continue to return NotImplemented? Or will the elementwise result be different?

with warnings.catch_warnings():
warnings.filterwarnings("ignore", "elementwise", FutureWarning)
with np.errstate(all="ignore"):
method = getattr(self._data, f"__{op_name}__")
result = method(other)
if other is libmissing.NA:
# numpy does not handle pd.NA well as "other" scalar (it returns
# a scalar False instead of an array)
# This may be fixed by NA.__array_ufunc__. Revisit this check
# once that's implemented.
result = np.zeros(self._data.shape, dtype="bool")
mask = np.ones(self._data.shape, dtype="bool")
else:
with warnings.catch_warnings():
warnings.filterwarnings("ignore", "elementwise", FutureWarning)
with np.errstate(all="ignore"):
method = getattr(self._data, f"__{op_name}__")
result = method(other)

if result is NotImplemented:
result = invalid_comparison(self._data, other, op)

# nans propagate
if mask is None:
mask = self._mask
mask = self._mask.copy()
else:
mask = self._mask | mask

result[mask] = op_name == "ne"
return result
return BooleanArray(result, mask)

name = f"__{op.__name__}__"
return set_function_name(cmp_method, name, cls)
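
A small sketch of what the comparison change above looks like from the user side, assuming pandas 1.0 or later: comparisons return a nullable boolean array in which missing inputs stay missing, and comparing against `pd.NA` gives an all-missing result. The variable names are illustrative.

```python
import pandas as pd

a = pd.array([1, 2, None], dtype="Int64")

# Comparison with a scalar: the result has the nullable "boolean" dtype,
# and the missing element stays missing instead of becoming False.
eq = a == 1
ne = a != 1

# Comparison with pd.NA: every element of the result is missing.
all_na = a == pd.NA

print(eq)
print(ne)
print(all_na)
```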
@@ -673,7 +700,8 @@ def _reduce(self, name, skipna=True, **kwargs):
# coerce to a nan-aware float if needed
if mask.any():
data = self._data.astype("float64")
data[mask] = self._na_value
# We explicitly use NaN within reductions.
data[mask] = np.nan

op = getattr(nanops, "nan" + name)
result = op(data, axis=0, skipna=skipna, mask=mask, **kwargs)
@@ -739,12 +767,13 @@ def integer_arithmetic_method(self, other):
raise TypeError("can only perform ops with numeric values")

else:
if not (is_float(other) or is_integer(other)):
if not (is_float(other) or is_integer(other) or other is libmissing.NA):
raise TypeError("can only perform ops with numeric values")

# nans propagate
if omask is None:
mask = self._mask.copy()
if other is libmissing.NA:
mask |= True
else:
mask = self._mask | omask

@@ -754,20 +783,23 @@ def integer_arithmetic_method(self, other):
# x ** 0 is 1.
if omask is not None:
mask = np.where((other == 0) & ~omask, False, mask)
else:
elif other is not libmissing.NA:
mask = np.where(other == 0, False, mask)

elif op_name == "rpow":
# 1 ** x is 1.
if omask is not None:
mask = np.where((other == 1) & ~omask, False, mask)
else:
elif other is not libmissing.NA:
mask = np.where(other == 1, False, mask)
# x ** 0 is 1.
mask = np.where((self._data == 0) & ~self._mask, False, mask)

with np.errstate(all="ignore"):
result = op(self._data, other)
if other is libmissing.NA:
result = np.ones_like(self._data)
else:
with np.errstate(all="ignore"):
result = op(self._data, other)

# divmod returns a tuple
if op_name == "divmod":
@@ -790,6 +822,11 @@ def integer_arithmetic_method(self, other):
_dtype_docstring = """
An ExtensionDtype for {dtype} integer data.

.. versionchanged:: 1.0.0

Now uses :attr:`pandas.NA` as its missing value,
rather than :attr:`numpy.nan`.

Attributes
----------
None
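
To make the arithmetic hunks above concrete, here is a minimal sketch of the user-visible results, assuming pandas 1.0 or later: `pd.NA` as the other operand masks every element, while the `pow` and `rpow` special cases (`x ** 0` and `1 ** x`) produce 1 even where `x` is missing. The variable names are illustrative.

```python
import pandas as pd

a = pd.array([0, 2, None], dtype="Int64")

# pd.NA as the other operand propagates to every element.
plus_na = a + pd.NA

# x ** 0 is 1, even when x is missing.
pow_zero = a ** 0

# 1 ** x is 1, even when x is missing.
one_rpow = 1 ** a

print(plus_na)
print(pow_zero)
print(one_rpow)
```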