
API: IntegerArray reductions: data type of the scalar result? #23106


Open
jorisvandenbossche opened this issue Oct 12, 2018 · 5 comments
Labels: Bug, Dtype Conversions, ExtensionArray, NA - MaskedArrays


jorisvandenbossche (Member) commented Oct 12, 2018

Follow-up on #22762, which added _reduce to the ExtensionArray interface and an implementation for IntegerArray.

Currently, for that IntegerArray we special-case 'sum', 'min', 'max' (and 'prod') to make sure they return ints (because the underlying implementation can be based on a float array, depending on whether there are missing values).

# if we have a preservable numeric op,
# provide coercion back to an integer type if possible
elif name in ['sum', 'min', 'max', 'prod'] and notna(result):
int_result = int(result)
if int_result == result:
result = int_result
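
For context, here is the same coercion as a standalone numpy sketch (the values are illustrative; with missing values the reduction runs on a float array, hence the float intermediate):

import numpy as np

data = np.array([1.0, np.nan, 3.0])   # float-backed because of the missing value
result = np.nansum(data)              # 4.0, a numpy.float64

# coerce back to int when the value round-trips exactly
int_result = int(result)
if int_result == result:
    result = int_result               # 4, a plain python int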

That basic check has the following "problem":

  • Some reductions will return python integers, while others return numpy scalars. Accessing single elements also gives numpy scalars.
In [3]: pd.Series(pd.core.arrays.integer_array([1, None, 3])).mean()
Out[3]: 2.0

In [4]: type(_)
Out[4]: numpy.float64

In [5]: pd.Series(pd.core.arrays.integer_array([1, None, 3])).sum()
Out[5]: 4

In [6]: type(_)
Out[6]: int

In [7]: type(pd.Series(pd.core.arrays.integer_array([1, None, 3]))[0])
Out[7]: numpy.int64
jreback (Contributor) commented Oct 12, 2018

here is a patch, but some tests are failing:

(pandas) bash-3.2$ git diff
diff --git a/pandas/core/arrays/integer.py b/pandas/core/arrays/integer.py
index 9917045f2..12b396ffd 100644
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -267,7 +267,13 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         if is_integer(item):
             if self._mask[item]:
                 return self.dtype.na_value
-            return self._data[item]
+            result = self._data[item]
+
+            if is_scalar(result):
+                # cast to python ints
+                result = int(result)
+            return result
+
         return type(self)(self._data[item], self._mask[item])
 
     def _coerce_to_ndarray(self):
diff --git a/pandas/tests/extension/test_integer.py b/pandas/tests/extension/test_integer.py
index 89c36bbe7..4bab4bd3f 100644
--- a/pandas/tests/extension/test_integer.py
+++ b/pandas/tests/extension/test_integer.py
@@ -177,7 +177,14 @@ class TestReshaping(base.BaseReshapingTests):
 
 
 class TestGetitem(base.BaseGetitemTests):
-    pass
+
+    def test_getitem_scalar(self, data):
+        # we guarantee that we produce python ints
+        result = data[0]
+        assert isinstance(result, int)
+
+        result = pd.Series(data)[0]
+        assert isinstance(result, int)
 
 

jreback added the Dtype Conversions, Difficulty Intermediate, and ExtensionArray labels on Oct 12, 2018
jreback added this to the Contributions Welcome milestone on Oct 12, 2018
jorisvandenbossche (Member, Author) commented

@jreback that changes the output of accessing a single item to python int instead of numpy int. I was actually thinking we should ensure the reductions return a numpy scalar instead of python int to be consistent with accessing items.
But of course, the other way around, as in your patch, is also a possibility, although then IntegerDtype.type would no longer be correct.

And you would still have the inconsistency in the return type (python int vs np.float64) depending on the reduction.
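
A minimal sketch of that alternative, assuming a non-NA reduction result and a known numpy dtype backing the data (numpy_dtype below is an illustrative stand-in, not pandas API):

import numpy as np

def wrap_reduction_result(result, numpy_dtype=np.dtype('int64')):
    # return a numpy scalar of the array's dtype instead of a python int,
    # so reductions are consistent with single-element access
    int_result = int(result)
    if int_result == result:
        return numpy_dtype.type(int_result)   # e.g. numpy.int64(4)
    return np.float64(result)                 # non-integer results stay float

type(wrap_reduction_result(4.0))   # numpy.int64, matching type(series[0])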

TomAugspurger (Contributor) commented

I'm facing a similar issue with SparseArray.fill_value. With something like

>>> arr = SparseArray([0, 1], fill_value=0)
>>> isinstance(arr[0], arr.dtype.type)

the check is false, since arr[0] is a python int, while the scalar .dtype.type is np.int64 (so that arr[1] correctly is an np.int64).

Having these scalar types be numpy integers is more consistent with the way things are currently, but do we want to be bound to that going forward?

jorisvandenbossche (Member, Author) commented

So that indeed gives this inconsistency:

In [2]: arr = pd.SparseArray([0, 1], fill_value=0)

In [4]: type(arr[0])
Out[4]: int

In [5]: type(arr[1])
Out[5]: numpy.int64

which is bad, I would say.
We could enforce that the fill_value should be of the scalar type of the array? (Or, in the above case, convert the 0 to np.int64(0).) Or would this have many consequences in other places where the fill_value is used?
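
A minimal sketch of that conversion, using the dtype's subtype as the source of the scalar type (where exactly the coercion would hook into SparseArray is left open):

import numpy as np

def coerce_fill_value(fill_value, subtype=np.dtype('int64')):
    # convert e.g. the python int 0 to np.int64(0), so that arr[0]
    # (the fill) and arr[1] (a stored value) have the same type
    return subtype.type(fill_value)

type(coerce_fill_value(0))   # numpy.int64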

TomAugspurger (Contributor) commented

> We could enforce that the fill_value should be of the scalar type of the array?

I have a WIP doing that (well, making sure that the dtype matches), and this came up.

There are some issues with string dtypes, but overall I think it'll be fine to require that the scalar matches.
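
A sketch of the kind of check such a requirement implies (names hypothetical; the actual WIP may differ):

import numpy as np

def validate_fill_value(fill_value, subtype=np.dtype('int64')):
    # require the scalar to match the array's subtype instead of
    # silently keeping e.g. a python int
    if not isinstance(fill_value, subtype.type):
        raise TypeError('fill_value {!r} is not an instance of {}'.format(
            fill_value, subtype.type))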

mroeschke added the Bug label on Jun 23, 2021
jbrockmendel added the NA - MaskedArrays label on Dec 21, 2021
mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022