
API: IntegerArray reductions: data type of the scalar result? #23106


Open
jorisvandenbossche opened this issue Oct 12, 2018 · 5 comments
Labels: Bug, Dtype Conversions, ExtensionArray, NA - MaskedArrays


jorisvandenbossche (Member) commented Oct 12, 2018

Follow-up on #22762, which added _reduce to the ExtensionArray interface and an implementation for IntegerArray.

Currently, for that IntegerArray we special-case 'sum', 'min', 'max' (and 'prod') to make sure they return ints (because the underlying implementation can be based on a float array, depending on whether there are missing values).

# if we have a preservable numeric op,
# provide coercion back to an integer type if possible
elif name in ['sum', 'min', 'max', 'prod'] and notna(result):
int_result = int(result)
if int_result == result:
result = int_result
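
For context, here is the same coercion as a standalone numpy sketch (the values are illustrative; with missing values the reduction runs on a float array, hence the float intermediate):

import numpy as np

data = np.array([1.0, np.nan, 3.0])   # float-backed because of the missing value
result = np.nansum(data)              # 4.0, a numpy.float64

# coerce back to int when the value round-trips exactly
int_result = int(result)
if int_result == result:
    result = int_result               # 4, a plain python int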

That basic check has the following "problem":

  • Some reductions will return python integers, while others return numpy scalars. Accessing single elements also gives numpy scalars.
In [3]: pd.Series(pd.core.arrays.integer_array([1, None, 3])).mean()
Out[3]: 2.0

In [4]: type(_)
Out[4]: numpy.float64

In [5]: pd.Series(pd.core.arrays.integer_array([1, None, 3])).sum()
Out[5]: 4

In [6]: type(_)
Out[6]: int

In [7]: type(pd.Series(pd.core.arrays.integer_array([1, None, 3]))[0])
Out[7]: numpy.int64
jreback (Contributor) commented Oct 12, 2018

here is a patch, but some tests are failing:

(pandas) bash-3.2$ git diff
diff --git a/pandas/core/arrays/integer.py b/pandas/core/arrays/integer.py
index 9917045f2..12b396ffd 100644
--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -267,7 +267,13 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         if is_integer(item):
             if self._mask[item]:
                 return self.dtype.na_value
-            return self._data[item]
+            result = self._data[item]
+
+            if is_scalar(result):
+                # cast to python ints
+                result = int(result)
+            return result
+
         return type(self)(self._data[item], self._mask[item])
 
     def _coerce_to_ndarray(self):
diff --git a/pandas/tests/extension/test_integer.py b/pandas/tests/extension/test_integer.py
index 89c36bbe7..4bab4bd3f 100644
--- a/pandas/tests/extension/test_integer.py
+++ b/pandas/tests/extension/test_integer.py
@@ -177,7 +177,14 @@ class TestReshaping(base.BaseReshapingTests):
 
 
 class TestGetitem(base.BaseGetitemTests):
-    pass
+
+    def test_getitem_scalar(self, data):
+        # we guarantee that we produce python ints
+        result = data[0]
+        assert isinstance(result, int)
+
+        result = pd.Series(data)[0]
+        assert isinstance(result, int)
 
 

jreback added the Dtype Conversions, Difficulty Intermediate, and ExtensionArray labels on Oct 12, 2018
jreback added this to the Contributions Welcome milestone on Oct 12, 2018
jorisvandenbossche (Member, Author) commented

@jreback that changes the output of accessing a single item to python int instead of numpy int. I was actually thinking we should ensure the reductions return a numpy scalar instead of python int to be consistent with accessing items.
But of course, the other way around, as in your patch, is also a possibility, although then IntegerDtype.type would no longer be correct.

And you would still have the inconsistency in the return type (python int vs np.float64) depending on the reduction.
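
A minimal sketch of that alternative, assuming a non-NA reduction result and a known numpy dtype backing the data (numpy_dtype below is an illustrative stand-in, not pandas API):

import numpy as np

def wrap_reduction_result(result, numpy_dtype=np.dtype('int64')):
    # return a numpy scalar of the array's dtype instead of a python int,
    # so reductions are consistent with single-element access
    int_result = int(result)
    if int_result == result:
        return numpy_dtype.type(int_result)   # e.g. numpy.int64(4)
    return np.float64(result)                 # non-integer results stay float

type(wrap_reduction_result(4.0))   # numpy.int64, matching type(series[0])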

TomAugspurger (Contributor) commented

I'm facing a similar issue with SparseArray.fill_value. With something like

>>> arr = SparseArray([0, 1], fill_value=0)
>>> isinstance(arr[0], arr.dtype.type)

the check is false, since arr[0] is a python int, while the scalar .dtype.type is np.int64 (so that arr[1] correctly is an np.int64).

Having these scalar types be numpy integers is more consistent with the way things are currently, but do we want to be bound to that going forward?

jorisvandenbossche (Member, Author) commented

So that indeed gives this inconsistency:

In [2]: arr = pd.SparseArray([0, 1], fill_value=0)

In [4]: type(arr[0])
Out[4]: int

In [5]: type(arr[1])
Out[5]: numpy.int64

which is bad, I would say.
We could enforce that the fill_value should be of the scalar type of the array? (Or, in the above case, convert the 0 to np.int64(0).) Or would this have many consequences in other places where the fill_value is used?
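
A minimal sketch of that conversion, using the dtype's subtype as the source of the scalar type (where exactly the coercion would hook into SparseArray is left open):

import numpy as np

def coerce_fill_value(fill_value, subtype=np.dtype('int64')):
    # convert e.g. the python int 0 to np.int64(0), so that arr[0]
    # (the fill) and arr[1] (a stored value) have the same type
    return subtype.type(fill_value)

type(coerce_fill_value(0))   # numpy.int64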

TomAugspurger (Contributor) commented

> We could enforce that the fill_value should be of the scalar type of the array?

I have a WIP doing that (well, making sure that the dtype matches), and this came up.

There are some issues with string dtypes, but overall I think it'll be fine to require that the scalar matches.
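
A sketch of the kind of check such a requirement implies (names hypothetical; the actual WIP may differ):

import numpy as np

def validate_fill_value(fill_value, subtype=np.dtype('int64')):
    # require the scalar to match the array's subtype instead of
    # silently keeping e.g. a python int
    if not isinstance(fill_value, subtype.type):
        raise TypeError('fill_value {!r} is not an instance of {}'.format(
            fill_value, subtype.type))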

mroeschke added the Bug label on Jun 23, 2021
jbrockmendel added the NA - MaskedArrays label on Dec 21, 2021
mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022