Support uncertainties as EA Dtype #53970

Closed
Changes from 7 commits
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v2.1.0.rst
@@ -90,6 +90,7 @@ Other enhancements
- :meth:`MultiIndex.sortlevel` and :meth:`Index.sortlevel` gained a new keyword ``na_position`` (:issue:`51612`)
- :meth:`arrays.DatetimeArray.map`, :meth:`arrays.TimedeltaArray.map` and :meth:`arrays.PeriodArray.map` can now take a ``na_action`` argument (:issue:`51644`)
- :meth:`arrays.SparseArray.map` now supports ``na_action`` (:issue:`52096`).
- :meth:`arrays.BaseMaskedArray.take` now handles a non-NA fill value that compares unequal to itself, such as the ``ufloat`` NaN from the ``uncertainties`` package (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_)
- Add :meth:`diff()` and :meth:`round()` for :class:`Index` (:issue:`19708`)
- Add dtype of categories to ``repr`` information of :class:`CategoricalDtype` (:issue:`52179`)
- Added to the escape mode "latex-math" preserving without escaping all characters between "\(" and "\)" in formatter (:issue:`51903`)
@@ -107,6 +108,7 @@ Other enhancements
- :meth:`Categorical.from_codes` has gotten a ``validate`` parameter (:issue:`50975`)
- :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`)
- :meth:`DataFrameGroupby.agg` and :meth:`DataFrameGroupby.transform` now support grouping by multiple keys when the index is not a :class:`MultiIndex` for ``engine="numba"`` (:issue:`53486`)
- :meth:`GroupBy.first` now returns the first NA value when a series contains only NA values (which might not be ``np.nan``), instead of unconditionally returning ``np.nan`` (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_)
Member:

Not sure what this is describing.

Contributor Author:

This is describing a change of behavior of the GroupBy.first helper function to not assume that np.nan is a valid NA value for the EA in question. The idea of the change (which, if valid, should also be applied to GroupBy.last) is that if we cannot find any non-NA values in the array, but if there are NA values, the return value should be the first (or last) NA value in the array. I did this because I thought that my EA could only deal with NaN values that were UFloats (having a nominal value and a std_dev value). But as I traced through the code and observed things more closely, I see that the EA does have tolerance for np.nan as an element among UFloats, giving np.nan as the nominal value and zero as the std_dev. So I will remove this change (by being less strict in some of my own assumptions in checking EA consistency).
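The behavior described above can be sketched as a standalone helper (a minimal sketch mirroring the idea, not the pandas internals; the name first_non_na is hypothetical):

```python
import numpy as np
import pandas as pd

def first_non_na(arr):
    """Return the first non-NA element of an extension array.

    If every element is NA, return the first NA value itself rather
    than assuming np.nan is a valid NA for the array's dtype.
    """
    non_na = arr[pd.notna(arr)]
    if len(non_na):
        return non_na[0]
    na_vals = arr[pd.isna(arr)]
    if len(na_vals):
        return na_vals[0]  # preserve the EA's own NA representation
    return np.nan  # empty input: fall back to np.nan as before

print(first_non_na(pd.array([pd.NA, 3], dtype="Int64")))
```

For an all-NA ``Int64`` array this returns pd.NA, the array's own NA value, rather than np.nan.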

- :meth:`Series.explode` now supports pyarrow-backed list types (:issue:`53602`)
- :meth:`Series.str.join` now supports ``ArrowDtype(pa.string())`` (:issue:`53646`)
- :meth:`SeriesGroupby.agg` and :meth:`DataFrameGroupby.agg` now support passing in multiple functions for ``engine="numba"`` (:issue:`53486`)
@@ -552,6 +554,7 @@ Other
- Bug in :meth:`Series.map` when giving a callable to an empty series, the returned series had ``object`` dtype. It now keeps the original dtype (:issue:`52384`)
- Bug in :meth:`Series.memory_usage` when ``deep=True`` throw an error with Series of objects and the returned value is incorrect, as it does not take into account GC corrections (:issue:`51858`)
- Fixed incorrect ``__name__`` attribute of ``pandas._libs.json`` (:issue:`52898`)
- Change ``~tests/extension/test_boolean.py`` to use ``pd.NA`` instead of ``np.nan`` (following similar patterns in ``~tests/extension/test_integer.py`` and ``~tests/extension/test_float.py``), updating :func:`make_data`, :func:`data_missing`, :func:`data_missing_for_sorting`, :func:`data_for_grouping`, and :func:`_check_op` (see `PR <https://github.com/pandas-dev/pandas/pull/53970>`_)

.. ***DO NOT USE THIS SECTION***

4 changes: 3 additions & 1 deletion pandas/core/arrays/masked.py
@@ -870,7 +870,9 @@ def take(
# we only fill where the indexer is null
# not existing missing values
# TODO(jreback) what if we have a non-na float as a fill value?
if allow_fill and notna(fill_value):
# NaN with uncertainties is scalar but does not register as `isna`,
# so use fact that NaN != NaN
if allow_fill and notna(fill_value) and fill_value == fill_value:
Member:

how would you even get here with uncertainties? is your EA subclassing BaseMaskedArray?

Contributor Author:

The EA itself is a very vanilla EA. The test suite instantiates the EA in myriad ways for test purposes. In this case the EA gets instantiated as a MaskedArray (which derives from BaseMaskedArray) for testing MaskedArray functionality. I think the test suite has every right to rigorously instantiate EAs in every way imaginable, which it does.

Member:

I'm still confused here. The EA I see in hgrecco/pint-pandas#140 subclasses ExtensionArray but not BaseMaskedArray, so I don't see how the code change here would affect it.

Contributor Author:

OK... I looked more deeply into this when I got back to the computer where I did the original work (I hadn't uploaded some notes to the cloud). I'm pretty sure it's due to some conditions created when trying to merge two DataFrames that I dummied up in my notes. I'll create a test case that can go into the Pandas test suite. It won't depend on uncertainties, but if and when used with uncertainties, it should trigger the path in question. Should be done by Monday.

Contributor Author:

Well, that was a lot more challenging to reconstruct than I expected, but I re-created the condition. TL;DR: now that I use pd.NA as my NA value for my particular EA, this change that caught your eye is indeed unnecessary and I can revert it.

But, since you asked, here's the path I took initially. Starting with PintType we define this:

    @property
    def na_value(self):
        if HAS_UNCERTAINTIES:
            return self.ureg.Quantity(_ufloat_nan, self.units)
        else:
            return self.ureg.Quantity(np.nan, self.units)

The idea (before settling on pd.NA, which works without this particular change) was to use a NaN to which we could cast any numerical type (we cannot downcast uncertainties to float).

In the test code we set up data_missing and data_missing_for_sorting to use ufloat(np.nan, 0) as the NA value where appropriate. The test case test_groupby_agg_extension uses data_for_grouping, which is called with numeric_dtype as None, and this is where we start going off the rails.

Here's the data_for_grouping code:

@pytest.fixture
def data_for_grouping(numeric_dtype, USE_UNCERTAINTIES):
    a = 1.0
    b = 2.0**32 + 1
    c = 2.0**32 + 10
    if USE_UNCERTAINTIES:
        a = a + ufloat(0, 0)
        b = b + ufloat(0, 0)
        c = c + ufloat(0, 0)
        _n = _ufloat_nan
        numeric_dtype = None
    elif numeric_dtype:
        numeric_dtype = dtypemap.get(numeric_dtype, numeric_dtype)
        _n = np.nan
    else:
        _n = pd.NA
    return PintArray.from_1darray_quantity(
        ureg.Quantity(pd.array([b, b, _n, _n, a, a, b, c], dtype=numeric_dtype), ureg.m)
    )

The way the test code is run, because numeric_dtype is None we get:

(Pdb) data_for_grouping
<PintArray>
[4294967297.0, 4294967297.0, <NA>, <NA>, 1.0, 1.0, 4294967297.0, 4294967306.0]
Length: 8, dtype: pint[meter]

When we execute expected = df.iloc[[0, 2, 4, 7]] there comes a point where we come here (/Users/michael/Documents/GitHub/MichaelTiemannOSC/pandas-dev/pandas/core/internals/managers.py(646)<listcomp>()):

644  	        else:
645  	            new_blocks = [
646  ->	                blk.take_nd(
647  	                    indexer,
648  	                    axis=1,
649  	                    fill_value=(
650  	                        fill_value if fill_value is not None else blk.fill_value
651  	                    ),

With fill_value as None, we inherit the fill value from the PintArray, which is the uncertain NaN. Here's the stack trace down from there:

  pandas/core/internals/managers.py(645)reindex_indexer()
-> new_blocks = [
  pandas/core/internals/managers.py(646)<listcomp>()
-> blk.take_nd(
  pandas/core/internals/blocks.py(975)take_nd()
-> new_values = algos.take_nd(
  pandas/core/array_algos/take.py(115)take_nd()
-> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
  pint-pandas/pint_pandas/pint_array.py(523)take()
-> result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
  pandas/core/algorithms.py(1307)take()
-> result = take_nd(
  pandas/core/array_algos/take.py(115)take_nd()
-> return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
> pandas/core/arrays/masked.py(879)take()
-> result[fill_mask] = fill_value

In this particular case, because none of the indexes are -1, no filling is going to happen. But should we have things to fill, they'd be filled with the uncertain NaN. And because this code really doesn't want to add new NaNs to the array, I put this extra defensive code in. And that's how we hit this condition.

I'm going to take this extraneous test out (fill_value == fill_value). Does this answer your question?
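The guard being discussed can be reproduced outside pandas. Below is a minimal sketch: FakeUFloatNaN and is_usable_fill are hypothetical names standing in for uncertainties.ufloat(np.nan, 0) and the patched condition, respectively — not the real library or the pandas code.

```python
import numpy as np
import pandas as pd

class FakeUFloatNaN:
    """Stand-in for uncertainties.ufloat(np.nan, 0): a NaN-like scalar
    that pd.isna does not recognize as missing."""

    def __eq__(self, other):
        return False  # NaN semantics: never equal, not even to itself

    def __hash__(self):
        return 0

def is_usable_fill(fill_value):
    # Mirrors the patched guard: fill only when the value is notna
    # AND equal to itself (a NaN-like scalar fails the self-comparison).
    return bool(pd.notna(fill_value)) and bool(fill_value == fill_value)

print(is_usable_fill(1.5))              # True
print(is_usable_fill(np.nan))           # False: caught by notna
print(is_usable_fill(FakeUFloatNaN()))  # False: caught by self-comparison
```

The self-comparison relies on the IEEE 754 convention that NaN compares unequal to everything, including itself, which the uncertain-NaN scalar imitates via its ``__eq__``.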

fill_mask = np.asarray(indexer) == -1
result[fill_mask] = fill_value
mask = mask ^ fill_mask
5 changes: 4 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -3200,7 +3200,10 @@ def first(x: Series):
"""Helper function for first item that isn't NA."""
arr = x.array[notna(x.array)]
if not len(arr):
return np.nan
nan_arr = x.array[isna(x.array)]
if not len(nan_arr):
return np.nan
return nan_arr[0]
return arr[0]

if isinstance(obj, DataFrame):
2 changes: 1 addition & 1 deletion pandas/core/groupby/ops.py
@@ -587,7 +587,7 @@ def nkeys(self) -> int:

def get_iterator(
self, data: NDFrameT, axis: AxisInt = 0
) -> Iterator[tuple[Hashable, NDFrameT]]:
) -> Iterator[tuple[Hashable, NDFrameT]]: # Doesn't work with non-hashable EA types
Member:

pls remove

"""
Groupby iterator

7 changes: 4 additions & 3 deletions pandas/tests/extension/json/test_json.py
@@ -356,9 +356,10 @@ def test_groupby_extension_no_sort(self):
"""
super().test_groupby_extension_no_sort()

@pytest.mark.xfail(reason="GH#39098: Converts agg result to object")
def test_groupby_agg_extension(self, data_for_grouping):
super().test_groupby_agg_extension(data_for_grouping)
# GH#39098 is now fixed, so we don't need to XFAIL, nor subclass this test
# @pytest.mark.xpass(reason="GH#39098: Converts agg result to object")
# def test_groupby_agg_extension(self, data_for_grouping):
# super().test_groupby_agg_extension(data_for_grouping)


class TestArithmeticOps(BaseJSON, base.BaseArithmeticOpsTests):
10 changes: 5 additions & 5 deletions pandas/tests/extension/test_boolean.py
@@ -28,7 +28,7 @@


def make_data():
return [True, False] * 4 + [np.nan] + [True, False] * 44 + [np.nan] + [True, False]
return [True, False] * 4 + [pd.NA] + [True, False] * 44 + [pd.NA] + [True, False]
Member:

why is this necessary?

Contributor Author:

I made this change to track the identical changes in test_integer.py and test_floating.py. It does not matter whether the NA values in this case are pd.NA, np.nan, or a mixture, but I did think it better to make the three files consistent. I'm happy to revert this change if you prefer.
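The claimed equivalence is easy to check: for the masked boolean dtype, np.nan and pd.NA in the input both end up as the same missing marker (a quick sketch):

```python
import numpy as np
import pandas as pd

# Two constructions of the same masked BooleanArray, differing only in
# which NA sentinel appears in the input list.
a = pd.array([True, np.nan, False], dtype="boolean")
b = pd.array([True, pd.NA, False], dtype="boolean")

# Both inputs yield the same array; the missing slot is pd.NA either way.
print(a[1] is pd.NA and b[1] is pd.NA)    # True
print(pd.Series(a).equals(pd.Series(b)))  # True
```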

Member:

Agreed this should be removed. It's ideal to minimize extraneous changes



@pytest.fixture
@@ -48,7 +48,7 @@ def data_for_twos(dtype):

@pytest.fixture
def data_missing(dtype):
return pd.array([np.nan, True], dtype=dtype)
return pd.array([pd.NA, True], dtype=dtype)


@pytest.fixture
@@ -58,7 +58,7 @@ def data_for_sorting(dtype):

@pytest.fixture
def data_missing_for_sorting(dtype):
return pd.array([True, np.nan, False], dtype=dtype)
return pd.array([True, pd.NA, False], dtype=dtype)


@pytest.fixture
Expand All @@ -76,7 +76,7 @@ def na_value():
def data_for_grouping(dtype):
b = True
a = False
na = np.nan
na = pd.NA
return pd.array([b, b, na, na, a, a, b], dtype=dtype)


@@ -147,7 +147,7 @@ def _check_op(self, obj, op, other, op_name, exc=NotImplementedError):
expected = expected.astype("Float64")
if op_name == "__rpow__":
# for rpow, combine does not propagate NaN
expected[result.isna()] = np.nan
expected[result.isna()] = pd.NA
self.assert_equal(result, expected)
else:
with pytest.raises(exc):