ENH: Support EAs in Series.unstack #23284


Merged (34 commits, merged Nov 7, 2018)
Changes from 13 commits

Commits (34)
ced299f
ENH: Support EAs in Series.unstack
TomAugspurger Oct 12, 2018
3b63fcb
release note
TomAugspurger Oct 22, 2018
756dde9
xfail
TomAugspurger Oct 22, 2018
90f84ef
spelling
TomAugspurger Oct 22, 2018
942db1b
lint
TomAugspurger Oct 22, 2018
36a4450
no copy
TomAugspurger Oct 23, 2018
ee330d6
Fixup decimal tests
TomAugspurger Oct 23, 2018
2fcaf4d
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 23, 2018
4f46364
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 23, 2018
e9498a1
update
TomAugspurger Oct 23, 2018
72b5a0d
handle names
TomAugspurger Oct 24, 2018
f6b2050
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
4d679cb
lint
TomAugspurger Oct 24, 2018
ff7aba7
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
91587cb
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
49bdb50
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
cf8ed73
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
5902b5b
Slightly de-hackify
TomAugspurger Oct 24, 2018
17d3002
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
a75806a
docs, comments
TomAugspurger Oct 26, 2018
2397e89
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 26, 2018
8ed7c73
unxfail test
TomAugspurger Oct 26, 2018
b23234c
added benchmark
TomAugspurger Oct 26, 2018
29a6bb1
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 29, 2018
19b7cfa
fix asv
TomAugspurger Oct 29, 2018
254fe52
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 5, 2018
2d78d42
CLN: remove dead code
TomAugspurger Nov 5, 2018
a9e6263
faster asv
TomAugspurger Nov 5, 2018
ca286f7
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
2f28638
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
967c674
API: decimal nan is na
TomAugspurger Nov 6, 2018
f6aa4b9
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
32bc3de
Revert "API: decimal nan is na"
TomAugspurger Nov 6, 2018
56e5f2f
Fixed sparse test
TomAugspurger Nov 6, 2018
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -807,6 +807,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).

Contributor:

really? what does this change for Categorical?

Contributor Author:

Previously Series[Categorical].unstack() returned DataFrame[object].

Now it'll be a DataFrame[Categorical], i.e. unstack() preserves the CategoricalDtype.
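
As a rough, made-up illustration (not from the tests in this PR):

import pandas as pd

index = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b'), ('y', 'b')])
ser = pd.Series(pd.Categorical(['a', 'b', 'b']), index=index)

# 0.23.x: ser.unstack() produced object-dtype columns
# with this PR: each column of ser.unstack() keeps the category dtype
ser.unstack().dtypes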

Contributor Author:

Ah, I forgot. Previously, we internally went Categorical -> object -> Categorical. Now we avoid the intermediate conversion to object.

So the changes from 0.23.4 will be

  1. Series[category].unstack() avoids a conversion to object
  2. Series[Sparse].unstack is sparse (no intermediate conversion to dense)

Once DatetimeTZ is an ExtensionArray, we'll presumably preserve that as well. On 0.23.4, we convert to datetime64[ns]:

In [48]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [49]: ser = pd.Series(pd.date_range('2000', periods=3, tz="US/Central"), index=index)

In [50]: ser.unstack().dtypes
Out[50]:
0    datetime64[ns]
1    datetime64[ns]
dtype: object
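
And a rough sketch of point 2 from the list above (made-up data; the exact sparse dtype repr depends on the version):

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])
sparse_ser = pd.Series(pd.SparseArray([1.0, 2.0, 3.0]), index=index)

# 0.23.4: the reshape goes through a dense intermediate
# with this PR: the unstacked columns keep the sparse dtype
sparse_ser.unstack().dtypes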

Contributor:

ok, this might need a larger note then


.. _whatsnew_0240.api.incompatibilities:

@@ -974,6 +975,7 @@ Categorical
- Bug when indexing with a boolean-valued ``Categorical``. Now a boolean-valued ``Categorical`` is treated as a boolean mask (:issue:`22665`)
- Constructing a :class:`CategoricalIndex` with empty values and boolean categories was raising a ``ValueError`` after a change to dtype coercion (:issue:`22702`).
- Bug in :meth:`Categorical.take` with a user-provided ``fill_value`` not encoding the ``fill_value``, which could result in a ``ValueError``, incorrect results, or a segmentation fault (:issue:`23296`).
- In :meth:`Series.unstack`, specifying a ``fill_value`` not present in the categories now raises a ``TypeError`` rather than ignoring the ``fill_value`` (:issue:`23284`)

Datetimelike
^^^^^^^^^^^^
49 changes: 49 additions & 0 deletions pandas/core/reshape/reshape.py
@@ -344,6 +344,7 @@ def _unstack_multiple(data, clocs, fill_value=None):
if isinstance(data, Series):
dummy = data.copy()
dummy.index = dummy_index

unstacked = dummy.unstack('__placeholder__', fill_value=fill_value)
new_levels = clevels
new_names = cnames
@@ -399,6 +400,8 @@ def unstack(obj, level, fill_value=None):
else:
return obj.T.stack(dropna=False)
else:
if is_extension_array_dtype(obj.dtype):
return _unstack_extension_series(obj, level, fill_value)
unstacker = _Unstacker(obj.values, obj.index, level=level,
fill_value=fill_value,
constructor=obj._constructor_expanddim)
@@ -419,6 +422,52 @@ def _unstack_frame(obj, level, fill_value=None):
return unstacker.get_result()


def _unstack_extension_series(series, level, fill_value):
"""
Unstack an ExtensionArray-backed Series.

The ExtensionDtype is preserved.

Parameters
----------
series : Series
A Series with an ExtensionArray for values
level : Any
The level name or number.
fill_value : Any
The user-level (not physical storage) fill value to use for
missing values introduced by the reshape. Passed to
``series.values.take``.

Returns
-------
DataFrame
Each column of the DataFrame will have the same dtype as
the input Series.
"""
# Implementation note: the basic idea is to
# 1. Do a regular unstack on a dummy array of integers
# 2. Followup with a columnwise take.
# We use the dummy take to discover newly-created missing values
# introduced by the reshape.
from pandas.core.reshape.concat import concat

dummy_arr = np.arange(len(series))
# fill_value=-1, since we will do a series.values.take later
result = _Unstacker(dummy_arr, series.index,
level=level, fill_value=-1).get_result()

out = []
values = series.values

for col, indices in result.iteritems():
out.append(Series(values.take(indices.values,
allow_fill=True,
fill_value=fill_value),
name=col, index=result.index))
return concat(out, axis='columns', copy=False, keys=result.columns)


def stack(frame, level=-1, dropna=True):
"""
Convert DataFrame to Series with multi-level Index. Columns become the
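
For context on the diff above, a rough standalone approximation of what ``_unstack_extension_series`` does (dummy integer unstack, then a columnwise ``take``), written against public APIs; ``unstack_ea_sketch`` is an illustrative name, not the actual helper:

import numpy as np
import pandas as pd

def unstack_ea_sketch(ser, level=-1, fill_value=None):
    # 1. unstack a dummy array of positional integers; -1 marks the
    #    missing slots introduced by the reshape
    positions = pd.Series(np.arange(len(ser)), index=ser.index).unstack(
        level, fill_value=-1)

    values = ser.values  # the ExtensionArray backing the Series
    columns = {
        col: pd.Series(values.take(positions[col].values,
                                   allow_fill=True,
                                   fill_value=fill_value),
                       index=positions.index)
        for col in positions.columns
    }
    # 2. reassemble column by column so every column keeps the EA dtype
    return pd.concat(columns, axis='columns')
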
41 changes: 41 additions & 0 deletions pandas/tests/extension/base/reshaping.py
@@ -1,3 +1,4 @@
import itertools
import pytest
import numpy as np

@@ -170,3 +171,43 @@ def test_merge(self, data, na_value):
[data[0], data[0], data[1], data[2], na_value],
dtype=data.dtype)})
self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])

@pytest.mark.parametrize("index", [
pd.MultiIndex.from_product(([['A', 'B'], ['a', 'b']]),
names=['a', 'b']),
pd.MultiIndex.from_product(([['A', 'B'], ['a', 'b'],
['x', 'y', 'z']])),

# non-uniform
pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'b')]),

# three levels, non-uniform
pd.MultiIndex.from_product([('A', 'B'), ('a', 'b', 'c'), (0, 1, 2)]),
pd.MultiIndex.from_tuples([
('A', 'a', 1),
('A', 'b', 0),
('A', 'a', 0),
('B', 'a', 0),
('B', 'c', 1),
]),
])
def test_unstack(self, data, index):
data = data[:len(index)]
ser = pd.Series(data, index=index)

n = index.nlevels
levels = list(range(n))
# [0, 1, 2]
# [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
combinations = itertools.chain.from_iterable(
itertools.permutations(levels, i) for i in range(1, n)
)

for level in combinations:
result = ser.unstack(level=level)
assert all(isinstance(result[col].values, type(data))
for col in result.columns)
expected = ser.astype(object).unstack(level=level)
result = result.astype(object)

self.assert_frame_equal(result, expected)
2 changes: 1 addition & 1 deletion pandas/tests/extension/decimal/array.py
@@ -102,7 +102,7 @@ def copy(self, deep=False):
def astype(self, dtype, copy=True):
if isinstance(dtype, type(self.dtype)):
return type(self)(self._data, context=dtype.context)
return super(DecimalArray, self).astype(dtype, copy)
return np.asarray(self, dtype=dtype)

Contributor:

OT: how much testing do we have on this?

Contributor Author:

Which bit specifically? We have a base test for astype to object and series.

I'm not actually sure why I changed this...


def __setitem__(self, key, value):
if pd.api.types.is_list_like(value):
23 changes: 19 additions & 4 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -1,5 +1,6 @@
import operator
import decimal
import math
import operator

import numpy as np
import pandas as pd
@@ -63,9 +64,23 @@ def data_for_grouping():
class BaseDecimal(object):

def assert_series_equal(self, left, right, *args, **kwargs):

left_na = left.isna()
right_na = right.isna()
def convert(x):
# need to convert array([Decimal(NaN)], dtype='object') to np.NaN

Contributor:

hmm, we could add this support I think though, create an issue?

Contributor Author:

Right now we have a pretty clear answer of "decimal isn't supported". I'm hesitant to even partially support it :)

# because Series[object].isnan doesn't recognize decimal(NaN) as

Contributor:

where does this come up now? e.g. what's an example?

Contributor Author:

In [21]: import pandas as pd

In [22]: import pandas.util.testing as tm

In [23]: from pandas.tests.extension.decimal import to_decimal

In [24]: ser = pd.Series(to_decimal(['1.0', 'NaN']))

In [25]: tm.assert_series_equal(ser.astype(object), ser.astype(object))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-25-5356adf50e72> in <module>
----> 1 tm.assert_series_equal(ser.astype(object), ser.astype(object))

~/sandbox/pandas/pandas/util/testing.py in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1293                                      check_less_precise=check_less_precise,
   1294                                      check_dtype=check_dtype,
-> 1295                                      obj='{obj}'.format(obj=obj))
   1296
   1297     # metadata comparison

~/sandbox/pandas/pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()
     64
     65
---> 66 cpdef assert_almost_equal(a, b,
     67                           check_less_precise=False,
     68                           bint check_dtype=True,

~/sandbox/pandas/pandas/_libs/testing.pyx in pandas._libs.testing.assert_almost_equal()
    178             msg = '{0} values are different ({1} %)'.format(
    179                 obj, np.round(diff * 100.0 / na, 5))
--> 180             raise_assert_detail(obj, msg, lobj, robj)
    181
    182         return True

~/sandbox/pandas/pandas/util/testing.py in raise_assert_detail(obj, message, left, right, diff)
   1080         msg += "\n[diff]: {diff}".format(diff=diff)
   1081
-> 1082     raise AssertionError(msg)
   1083
   1084

AssertionError: Series are different

Series values are different (50.0 %)
[left]:  [1.0, NaN]
[right]: [1.0, NaN]

We do the astype(object) to build the expected.

Contributor:

I think this is a bug, and fixing it would solve the above.

In [7]: ser.astype(object).isna()
Out[7]: 
0    False
1    False
dtype: bool

Member:

@TomAugspurger actually, can you explain why the current code in master is not working fine? Why do you need to convert to object? Because before, there were already calls to isna to check NaNs and non-NaNs separately.

Contributor:

this is my point. I think there is a bug somewhere here, e.g. isna is maybe not dispatching to the EA?

Contributor Author:

So it's all about creating the expected result. When we go to do the final assert that the values match, we do

            result = result.astype(object)

            self.assert_frame_equal(result, expected)

but self.assert_frame_equal will say that Series([Decimal('NaN')], dtype='object') isn't equal to itself, since it doesn't consider that value NA.
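
A compact illustration of that (mirroring the traceback above):

import decimal
import pandas as pd

s = pd.Series([decimal.Decimal('NaN')], dtype=object)

s.isna()                # all False: object-dtype isna doesn't treat Decimal('NaN') as NA
s.iloc[0] == s.iloc[0]  # False: so the element-wise value comparison fails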

Member:

Ah, OK, I somehow thought you were doing the astype(object) inside the assert testing machinery above, not in the actual expected result. Yes, that makes sense now.

For me it is fine to keep this hack in here for now. In the end, that is somewhat the purpose of using the class instances for assert_.._equal: a specific EA can override it.

Contributor:

I think the .astype(object) for Decimal is incorrect.

In [5]: ser.values
Out[5]: DecimalArray(array([Decimal('1.0'), Decimal('NaN')], dtype=object))

In [6]: ser.astype(object)
Out[6]: 
0    1.0
1    NaN
dtype: object

In [7]: ser.astype(object).values
Out[7]: array([Decimal('1.0'), Decimal('NaN')], dtype=object)

I think these should be converted to np.nan and not Decimal('NaN'), as this is just a numpy array.

Contributor Author:

object-dtype can store anything, including decimal objects. It'd be strange to only convert Decimal("NaN") to np.nan, and not Decimal('1.0') to 1.0, no?

Member:

Yeah, this seems correct to me.

Going to merge; if we need to discuss this further, we can do that in another issue (it's not really related anymore to actually fixing unstack).

# NA.
try:
return math.isnan(x)
except TypeError:
return False

if left.dtype == 'object':
left_na = left.apply(convert)
else:
left_na = left.isna()
if right.dtype == 'object':
right_na = right.apply(convert)
else:
right_na = right.isna()

tm.assert_series_equal(left_na, right_na)
return tm.assert_series_equal(left[~left_na],
6 changes: 5 additions & 1 deletion pandas/tests/extension/json/test_json.py
@@ -138,7 +138,11 @@ def test_from_dtype(self, data):


class TestReshaping(BaseJSON, base.BaseReshapingTests):
pass
@pytest.mark.xfail(reason="dict for NA", strict=True)
def test_unstack(self, data, index):
# The base test has NaN for the expected NA value.
# this matches otherwise
return super().test_unstack(data, index)


class TestGetitem(BaseJSON, base.BaseGetitemTests):
10 changes: 6 additions & 4 deletions pandas/tests/frame/test_reshape.py
@@ -303,7 +303,8 @@ def test_unstack_fill_frame_categorical(self):
# Test unstacking with categorical
data = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
data.index = pd.MultiIndex.from_tuples(
[('x', 'a'), ('x', 'b'), ('y', 'b'), ('z', 'a')])
[('x', 'a'), ('x', 'b'), ('y', 'b'), ('z', 'a')],
)

# By default missing values will be NaN
result = data.unstack()
@@ -314,9 +315,10 @@
index=list('xyz'))
assert_frame_equal(result, expected)

# Fill with non-category results in NaN entries similar to above
result = data.unstack(fill_value='d')
assert_frame_equal(result, expected)
# Fill with non-category results in a TypeError
msg = r"'fill_value' \('d'\) is not in"
with tm.assert_raises_regex(TypeError, msg):
data.unstack(fill_value='d')

# Fill with category value replaces missing values as expected
result = data.unstack(fill_value='c')