ENH: Support EAs in Series.unstack #23284

Merged
merged 34 commits into from
Nov 7, 2018
Changes from 4 commits
ced299f
ENH: Support EAs in Series.unstack
TomAugspurger Oct 12, 2018
3b63fcb
release note
TomAugspurger Oct 22, 2018
756dde9
xfail
TomAugspurger Oct 22, 2018
90f84ef
spelling
TomAugspurger Oct 22, 2018
942db1b
lint
TomAugspurger Oct 22, 2018
36a4450
no copy
TomAugspurger Oct 23, 2018
ee330d6
Fixup decimal tests
TomAugspurger Oct 23, 2018
2fcaf4d
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 23, 2018
4f46364
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 23, 2018
e9498a1
update
TomAugspurger Oct 23, 2018
72b5a0d
handle names
TomAugspurger Oct 24, 2018
f6b2050
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
4d679cb
lint
TomAugspurger Oct 24, 2018
ff7aba7
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
91587cb
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
49bdb50
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
cf8ed73
handle DataFrame.unstack
TomAugspurger Oct 24, 2018
5902b5b
Slightly de-hackify
TomAugspurger Oct 24, 2018
17d3002
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 24, 2018
a75806a
docs, comments
TomAugspurger Oct 26, 2018
2397e89
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 26, 2018
8ed7c73
unxfail test
TomAugspurger Oct 26, 2018
b23234c
added benchmark
TomAugspurger Oct 26, 2018
29a6bb1
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Oct 29, 2018
19b7cfa
fix asv
TomAugspurger Oct 29, 2018
254fe52
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 5, 2018
2d78d42
CLN: remove dead code
TomAugspurger Nov 5, 2018
a9e6263
faster asv
TomAugspurger Nov 5, 2018
ca286f7
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
2f28638
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
967c674
API: decimal nan is na
TomAugspurger Nov 6, 2018
f6aa4b9
Merge remote-tracking branch 'upstream/master' into ea-unstack
TomAugspurger Nov 6, 2018
32bc3de
Revert "API: decimal nan is na"
TomAugspurger Nov 6, 2018
56e5f2f
Fixed sparse test
TomAugspurger Nov 6, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -724,6 +724,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).
Contributor

really? what does this change for Categorical?

Contributor Author

Previously Series[Categorical].unstack() returned DataFrame[object].

Now it'll be a DataFrame[Categorical], i.e. unstack() preserves the CategoricalDtype.

Contributor Author

Ah, I forgot. Previously, we internally went Categorical -> object -> Categorical. Now we avoid the intermediate conversion to object.

So the changes from 0.23.4 will be

  1. Series[category].unstack() avoids a conversion to object
  2. Series[Sparse].unstack is sparse (no intermediate conversion to dense)

Once DatetimeTZ is an ExtensionArray, we'll presumably preserve that as well. On 0.23.4, we convert to datetime64[ns]:

In [48]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [49]: ser = pd.Series(pd.date_range('2000', periods=3, tz="US/Central"), index=index)

In [50]: ser.unstack().dtypes
Out[50]:
0    datetime64[ns]
1    datetime64[ns]
dtype: object
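
For comparison, a minimal sketch of the Categorical case described above, assuming pandas 0.24 or later (where this change applies). The sample values `'x'` and `'y'` are made up for illustration; the index is the same one used in the tz-aware example.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])
ser = pd.Series(pd.Categorical(['x', 'y', 'x']), index=index)

unstacked = ser.unstack()

# Each column should keep the CategoricalDtype instead of becoming object;
# the ('B', 0) cell has no source row, so it is filled with a missing value.
print(unstacked.dtypes)
print(unstacked)
```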

Contributor

ok, this might need a larger note then


.. _whatsnew_0240.api.incompatibilities:

22 changes: 22 additions & 0 deletions pandas/core/reshape/reshape.py
@@ -344,6 +344,7 @@ def _unstack_multiple(data, clocs, fill_value=None):
    if isinstance(data, Series):
        dummy = data.copy()
        dummy.index = dummy_index

        unstacked = dummy.unstack('__placeholder__', fill_value=fill_value)
        new_levels = clevels
        new_names = cnames
@@ -399,6 +400,8 @@ def unstack(obj, level, fill_value=None):
        else:
            return obj.T.stack(dropna=False)
    else:
        if is_extension_array_dtype(obj.dtype):
            return unstack_extension_series(obj, level, fill_value)
        unstacker = _Unstacker(obj.values, obj.index, level=level,
                               fill_value=fill_value,
                               constructor=obj._constructor_expanddim)
@@ -947,3 +950,22 @@ def make_axis_dummies(frame, axis='minor', transform=None):
    values = values.take(labels, axis=0)

    return DataFrame(values, columns=items, index=frame.index)


def unstack_extension_series(series, level, fill_value):
Member

Can you move this function up to around line 424? It looks like this file has all unstack related code grouped together first, followed by stack code grouped together, so having unstack_extension_series at the bottom seems a little out of place.

    from pandas.core.reshape.concat import concat

    dummy_arr = np.arange(len(series))
    # fill_value=-1, since we will do a series.values.take later
    result = _Unstacker(dummy_arr, series.index,
                        level=level, fill_value=-1).get_result()

    out = []
    values = series.values

    for col, indices in result.iteritems():
        out.append(Series(values.take(indices.values,
                                      allow_fill=True,
                                      fill_value=fill_value),
                          name=col, index=result.index))
    return concat(out, axis='columns')
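
The idea behind `unstack_extension_series` is to unstack integer positions instead of the values, then use `take` to pull the values into place. Here is a minimal sketch of that idea against the public pandas API rather than the internal `_Unstacker`. The helper name `unstack_via_take` is hypothetical and not part of pandas or this PR, and the sketch assumes pandas 0.24 or later, where `Series.array` and `ExtensionArray.take(..., allow_fill=True)` are available.

```python
import numpy as np
import pandas as pd


def unstack_via_take(series, level=-1, fill_value=None):
    """Hypothetical helper: unstack positions, then take the real values."""
    # Unstack 0..n-1 instead of the values; -1 marks cells that have
    # no corresponding row in the input.
    positions = pd.Series(np.arange(len(series)), index=series.index)
    layout = positions.unstack(level=level, fill_value=-1)

    values = series.array
    out = []
    for col in layout.columns:
        indexer = layout[col].to_numpy()
        # allow_fill=True makes -1 mean "missing" rather than "last element".
        taken = values.take(indexer, allow_fill=True, fill_value=fill_value)
        out.append(pd.Series(taken, index=layout.index))

    result = pd.concat(out, axis=1)
    result.columns = layout.columns
    return result


index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])
ser = pd.Series(pd.Categorical(['x', 'y', 'x']), index=index)
result = unstack_via_take(ser)
print(result.dtypes)
```

Because the values only ever pass through `take`, the array type never has to round-trip through an object-dtype ndarray.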
38 changes: 38 additions & 0 deletions pandas/tests/extension/base/reshaping.py
@@ -1,3 +1,4 @@
import itertools
import pytest
import numpy as np

@@ -170,3 +171,40 @@ def test_merge(self, data, na_value):
[data[0], data[0], data[1], data[2], na_value],
dtype=data.dtype)})
self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])

    @pytest.mark.parametrize("index", [
        pd.MultiIndex.from_product(([['A', 'B'], ['a', 'b']])),
        pd.MultiIndex.from_product(([['A', 'B'], ['a', 'b'], ['x', 'y', 'z']])),

        # non-uniform
        pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'b')]),

        # three levels, non-uniform
        pd.MultiIndex.from_product([('A', 'B'), ('a', 'b', 'c'), (0, 1, 2)]),
        pd.MultiIndex.from_tuples([
            ('A', 'a', 1),
            ('A', 'b', 0),
            ('A', 'a', 0),
            ('B', 'a', 0),
            ('B', 'c', 1),
        ]),
    ])
    def test_unstack(self, data, index):
        data = data[:len(index)]
        ser = pd.Series(data, index=index)

        n = index.nlevels
        levels = list(range(n))
        # [0, 1, 2]
        # -> [(0,), (1,), (2,) (0, 1), (1, 0)]
Member

Shouldn't this be -> [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]? Not super important, but caused me a brief moment of confusion.

Contributor Author

Yes, you're correct.

        combinations = itertools.chain.from_iterable(
            itertools.permutations(levels, i) for i in range(1, n)
        )

        for level in combinations:
            result = ser.unstack(level=level)
            assert all(isinstance(result[col].values, type(data)) for col in result.columns)
            expected = ser.astype(object).unstack(level=level)
            result = result.astype(object)

            self.assert_frame_equal(result, expected)
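
To check the level combinations discussed in the review comment, the same `itertools` expression can be run on its own for a three-level index. This is a standalone sketch using only the standard library; `levels = [0, 1, 2]` is the example from the code comment.

```python
import itertools

levels = [0, 1, 2]
n = len(levels)

# Every ordered selection of 1 .. n-1 levels, as in test_unstack.
combinations = list(itertools.chain.from_iterable(
    itertools.permutations(levels, i) for i in range(1, n)
))

print(combinations)
# [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```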
5 changes: 4 additions & 1 deletion pandas/tests/extension/decimal/array.py
@@ -102,7 +102,10 @@ def copy(self, deep=False):
    def astype(self, dtype, copy=True):
        if isinstance(dtype, type(self.dtype)):
            return type(self)(self._data, context=dtype.context)
        return super(DecimalArray, self).astype(dtype, copy)
        # need to replace decimal NA
Contributor Author

TomAugspurger Oct 22, 2018

Series.equals doesn't treat Series([np.nan]) and Series([Decimal('NaN')]) as equal. I made this change mainly to make that comparison work.

        result = np.asarray(self, dtype=dtype)
        result[self.isna()] = np.nan
        return result

    def __setitem__(self, key, value):
        if pd.api.types.is_list_like(value):
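
The NA replacement in the new `astype` body can be illustrated outside of `DecimalArray`. In this rough sketch a plain list of `Decimal` values stands in for the array, and the mask is built with `Decimal.is_nan()` in place of `self.isna()`; both substitutions are assumptions made for illustration only.

```python
import decimal

import numpy as np

values = [decimal.Decimal('1.5'), decimal.Decimal('NaN'), decimal.Decimal('2')]

# Convert to an object ndarray, then overwrite the decimal NaNs with
# float NaN so that missing values match what NumPy-backed data uses.
result = np.asarray(values, dtype=object)
na_mask = np.array([v.is_nan() for v in values])
result[na_mask] = np.nan

print(result)
```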
6 changes: 5 additions & 1 deletion pandas/tests/extension/json/test_json.py
@@ -138,7 +138,11 @@ def test_from_dtype(self, data):


class TestReshaping(BaseJSON, base.BaseReshapingTests):
    pass
    @pytest.mark.xfail(reason="dict for NA", strict=True)
    def test_unstack(self, data, index):
        # The base test has NaN for the expected NA value.
        # this matches otherwise
        return super().test_unstack(data, index)


class TestGetitem(BaseJSON, base.BaseGetitemTests):