Skip to content

ENH: ExtensionArray.unique #19869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Mar 13, 2018
6 changes: 3 additions & 3 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
ABCSeries, ABCIndex,
ABCIndexClass, ABCCategorical)
from pandas.core.dtypes.common import (
is_array_like,
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
Expand Down Expand Up @@ -168,8 +169,7 @@ def _ensure_arraylike(values):
"""
ensure that we are arraylike if not already
"""
if not isinstance(values, (np.ndarray, ABCCategorical,
ABCIndexClass, ABCSeries)):
if not is_array_like(values):
inferred = lib.infer_dtype(values)
if inferred in ['mixed', 'string', 'unicode']:
if isinstance(values, tuple):
Expand Down Expand Up @@ -356,7 +356,7 @@ def unique(values):
# categorical is a fast-path
# this will coerce Categorical, CategoricalIndex,
# and category dtypes Series to same return of Category
if is_categorical_dtype(values):
if is_extension_array_dtype(values):
values = getattr(values, '.values', values)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, this getattr is doing nothing since it's '.values' and not 'values' :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and we have a test that depends on it! I'm going to split this off to a new issue then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nevermind. I think we can just remove it instead of correcting it.

return values.unique()

Expand Down
12 changes: 12 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,18 @@ def isna(self):
"""
raise AbstractMethodError(self)

def unique(self):
"""Compute the ExtensionArray of unique values.

Returns
-------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future PR should prob add some examples here :> (and other doc-strings).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True :) The only problem is that for ExtensionArray we don't have a direct working example, as you first need to subclass it (unless we use one of the existing ones like Categorical, but that also seems a bit strange)

uniques : ExtensionArray
"""
from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you should add a
self._shallow_copy or something instead of doing this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the copy or to avoid the type stuff?

In this specific case, I think a copy is necessary since uniques is an ndarray objects.

As for alternative constructors to avoid the type stuff, sure, just need to come up with names for them.

For my IPAddress stuff I want to be able to do zero-copy construction from

  • ExtensionArray
  • numpy structured array with the correct fields.

Putting those in separate constructors makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its the idiom , see below, should have a _constructor in EA which just returns type(self), which is what we do in Index. this handled the sub-classing things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback What do you mean with "this handled the sub-classing things"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because everywhere else where use _constructor as the constructor class and do not use type(self) directly (yes that is the implementation, but the code all uses _constructor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that, but it's not because we do it everywhere else internally that we need to do it here as well. Here we have something that is part of an external interface, and if adding something like that we should have a reason for it. And a clear message to the implementer what we expect from it.
Eg in Series/DataFrame you have interactions between different subclasses (a slice of DataFrame can give another Series subclass). That's not something we have to deal with here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just becomes confusing. I don't any reason not to have a _constructor, its obvious, consistent in the code. These ad-hoc breaks with internal consistency are just creating technical debt. Please let's have consistentcy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want me to change it to _constructor even though it'll be irrelevant after #19913 is resolved?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments on that PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this now should change to _construct_from_sequence, in fact should be try to be concistent and always use this.


# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,14 @@ def test_count(self, data_missing):
def test_apply_simple_series(self, data):
result = pd.Series(data).apply(id)
assert isinstance(result, pd.Series)

@pytest.mark.parametrize('box', [pd.Series, lambda x: x])
@pytest.mark.parametrize('method', [lambda x: x.unique(), pd.unique])
def test_unique(self, data, box, method):
duplicated = box(type(data)([data[0], data[0]]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this one also be _constructor_from_sequence? (as it is a generic test to be subclassed)


result = method(duplicated)

assert len(result) == 1
assert isinstance(result, type(data))
assert result[0] == duplicated[0]
7 changes: 7 additions & 0 deletions pandas/tests/extension/json/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,13 @@ def take(self, indexer, allow_fill=True, fill_value=None):
def copy(self, deep=False):
return type(self)(self.data[:])

def unique(self):
# Parent method doesn't work since np.array will try to infer
# a 2-dim object.
return type(self)([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._constructor rather than type(self) generally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would change this too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment, define _consturct_from_sequence

dict(x) for x in list(set(tuple(d.items()) for d in self.data))
])

@property
def _na_value(self):
return {}
Expand Down