REF: EA value_counts -> _value_counts #30673

jbrockmendel · 2020-01-04T03:15:35Z

Instead of returning a Series, return a tuple with the index and values to be passed to Series.

Where possible I've changed the methods to use _values_for_factorize in the hopes of converging on a base class implementation. This is proving elusive; suggestions welcome. cc @TomAugspurger @jorisvandenbossche
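A minimal sketch of the shape described above, assuming a toy array class: the array returns plain (index, values) arrays and the caller builds the Series. `ToyArray` and `series_value_counts` are hypothetical names for illustration and are not pandas code.

```python
import numpy as np
import pandas as pd


class ToyArray:
    """Hypothetical stand-in for an ExtensionArray; not pandas code."""

    def __init__(self, values):
        self._data = np.asarray(values)

    def _value_counts(self):
        # Return plain arrays: the unique values and how often each occurs.
        # Nothing here needs to construct a Series or an Index.
        uniques, counts = np.unique(self._data, return_counts=True)
        return uniques, counts


def series_value_counts(arr):
    # The container, not the array, is responsible for building the Series.
    index, values = arr._value_counts()
    return pd.Series(values, index=index)


result = series_value_counts(ToyArray([3, 1, 3, 3, 1]))
```

With this split, the array side only produces arrays, which is the property the refactor is after.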

xref #22843, #23074.


jbrockmendel · 2020-01-04T17:31:37Z

@simonjayhawkins suggestions for making mypy happy here?

jreback · 2020-01-04T17:44:14Z

does this make possible EA.value_counts?

jbrockmendel · 2020-01-04T17:47:29Z

> does this make possible EA.value_counts?

Still trying to figure out what the base class implementation looks like. I think it involves _values_for_factorize, but that seems to have inconsistent copy/view semantics.

simonjayhawkins · 2020-01-05T11:31:38Z

> @simonjayhawkins suggestions for making mypy happy here?

adding

    def __init__(self, values, freq=None, dtype=None, copy=False):
        pass

to DatetimeLikeArrayMixin works. We may want to make DatetimeLikeArrayMixin an abstract base class instead of a mixin. The docstring states "Shared Base/Mixin class for DatetimeArray, TimedeltaArray, PeriodArray".
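A rough sketch of the abstract-base-class alternative, under the assumption that the type-checker complaint comes from shared code calling `type(self)(...)` with arguments the base does not declare. `DatetimeLikeBase`, `ToyTimedeltaArray`, and `_from_values` are hypothetical names, not the pandas classes.

```python
from abc import ABC, abstractmethod


class DatetimeLikeBase(ABC):
    """Hypothetical abstract base; illustrative only, not pandas code."""

    # Declared so shared methods can refer to them on the base type.
    freq: object
    dtype: object

    @abstractmethod
    def __init__(self, values, freq=None, dtype=None, copy=False):
        ...

    def _from_values(self, values):
        # The constructor signature is declared on the base, so this call
        # matches a signature the checker can see.
        return type(self)(values, freq=self.freq, dtype=self.dtype)


class ToyTimedeltaArray(DatetimeLikeBase):
    def __init__(self, values, freq=None, dtype=None, copy=False):
        self._data = list(values) if copy else values
        self.freq = freq
        self.dtype = dtype


arr = ToyTimedeltaArray([1, 2], freq="D", dtype="m8[ns]")
new = arr._from_values([3])
```

Unlike the `pass` stub, the abstract `__init__` also prevents the base from being instantiated directly.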

jorisvandenbossche · 2020-01-06T07:42:06Z

> I think it involves _values_for_factorize, but that seems to have inconsistent copy/view semantics.

Can you clarify what you mean with this?

Personally, I would even go for removing (_)value_counts entirely from the ExtensionArray (and moving the logic to algorithms.value_counts). But that is of course only possible if it can be implemented generically based on _values_for_factorize (which right now is not yet possible?).

jbrockmendel · 2020-01-06T16:01:49Z

> > I think it involves _values_for_factorize, but that seems to have inconsistent copy/view semantics.
>
> Can you clarify what you mean with this?

Suppose we just want to go through the existing implementations and substitute in _values_for_factorize wherever feasible:

DTA/TDA/PA use i8values, which matches _values_for_factorize()[0]
BooleanArray/IntegerArray go through the Index constructor, so it is not immediately obvious
Categorical uses np.arange
IntervalArray is equivalent to using _values_for_factorize.
StringArray uses self._ndarray, which is similar to _values_for_factorize, but _values_for_factorize makes a copy and then masks

_values_for_factorize copy/view behavior:

BooleanArray/IntegerArray does an "astype" followed by masking, so we get a copy
Categorical does an "astype", so we get a copy
DTA/TDA/PA -> view
PandasArray -> view
StringArray -> copy+mask
Sparse -> calls np.asarray, copy vs view varies.
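To make the copy/view distinction in the list above concrete, here is a small NumPy-only sketch. It does not call the pandas methods themselves; it only demonstrates the three underlying operations (a `view`, an `astype` followed by masking, and `np.asarray`) on plain arrays.

```python
import numpy as np

# DTA/TDA/PA-style: reinterpret the datetime64 buffer as int64, no copy.
stamps = np.array(["2020-01-04", "2020-01-06"], dtype="M8[ns]")
as_i8 = stamps.view("i8")
view_shares = np.shares_memory(stamps, as_i8)

# BooleanArray/IntegerArray-style: astype allocates, then missing slots are masked.
data = np.array([1, 2, 3], dtype="int64")
mask = np.array([False, True, False])
as_obj = data.astype(object)
as_obj[mask] = np.nan
copy_shares = np.shares_memory(data, as_obj)

# np.asarray returns the same object for an ndarray input,
# but has to allocate a new array for other inputs such as lists.
same_object = np.asarray(data) is data
from_list = np.asarray([1, 2, 3])
```

Here `view_shares` is true and `copy_shares` is false, and the original `data` is left unchanged by the masking step because it happens on the copy.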


jorisvandenbossche · 2020-01-06T19:36:03Z

And how is the copy/view semantic important for a value_counts implementation?

TomAugspurger · 2020-01-06T20:52:53Z

I'd also be happy to see ExtensionArray.value_counts go away if possible, in favor of pd.value_counts(values).

jreback · 2020-01-06T21:02:15Z

> I'd also be happy to see ExtensionArray.value_counts go away if possible, in favor of pd.value_counts(values).

really? that is a giant step backwards for the api
virtually everything is a method, so now you want top level functions? where did this come from

jbrockmendel · 2020-01-06T21:03:20Z

> And how is the copy/view semantic important for a value_counts implementation?

If the existing implementations don't make a copy and the _values_for_factorize-based ones do, that's a performance hit that I'm not eager to take. Since I don't fully understand the distinction between _values_for_factorize vs _values_for_argsort, I find it worth holding off on implementing a general version until I find the right attribute.

jorisvandenbossche · 2020-01-06T21:06:38Z

> really? that is a giant step backwards for the api

The method that users can use is pd.Series.value_counts. For that to work, the EA does not need to have a value_counts method.
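For illustration, the user-facing path mentioned here looks like the following with a nullable integer array. The input values are made up for the example.

```python
import pandas as pd

# A Series backed by an extension array (nullable Int64).
ser = pd.Series(pd.array([1, 2, 2, None], dtype="Int64"))

# Users call the Series method; missing values are dropped by default.
counts = ser.value_counts()
```

The caller never touches a method on the array itself.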

jorisvandenbossche · 2020-01-06T21:09:32Z

> Since I don't fully understand the distinction between _values_for_factorize vs _values_for_argsort, I find it worth holding off on implementing a general version until I find the right attribute.

Since value_counts is doing a factorization (with counting included), it's certainly _values_for_factorize we should be using for this, and not _values_for_argsort, I would think.
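A minimal sketch of "factorization with counting included", using the public `pd.factorize` and `np.bincount` rather than any ExtensionArray internals. This is an illustration of the idea, not the implementation under discussion.

```python
import numpy as np
import pandas as pd

values = np.array(["a", "b", "a", None, "c", "a", "b"], dtype=object)

# factorize maps each value to an integer code; missing values get code -1.
codes, uniques = pd.factorize(values)

# Drop the missing-value sentinel, then count how often each code occurs.
valid = codes[codes >= 0]
counts = np.bincount(valid, minlength=len(uniques))

result = pd.Series(counts, index=uniques)
```

The uniques come out in order of first appearance, so any sorting by count would be a separate step.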

Anyway, regardless of the "possible general implementation" discussion, I think this PR is a step forward, so fine with first focusing on the content right now in this PR.

TomAugspurger · 2020-01-06T21:30:37Z

Though in its current state, the PR breaks API, right? Won't Categorical.value_counts raise an AttributeError right now?

jreback · 2020-01-07T00:35:49Z

what is the reason we are so against an EA.value_counts() method? We have one now, and IIRC you didn't want this on the base class, though this PR actually makes that a very simple impl now.

jbrockmendel · 2020-01-07T01:54:46Z

> what is the reason we are so against an EA.value_counts() method?

For me it's about dependency structure. I don't want our EAs depending on Series/DataFrame/Index (and want to change the handful of places that they currently do).

jbrockmendel · 2020-01-07T01:55:13Z

> Though in its current state, the PR breaks API, right? Won't Categorical.value_counts raise an AttributeError right now?

Yes. Easy to reinstate, I guess.

jorisvandenbossche · 2020-01-08T16:17:17Z

> > what is the reason we are so against an EA.value_counts() method?
>
> For me it's about dependency structure. I don't want our EAs depending on Series/DataFrame/Index (and want to change the handful of places that they currently do).

Yes, for me the same. And next to the actual code dependency structure, there is also the mental model: for me, Array is something independent of Series/Index, while Series/Index holds arrays. It's bizarre to me for an array method to return a Series.

Also, I doubt anyone is actually using this. Users have Series.value_counts available for this.

jreback · 2020-01-08T18:39:30Z

@jbrockmendel can you provide a deprecation for .value_counts()?
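One possible shape for such a deprecation, sketched on a toy class: the public method warns and forwards to the private one. `ToyArray` is hypothetical, the warning text is invented, and what the public method should return during the deprecation period is not settled in this thread; the toy simply forwards the tuple.

```python
import warnings


class ToyArray:
    """Hypothetical array class; not pandas code."""

    def __init__(self, values):
        self._values = list(values)

    def _value_counts(self):
        # Private implementation: returns (uniques, counts) as plain lists.
        counts = {}
        for v in self._values:
            counts[v] = counts.get(v, 0) + 1
        return list(counts), list(counts.values())

    def value_counts(self):
        # Public method kept for compatibility; it warns and forwards.
        warnings.warn(
            "ToyArray.value_counts is deprecated; use Series.value_counts instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self._value_counts()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    uniques, counts = ToyArray(["x", "y", "x"]).value_counts()
```

`stacklevel=2` makes the warning point at the caller's line rather than at the method body.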

jbrockmendel · 2020-01-18T03:24:36Z

closing to clear the queue, will revisit after indexing fixes are done

jbrockmendel · 2020-02-26T19:29:59Z

@jorisvandenbossche I'm now trying to implement more ExtensionIndex methods in terms of the backing EA, and am having trouble similar to the trouble with value_counts. Did you ever try to implement this?

jbrockmendel added 5 commits January 3, 2020 19:05

REF: EA value_counts -> _value_counts

6e15159

remove docsttringd out code

2a469d7

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

bc089db

…f-vcs

troubleshoot 32 bit build

e785b7e

restore cast

6140690

jreback added the ExtensionArray label (Extending pandas with custom dtypes or arrays.) Jan 4, 2020

jbrockmendel added 2 commits January 6, 2020 08:28

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

3775fbf

…f-vcs

mypy fixup

1857156

jorisvandenbossche added this to the 1.0 milestone Jan 6, 2020

TomAugspurger modified the milestones: 1.0, 1.1 Jan 9, 2020

jbrockmendel closed this Jan 18, 2020

jbrockmendel mentioned this pull request Mar 6, 2020

CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467

Merged

jbrockmendel deleted the ref-vcs branch September 21, 2020 18:21
