Skip to content

REF: EA value_counts -> _value_counts #30673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

jbrockmendel
Copy link
Member

Instead of returning a Series, return a tuple with the index and values to be passed to Series.

Where possible I've changed the methods to use _values_for_factorized in the hopes of converging on a base class implementation. This is proving elusive, suggestions welcome. cc @TomAugspurger @jorisvandenbossche

xref #22843, #23074.

@jbrockmendel
Copy link
Member Author

@simonjayhawkins suggestions for making mypy happy here?

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 4, 2020
@jreback
Copy link
Contributor

jreback commented Jan 4, 2020

does this make possible EA.value_counts?

@jbrockmendel
Copy link
Member Author

does this make possible EA.value_counts?

Still trying to figure out what the base class implementation looks like. I think it involves _values_for_factorized, but that seems to have inconsistent copy/view semantics.

@simonjayhawkins
Copy link
Member

@simonjayhawkins suggestions for making mypy happy here?

adding

    def __init__(self, values, freq=None, dtype=None, copy=False):
        pass

to DatetimeLikeArrayMixin works. may want to make DatetimeLikeArrayMixin an abstract base class instead of mixin. The docstring states "Shared Base/Mixin class for DatetimeArray, TimedeltaArray, PeriodArray".

@jorisvandenbossche
Copy link
Member

I think it involves _values_for_factorized, but that seems to have inconsistent copy/view semantics.

Can you clarify what you mean with this?


Personally, I would even go for removing (_)value_counts entirely from the ExtensionArray (and move the logic to algorithms.value_counts. But that is of course only possible if it can be implemented generically based on _values_for_factorize (which right now is not yet possible?)

@jbrockmendel
Copy link
Member Author

I think it involves _values_for_factorized, but that seems to have inconsistent copy/view semantics.

Can you clarify what you mean with this?

Suppose we just want to go through the existing implementations and substitute in _values_for_factorize wherever feasible:

  • DTA/TDA/PA use i8values, which matches _values_for_factorize()[0]
  • BooleanArray/IntegerArray calls goes through the Index constructor, so not immediately obvious
  • Categorical uses np.arange
  • IntervallArray is equivalent to using _values_for_factorize.
  • StringArray uses self._ndarray, which is similar to _values_for_factorize, but _values_for_factorize makes a copy and then masks

_values_for_factorize copy/view behavior:

  • BooleanArray/IntegerArray does an "astype" followed by masking, so we get a copy
  • Categorical does an "astype", so we get a copy
  • DTA/TDA/PA -> view
  • PandasArray -> view
  • StringArray -> copy+mask
  • Sparse -> calls np.asarray, copy vs view varies.

@jorisvandenbossche
Copy link
Member

And how is the copy/view semantic important for a value_counts implementation?

@TomAugspurger
Copy link
Contributor

I'd also be happy to see ExtensionArray.value_counts go away if possible, in favor of pd.value_counts(values).

@jreback
Copy link
Contributor

jreback commented Jan 6, 2020

I'd also be happy to see ExtensionArray.value_counts go away if possible, in favor of pd.value_counts(values).

really? that is a giant step backwards for the api
virtually everything is a method, so now you want top level functions? where did this come from

@jbrockmendel
Copy link
Member Author

And how is the copy/view semantic important for a value_counts implementation?

If the existing implementations dont make a copy and the _values_for_factorize-based ones do, thats a performance hit that im not eager to take. Since I don't fully understand the distinction between _values_for_factorize vs _values_for_argsort, I find it worth holding off on implementing a general version until I find the right attribute.

@jorisvandenbossche
Copy link
Member

really? that is a giant step backwards for the api

The method that users can use is pd.Series.value_counts. For that to work, the EA does not need to have a value_counts method.

@jorisvandenbossche
Copy link
Member

Since I don't fully understand the distinction between _values_for_factorize vs _values_for_argsort, I find it worth holding off on implementing a general version until I find the right attribute.

Since value_counts is doing a factorization (with counting included), it's certainly _values_for_factorize we should be using for this, and not _values_for_argsort, I would think.

Anyway, regardless of the "possible general implementation" discussion, I think this PR is a step forward, so fine with first focusing on the content right now in this PR.

@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Jan 6, 2020
@TomAugspurger
Copy link
Contributor

Though in its current state, the PR breaks API right? Won'tCategorical.value_counts raise an AttributeError right now?

@jreback
Copy link
Contributor

jreback commented Jan 7, 2020

what is the reason we are so against an EA.value_counts() method? We have one now, and IIRC you didn't want this on the base class, though this PR actually makes that a very simpl impl now.

@jbrockmendel
Copy link
Member Author

what is the reason we are so against an EA.value_counts() method?

For me its about dependency structure. I don't want our EAs depending on Series/DataFrame/Index (and want to change the handful of places that they currently do)

@jbrockmendel
Copy link
Member Author

Though in its current state, the PR breaks API right? Won'tCategorical.value_counts raise an AttributeError right now?

Yes. Easy to reinstate i guess

@jorisvandenbossche
Copy link
Member

what is the reason we are so against an EA.value_counts() method?

For me its about dependency structure. I don't want our EAs depending on Series/DataFrame/Index (and want to change the handful of places that they currently do)

Yes, for me the same. And next to the actual code dependency structure, there is also the mental model: for me, Array is something independent of Series/Index, while Series/Index holds arrays. It's biizarre to me for an array method to return a Series.

Also, I doubt anyone is actually using this. Users have Series.value_counts available for this.

@jreback
Copy link
Contributor

jreback commented Jan 8, 2020

@jbrockmendel can you provide a depretion for .value_counts()?

@TomAugspurger TomAugspurger modified the milestones: 1.0, 1.1 Jan 9, 2020
@jbrockmendel
Copy link
Member Author

closing to clear the queue, will revisit after indexing fixes are done

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche I'm now trying to implement more ExtensionIndex methods in terms of the backing EA, am having trouble similar to the trouble with value_counts. Did you ever try to implement this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants