
ENH: Implement masked algorithm for value_counts #54984


Merged
merged 5 commits into from
Sep 30, 2023


phofl
Member

@phofl phofl commented Sep 3, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This adapts the same methods we are using for unique and friends. We get a 5-10% performance improvement as the number of NAs grows.
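To illustrate the idea of a masked algorithm, here is a minimal pure-Python sketch. It is not the pandas implementation (which lives in the library's compiled hashtable code); the function name `masked_value_counts` and the use of `None` as a stand-in for `pd.NA` are assumptions made for illustration only. The point is that missing entries are identified through a separate boolean mask and tallied on their own, so they never have to be hashed alongside the real values.

```python
from collections import Counter


def masked_value_counts(values, mask, dropna=True):
    """Count values, treating positions where mask is True as missing.

    Hypothetical helper for illustration; not a pandas API.
    """
    counts = Counter()
    na_count = 0
    for value, is_na in zip(values, mask):
        if is_na:
            # Masked entries are tallied separately and never hashed.
            na_count += 1
        else:
            counts[value] += 1
    result = dict(counts)
    if not dropna and na_count:
        # None stands in for pd.NA in this sketch.
        result[None] = na_count
    return result


# The 0s under a True mask are placeholders, not real values.
values = [1, 2, 1, 0, 0]
mask = [False, False, False, True, True]
print(masked_value_counts(values, mask, dropna=False))
```

In pandas itself, the corresponding user-facing call would be something like `pd.Series([1, 2, 1, None, None], dtype="Int64").value_counts(dropna=False)`.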

@phofl phofl requested a review from WillAyd as a code owner September 3, 2023 17:54
@mroeschke mroeschke added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels Sep 5, 2023
@WillAyd
Member

WillAyd commented Sep 6, 2023

Does this close any open issues? Maybe #44946?

@phofl
Member Author

phofl commented Sep 7, 2023

There is an issue that covers all of our algorithms (mode is still missing); we can close it after I finish with that.

@phofl phofl added this to the 2.2 milestone Sep 10, 2023
Member

@WillAyd WillAyd left a comment


lgtm

@phofl
Member Author

phofl commented Sep 30, 2023

merging this

@phofl phofl merged commit 6f0cd8d into pandas-dev:main Sep 30, 2023
@phofl phofl deleted the 54984 branch September 30, 2023 18:19
@rhshadrach
Member

This patch may have induced a performance regression. If it was a necessary behavior change, this may have been expected and everything is okay.

Please check the links below. If any ASVs are parameterized, the combinations of parameters that a regression has been detected for appear as subbullets.

Subsequent benchmarks may have skipped some commits. The link below lists the commits that are between the two benchmark runs where the regression was identified.

c8e7a98...6f0cd8d

@phofl
Member Author

phofl commented Oct 21, 2023

I agree that the regressions look valid based on ASV, but two things point in another direction:

  • I cannot reproduce any of these regressions locally
  • This PR didn't touch the code path of isin at all

It's just very weird; I'm not sure what to do.

@rhshadrach
Member

Ah, I didn't think to check if a dependency changed. Will check that.

@phofl
Member Author

phofl commented Oct 21, 2023

This is mostly going through our own Cython code, but nothing that I’ve touched. It’s very weird.

phofl added a commit to phofl/pandas that referenced this pull request Oct 21, 2023
4 participants