PERF: Reject non-string object arrays faster in factorize #51921

Merged: 1 commit, Mar 13, 2023

Conversation

lithomas1
Member

@lithomas1 lithomas1 commented Mar 13, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

A little more than a 10% improvement here. I'm not posting ASV results, since the other benchmarks are really flaky.

The idea is that we can reject an array as soon as we hit the first non-string entry, whereas infer_dtype goes through the entire array, wasting comparisons.
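To illustrate the difference, here is a minimal pure-Python sketch. It is not the actual pandas code: `is_all_strings_full_scan` and `is_all_strings_early_exit` are hypothetical helpers that stand in for a full inference pass and an early-exit string check, and each returns the number of elements it inspected.

```python
def is_all_strings_full_scan(values):
    """Hypothetical stand-in for a full pass: inspect every element, then decide."""
    checks = 0
    non_strings = 0
    for v in values:
        checks += 1
        if not isinstance(v, str):
            non_strings += 1
    return non_strings == 0, checks


def is_all_strings_early_exit(values):
    """Hypothetical early-exit check: stop at the first non-string element."""
    checks = 0
    for v in values:
        checks += 1
        if not isinstance(v, str):
            return False, checks
    return True, checks


# One non-string at the front, followed by 999 strings.
mixed = [1] + ["a"] * 999
print(is_all_strings_full_scan(mixed))   # (False, 1000)
print(is_all_strings_early_exit(mixed))  # (False, 1)
```

When every element is a string, both versions inspect the whole array, so the saving only shows up on arrays that get rejected.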

@lithomas1 lithomas1 force-pushed the faster-factorize-object branch from 8712ed3 to ed77ca3 Compare March 13, 2023 00:29
@lithomas1 lithomas1 added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 13, 2023
@lithomas1 lithomas1 marked this pull request as ready for review March 13, 2023 02:13
Contributor

@topper-123 topper-123 left a comment

Nice. LGTM.

@topper-123 topper-123 added this to the 2.1 milestone Mar 13, 2023
@mroeschke mroeschke merged commit f122e2e into pandas-dev:main Mar 13, 2023
@mroeschke
Member

Thanks @lithomas1

@lithomas1 lithomas1 deleted the faster-factorize-object branch March 14, 2023 12:15