PERF: Performance improvement value_counts for masked arrays #48338

phofl · 2022-08-31T21:44:46Z

closes #xxxx (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

import pandas as pd

ser = pd.Series([1, 2, pd.NA] + list(range(1_000_000)), dtype="Int64")
ser.value_counts(dropna=False)  # also timed with dropna=True

# old dropna=False
# 99.6 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# new dropna=False
# 46.1 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# old dropna=True
# 138 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# new dropna=True
# 44.6 ms ± 439 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
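
For readers unfamiliar with the two code paths timed above, here is a minimal sketch of what `dropna` changes for a masked (`Int64`) Series. The tiny input and the variable names are illustrative assumptions, not part of the PR.

```python
import pandas as pd

# Small nullable-integer Series; missing values are tracked by a mask.
ser = pd.Series([1, 2, 2, pd.NA, pd.NA, pd.NA], dtype="Int64")

with_na = ser.value_counts(dropna=False)    # NA gets its own row
without_na = ser.value_counts(dropna=True)  # NA rows are skipped

# dropna=False accounts for every row; dropna=True only for non-missing ones.
total_with_na = int(with_na.sum())
total_without_na = int(without_na.sum())
na_count = int(ser.isna().sum())

print(total_with_na, total_without_na, na_count)
```

The difference between the two totals equals the number of masked entries, which is the part of the work that depends on the mask.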

import numpy as np

data = np.array(list(range(1_000_000)))
pd.Series(data, dtype="Int64")

# old Series(numpy_array, dtype="Int64")
# 45.8 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# new
# 63.8 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
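
The constructor numbers above come from IPython's `%timeit`. A rough way to reproduce the measurement in a plain script is sketched below, assuming the standard-library `timeit` module as a stand-in; the helper name `build_series` and the repeat counts are arbitrary choices, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 1_000_000
data = np.arange(N)


def build_series():
    # Construct a nullable-integer Series directly from a NumPy integer array.
    return pd.Series(data, dtype="Int64")


# Run 10 calls per measurement, repeat 3 times, and report the best per-call time.
timings = timeit.repeat(build_series, number=10, repeat=3)
best_ms = min(timings) / 10 * 1e3
print(f"best of 3: {best_ms:.3f} ms per call")

result = build_series()
```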

cc @jorisvandenbossche

jorisvandenbossche

Thanks, looks good!

jorisvandenbossche · 2022-09-02T20:38:43Z

asv_bench/benchmarks/series_methods.py

+        self.data = np.array(list(range(1_000_000)))
+
+    def time_constructor(self):
+        Series(data=self.data, dtype="Int64")


I think we already have a benchmark for this in array.py (IntegerArray::time_from_integer_array), but it uses a tiny array and so won't cover this aspect. Maybe add a version with a larger array to the existing benchmark (or just make the array in that benchmark bigger)?

Thx, I forgot to check for Series after finding nothing for value_counts. I increased the array size in the existing benchmark. 4 values is probably too small to show anything beyond the overhead from other calls that don't depend on array size.
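
A sketch of what a larger-array variant of such a benchmark could look like. asv benchmarks are plain classes with a `setup` method and `time_*` methods; the class name `IntegerArrayLarge` and the size `100_000` are hypothetical and are not taken from the PR or from the existing array.py.

```python
import numpy as np
import pandas as pd


class IntegerArrayLarge:
    # Hypothetical benchmark class; the name and size are assumptions.
    def setup(self):
        # Large enough that per-element work dominates fixed call overhead.
        self.values_integer = np.arange(100_000)

    def time_from_integer_array(self):
        # Build a masked integer array from a plain NumPy integer array.
        pd.array(self.values_integer, dtype="Int64")


# asv would normally drive this; calling it by hand checks that it runs.
bench = IntegerArrayLarge()
bench.setup()
bench.time_from_integer_array()
```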

jorisvandenbossche · 2022-09-02T20:40:33Z

asv_bench/benchmarks/series_methods.py

@@ -166,6 +175,19 @@ def time_value_counts(self, N, dtype):
        self.s.value_counts()


+class ValueCountsEa:


Suggested change

-class ValueCountsEa:
+class ValueCountsEA:

Small nitpick: I'm not sure we are consistent about this, but since we typically use EA as the abbreviation, I would keep it capitalized here.

Done, will keep it in mind for future PRs.

PERF: Performance improvement value_counts for masked arrays

1de9e86

phofl and others added 3 commits August 31, 2022 23:45

Add gh ref

a971a04

Merge branch 'main' into perf_nullable_value_counts

468f238

Merge branch 'main' into perf_nullable_value_counts

c536467

jorisvandenbossche reviewed Sep 2, 2022

phofl added 2 commits September 2, 2022 23:31

Adress asvs

29687c1

Merge remote-tracking branch 'upstream/main' into perf_nullable_value_counts

64ad80e

mroeschke approved these changes Sep 9, 2022

phofl merged commit 191557d into pandas-dev:main Sep 9, 2022

phofl deleted the perf_nullable_value_counts branch September 9, 2022 19:10

phofl added this to the 1.6 milestone Sep 9, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

PERF: Performance improvement value_counts for masked arrays (pandas-dev#48338)

78da268
