PERF: 10x speedup in Series/DataFrame construction for lists of ints #24647

qwhelan · 2019-01-05T21:31:56Z

This PR is a minor tweak to the int64/uint64 overflow fix added in #18624.

Casting to an int after the type check gives the compiler enough type information to generate code that is roughly 10x faster for lists of ints:

$ asv compare upstream/master HEAD --sort ratio -s

Benchmarks that have improved:

       before           after         ratio
     [f074abef]       [80641ddf]
     <series_list_int_speedup~1>       <series_list_int_speedup>
           failed          7.39±0s      n/a  strings.Dummies.time_get_dummies
-        61.7±3ms       11.7±0.5ms     0.19  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, True, 'int')
-        63.0±2ms       11.1±0.3ms     0.18  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, False, 'int')
-        55.8±2ms       5.37±0.2ms     0.10  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
-        55.3±5ms       4.84±0.2ms     0.09  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
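
For readers unfamiliar with asv output, the ratio column is the "after" time divided by the "before" time, so smaller is better. The short sketch below recomputes it from the rounded means in the table above; the row labels and the helper name `ratio` are made up for illustration and are not part of asv.

```python
# Recompute the asv "ratio" column from the rounded means shown in the table.
# Times are in milliseconds; labels are shorthand for the benchmark parameters.
rows = [
    ("arr_dict, True, int", 61.7, 11.7),
    ("arr_dict, False, int", 63.0, 11.1),
    ("list, True, int", 55.8, 5.37),
    ("list, False, int", 55.3, 4.84),
]


def ratio(before_ms: float, after_ms: float) -> float:
    """Return after/before, the same quantity asv reports as 'ratio'."""
    return after_ms / before_ms


for label, before_ms, after_ms in rows:
    r = ratio(before_ms, after_ms)
    # 1 / r is the speedup factor implied by the ratio.
    print(f"{label}: ratio={r:.2f}, speedup={1 / r:.1f}x")
```

On these numbers the list cases come out at a bit over 10x, while the arr_dict cases are closer to 5x, which is why the title's "10x" refers to lists of ints.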

This is how maybe_convert_numeric() already handles ints, so this just brings maybe_convert_objects() back into alignment.

I believe this would yield a similar speedup for DataFrames but we don't have any benchmarks explicitly testing as such. However, the get_dummies() benchmark involves expanding to a DataFrame and gets a speedup of similar magnitude (not visible as it previously would time out after 30s).

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2019-01-05T21:58:02Z

nice. ping on green.

jbrockmendel · 2019-01-05T22:00:50Z

pandas/_libs/lib.pyx

@@ -2011,7 +2011,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
            floats[i] = <float64_t>val
            complexes[i] = <double complex>val
            if not seen.null_:
-                seen.saw_int(int(val))
+                val = int(val)
+                seen.saw_int(val)


can saw_int's signature be tightened up?

qwhelan · 2019-01-05

It can, but I didn't see any significant difference when changing it to cdef inline saw_int(self, int val), so I took that out to keep this minimal. That call is only used in this file, and only a few times at that, so it'd be fine to tighten it up.

jreback · 2019-01-05T22:04:16Z

actually there are a few more cases where the same fix could be applied

qwhelan · 2019-01-05T22:34:22Z

@jreback Just to clarify - the speedup comes from preventing an object-to-object comparison for val > oUINT_MAX and the like (which is avoided in saw_int(int(val)) by casting the input). I think this is the only change needed for that precise case, but happy to implement any that I'm missing here.
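
To make the range checks being discussed concrete, here is a pure-Python model of the idea: convert the value to an int once, then reuse that int for every bounds comparison and for the bookkeeping call. This is only a sketch of the logic, not pandas' Cython code. The names `SeenSketch`, `classify`, and the bound constants are hypothetical, and the speedup itself is a Cython code-generation effect that plain Python cannot reproduce.

```python
# Hypothetical pure-Python model of int64/uint64 range tracking.
# Names and structure are illustrative assumptions, not pandas internals.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


class SeenSketch:
    """Track which kinds of integers have been observed so far."""

    def __init__(self) -> None:
        self.sint_ = False  # a negative integer was seen
        self.uint_ = False  # an integer above INT64_MAX was seen

    def saw_int(self, val: int) -> None:
        """Record one integer that is already known to fit in 64 bits."""
        if val < 0:
            self.sint_ = True
        elif val > INT64_MAX:
            self.uint_ = True


def classify(values) -> str:
    """Return the narrowest of 'int64', 'uint64', or 'object' for the values."""
    seen = SeenSketch()
    for raw in values:
        val = int(raw)  # convert once, then reuse for every comparison
        if val > UINT64_MAX or val < INT64_MIN:
            return "object"  # does not fit in any 64-bit integer type
        seen.saw_int(val)
    if seen.sint_ and seen.uint_:
        return "object"  # negatives and huge positives cannot share a dtype
    return "uint64" if seen.uint_ else "int64"
```

For example, `classify([1, 2, 3])` returns `"int64"`, while mixing `-1` with `2**63` falls back to `"object"`.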

codecov · 2019-01-05T22:38:39Z

Codecov Report

Merging #24647 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24647      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52377    52380       +3     
==========================================
+ Hits        48385    48387       +2     
- Misses       3992     3993       +1

Flag	Coverage Δ
#multiple	`90.79% <ø> (-0.01%)`	⬇️
#single	`43.01% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/datetimes.py	`97.68% <0%> (-0.16%)`	⬇️
pandas/core/arrays/timedeltas.py	`88.09% <0%> (-0.16%)`	⬇️
pandas/core/arrays/period.py	`98.52% <0%> (-0.02%)`	⬇️
pandas/core/indexes/datetimelike.py	`98.52% <0%> (ø)`	⬆️
pandas/util/testing.py	`88.09% <0%> (+0.09%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 280a88f...73865f1.

jreback · 2019-01-05T22:47:11Z

maybe this was the one I was looking at. ok, ping on green.

qwhelan · 2019-01-06T03:41:49Z

@jreback ping

Commit 73865f1: PERF: provide additional type information to compiler in maybe_conver… …t_objects

jreback added the Performance (Memory or execution speed performance) label Jan 5, 2019

jreback added this to the 0.24.0 milestone Jan 5, 2019

jbrockmendel reviewed Jan 5, 2019

jreback merged commit 48c3ce5 into pandas-dev:master Jan 6, 2019

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019: aa87ec9, PERF: provide additional type information to compiler in maybe_conver… …t_objects (pandas-dev#24647)

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019: 04efc21, PERF: provide additional type information to compiler in maybe_conver… …t_objects (pandas-dev#24647)
