Skip to content

PERF: 10x speedup in Series/DataFrame construction for lists of ints #24647

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 6, 2019

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Jan 5, 2019

This PR is a minor tweak to the int64/uint64 overflow fix added in #18624

Simply casting to an int after doing a typecheck is sufficient for the compiler to generate a 10x speedup:

$ asv compare upstream/master HEAD --sort ratio -s

Benchmarks that have improved:

       before           after         ratio
     [f074abef]       [80641ddf]
     <series_list_int_speedup~1>       <series_list_int_speedup>
           failed          7.39±0s      n/a  strings.Dummies.time_get_dummies
-        61.7±3ms       11.7±0.5ms     0.19  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, True, 'int')
-        63.0±2ms       11.1±0.3ms     0.18  ctors.SeriesConstructors.time_series_constructor(<function arr_dict>, False, 'int')
-        55.8±2ms       5.37±0.2ms     0.10  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
-        55.3±5ms       4.84±0.2ms     0.09  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')

This is how maybe_convert_numeric() already handles ints, so this just brings maybe_convert_object() back into alignment.

I believe this would yield a similar speedup for DataFrames but we don't have any benchmarks explicitly testing as such. However, the get_dummies() benchmark involves expanding to a DataFrame and gets a speedup of similar magnitude (not visible as it previously would time out after 30s).

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@jreback jreback added the Performance Memory or execution speed performance label Jan 5, 2019
@jreback jreback added this to the 0.24.0 milestone Jan 5, 2019
@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

nice. ping on green.

@@ -2011,7 +2011,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
floats[i] = <float64_t>val
complexes[i] = <double complex>val
if not seen.null_:
seen.saw_int(int(val))
val = int(val)
seen.saw_int(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can saw_int's signature be tightened up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can, but I didn't see any significant difference when changing it to cdef inline saw_int(self, int val), so took that out to keep this minimal. That call is only used in this file, and only a few times at that so it'd be fine to tighten it up.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

actually there are a few more cases where the same fix could be applied

@qwhelan
Copy link
Contributor Author

qwhelan commented Jan 5, 2019

@jreback Just to clarify - the speedup comes from preventing an object-to-object comparison for val > oUINT_MAX and the like (which is avoided in saw_int(int(val)) by casting the input). I think this is the only change needed for that precise case, but happy to implement any that I'm missing here.

@codecov
Copy link

codecov bot commented Jan 5, 2019

Codecov Report

Merging #24647 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24647      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52377    52380       +3     
==========================================
+ Hits        48385    48387       +2     
- Misses       3992     3993       +1
Flag Coverage Δ
#multiple 90.79% <ø> (-0.01%) ⬇️
#single 43.01% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimes.py 97.68% <0%> (-0.16%) ⬇️
pandas/core/arrays/timedeltas.py 88.09% <0%> (-0.16%) ⬇️
pandas/core/arrays/period.py 98.52% <0%> (-0.02%) ⬇️
pandas/core/indexes/datetimelike.py 98.52% <0%> (ø) ⬆️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 280a88f...73865f1. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented Jan 5, 2019

Codecov Report

Merging #24647 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24647      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52377    52380       +3     
==========================================
+ Hits        48385    48387       +2     
- Misses       3992     3993       +1
Flag Coverage Δ
#multiple 90.79% <ø> (-0.01%) ⬇️
#single 43.01% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/datetimes.py 97.68% <0%> (-0.16%) ⬇️
pandas/core/arrays/timedeltas.py 88.09% <0%> (-0.16%) ⬇️
pandas/core/arrays/period.py 98.52% <0%> (-0.02%) ⬇️
pandas/core/indexes/datetimelike.py 98.52% <0%> (ø) ⬆️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 280a88f...73865f1. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

maybe this was the one I was looking at. ok, ping on green.

@qwhelan
Copy link
Contributor Author

qwhelan commented Jan 6, 2019

@jreback ping

@jreback jreback merged commit 48c3ce5 into pandas-dev:master Jan 6, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants