Improve dictionary map performance on category series, fixes #23785 #26015

Merged
merged 3 commits into from
May 1, 2019

Conversation

rtlee9
Contributor

@rtlee9 rtlee9 commented Apr 6, 2019

Uses the built-in categorical series mapper when mapping a categorical series with a dictionary or a series (as opposed to a lambda function), instead of reindexing, in order to improve performance.
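
For readers unfamiliar with why mapping through the categories can be cheaper, here is a minimal sketch of the general idea. It is not the code from this PR: it only illustrates looking up each unique category once and expanding the result through the integer codes. The variable names are made up for illustration, and the sketch assumes there are no missing values (a code of -1 would need separate handling).

```python
import pandas as pd

x = pd.Series(list("abcd") * 1000).astype("category")
mapper = {"a": 1, "b": 2, "c": 3, "d": 4}

# Look up only the 4 unique categories instead of all 4000 elements.
mapped_categories = x.cat.categories.map(mapper)

# Expand back to full length using the integer codes of each element.
# Assumption: no missing values, so every code is a valid position.
codes = x.cat.codes.to_numpy()
fast = pd.Series(mapped_categories.take(codes), index=x.index)

# Reference result: one dictionary lookup per element.
slow = pd.Series([mapper[v] for v in x], index=x.index)

assert fast.equals(slow)
```

The dictionary is consulted once per category rather than once per row, so the benefit should grow as the ratio of rows to categories grows.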

benchmarking with asv

I added some new benchmarks to test mapping performance. Mapping a categorical series is roughly 2x faster with a dict (ratio 0.51) and roughly 3x faster with a Series (ratio 0.31):

       before           after         ratio
     [181f972d]       [184feb77]
     <master~1>                 
-     1.90±0.01ms          976±8μs     0.51  series_methods.Map.time_map('dict', 'category')
-     1.33±0.01ms          418±2μs     0.31  series_methods.Map.time_map('Series', 'category')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
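
The benchmark names in the asv output above (`series_methods.Map.time_map('dict', 'category')`) suggest a parameterized benchmark class. The following is a hypothetical sketch of what such a class could look like, not the benchmark actually added in this PR. The parameter values and the data built in `setup` are assumptions chosen to mirror the example from #23785.

```python
import pandas as pd


class Map:
    # asv calls setup() and time_map() once per parameter combination.
    # These parameter lists are illustrative assumptions.
    params = (["dict", "Series"], ["object", "category"])
    param_names = ["mapper", "dtype"]

    def setup(self, mapper, dtype):
        keys = list("abcd")
        lookup = {k: i for i, k in enumerate(keys, start=1)}
        # Use the dict directly, or wrap it in a Series keyed by category.
        self.map_data = lookup if mapper == "dict" else pd.Series(lookup)
        self.s = pd.Series(keys * 1000).astype(dtype)

    def time_map(self, mapper, dtype):
        # Only the body of a time_* method is timed by asv.
        self.s.map(self.map_data)
```

Keeping data construction in `setup` means only the `map` call itself is measured.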

benchmarking using example from #23785

upstream master

import pandas as pd
print(pd.__version__)
# 0.24.2
x = pd.Series(list('abcd') * 1000).astype('category')
mapper = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
%timeit x.map(mapper)
# 904 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.map(lambda a: mapper[a])
# 254 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

this diff

import pandas as pd
print(pd.__version__)
# 0.25.0.dev0+379.g184feb77e
x = pd.Series(list('abcd') * 1000).astype('category')
mapper = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
%timeit x.map(mapper)
# 622 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x.map(lambda a: mapper[a])
# 256 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jreback jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Apr 7, 2019
@jreback
Contributor

jreback commented Apr 9, 2019

can you merge master and update

@rtlee9
Contributor Author

rtlee9 commented Apr 9, 2019 via email

@rtlee9 rtlee9 force-pushed the series_map_category_perf branch from f77d00d to 4559993 Compare April 10, 2019 04:16
@codecov

codecov bot commented Apr 10, 2019

Codecov Report

Merging #26015 into master will decrease coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #26015      +/-   ##
==========================================
- Coverage   91.84%   91.81%   -0.03%     
==========================================
  Files         175      175              
  Lines       52517    52553      +36     
==========================================
+ Hits        48232    48254      +22     
- Misses       4285     4299      +14
Flag Coverage Δ
#multiple 90.38% <100%> (-0.02%) ⬇️
#single 40.72% <50%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 97.79% <100%> (ø) ⬆️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/compat/__init__.py 75.83% <0%> (-2.08%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️
pandas/tseries/holiday.py 93.17% <0%> (-0.04%) ⬇️
pandas/plotting/_core.py 83.85% <0%> (-0.02%) ⬇️
pandas/tseries/frequencies.py 97.68% <0%> (-0.01%) ⬇️
pandas/tseries/offsets.py 96.69% <0%> (-0.01%) ⬇️
pandas/core/generic.py 93.54% <0%> (ø) ⬆️
pandas/core/apply.py 98.61% <0%> (ø) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d66bdf...4559993. Read the comment docs.

@codecov

codecov bot commented Apr 10, 2019

Codecov Report

Merging #26015 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #26015      +/-   ##
==========================================
- Coverage   91.98%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52376    52378       +2     
==========================================
- Hits        48178    48175       -3     
- Misses       4198     4203       +5
Flag Coverage Δ
#multiple 90.53% <100%> (-0.01%) ⬇️
#single 40.72% <50%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 97.98% <100%> (-0.22%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6324be...3292621. Read the comment docs.

@jreback
Contributor

jreback commented Apr 20, 2019

can you merge master

@rtlee9 rtlee9 force-pushed the series_map_category_perf branch 2 times, most recently from 5a07694 to 9d5f39a Compare April 22, 2019 05:21
@jreback jreback added this to the 0.25.0 milestone Apr 28, 2019
@jreback
Contributor

jreback commented Apr 28, 2019

looks good. can you merge master; ping on green.

@rtlee9 rtlee9 force-pushed the series_map_category_perf branch from 9d5f39a to 7ab8e6d Compare April 29, 2019 05:27
@rtlee9
Contributor Author

rtlee9 commented Apr 30, 2019

@jreback merged master and all checks green

@jreback
Contributor

jreback commented Apr 30, 2019

can you merge master; ping on green.

@rtlee9 rtlee9 force-pushed the series_map_category_perf branch from cc49da6 to 3292621 Compare April 30, 2019 15:02
@rtlee9
Contributor Author

rtlee9 commented Apr 30, 2019

@jreback merged and green

@jreback jreback merged commit 80be9b5 into pandas-dev:master May 1, 2019
@jreback
Contributor

jreback commented May 1, 2019

thanks @rtlee9

Labels
Categorical (Categorical Data Type), Performance (Memory or execution speed performance)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: series.map(arg) for category is slow if arg is a dict
2 participants