
pd.Series.map performance #34948


Merged: 13 commits merged into pandas-dev:master on Sep 13, 2020

Conversation

@Rohith295 (Contributor) commented Jun 23, 2020

There are other places also to refactor to improve the performance, but this current change has greater impact aswell.

@topper-123 (Contributor)

Yeah, zip is surprisingly slow, though I think simply

keys = tuple(data.keys())
values = tuple(data.values())

should be even faster than your approach. Can you compare this to your PR?

Can you also give perf comparisons relative to master?
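The two options under discussion can be compared with a standalone timing sketch (illustrative only, not the pandas source; the helper names are made up):

```python
# Standalone sketch comparing the two ways of splitting a dict into
# keys and values that are discussed above.
from timeit import timeit

def via_zip(d):
    # One pass, but zip(*...) unpacks every (key, value) pair.
    keys, values = zip(*d.items())
    return keys, list(values)

def via_views(d):
    # Reads the dict views directly, skipping the per-pair unpacking.
    return tuple(d.keys()), list(d.values())

data = {i: i * 2 for i in range(100_000)}
assert via_zip(data) == via_views(data)  # same keys, same order, same values

t_zip = timeit(lambda: via_zip(data), number=20)
t_views = timeit(lambda: via_views(data), number=20)
print(f"zip: {t_zip:.3f}s  views: {t_views:.3f}s")
```

On CPython, reading the views directly tends to be faster because it avoids materializing and re-unpacking an intermediate sequence of pairs.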

@topper-123 topper-123 added the Performance Memory or execution speed performance label Jun 23, 2020
@topper-123 topper-123 added this to the 1.1 milestone Jun 23, 2020
@topper-123 topper-123 added Series Series data structure Constructors Series/DataFrame/Index/pd.array Constructors labels Jun 23, 2020
@Rohith295 (Contributor, Author)

Yeah, zip is surprisingly slow, though I think simply

keys = tuple(data.keys())
values = tuple(data.values())

should be even faster than your approach. Can you compare this to your PR?

Can you also give perf comparisons relative to master?

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@charlesdong1991 (Member) commented Jun 23, 2020

What about using a list instead of a tuple for both? Will it respect the order of the keys?

@Rohith295 (Contributor, Author) commented Jun 23, 2020

It kind of expects the keys as a tuple and the values as a list; with that, I can see the tests passing. The values must be in the same order as the keys: for data = {1: 2, 3: 2, 4: 5}, if keys come out as (3, 4, 1) then values must be [2, 5, 2]. But in any case, keys = tuple(data.keys()) and values = list(data.values()) should be fine, I think. Let me validate against the tests.

@charlesdong1991 (Member)

Hmm, why do the keys have to be a tuple? I just saw the else clause below your change in the file, where:

keys, values = [], []

so maybe there is no need for keys to be a tuple?

Sorry for commenting without looking at the codebase in detail, and apologies in advance if it does have to be a tuple.

@Rohith295 (Contributor, Author)

If you look at the code history:

    keys, values = zip(*data.items())
    values = list(values)

zip returns tuples for both the keys and the values, and the second line then converts the values to a list. But I agree that even in the else branch an empty list is assigned for both keys and values.
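The behavior of that original zip line is easy to check in isolation (a minimal demonstration, not the pandas source):

```python
data = {"a": 1, "b": 2, "c": 3}

# zip(*data.items()) transposes the (key, value) pairs; both results are tuples.
keys, values = zip(*data.items())
assert keys == ("a", "b", "c")
assert values == (1, 2, 3)   # a tuple, hence the list(values) conversion

values = list(values)        # the conversion done by the second line above
assert values == [1, 2, 3]
```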

@charlesdong1991 (Member)

Well, I think it's fine to leave both as lists.

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

@Rohith295 (Contributor, Author)

Well, I think it's fine to leave both as lists.

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

Yeah, I will look into it and try to generate the performance comparison accordingly.

@Rohith295 (Contributor, Author)

@charlesdong1991 Any idea how to deal with the failing type validation?

@charlesdong1991 (Member) commented Jun 23, 2020

Hmm, do you mean the CI? It seems you changed a list to a tuple. Can you try using a list for both and see if the error goes away?

@pep8speaks commented Jun 23, 2020

Hello @Rohith295! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-13 15:32:24 UTC

@Rohith295 (Contributor, Author)

Hmm, do you mean the CI? It seems you changed a list to a tuple. Can you try using a list for both and see if the error goes away?

Alright, I will check accordingly.

@Rohith295 (Contributor, Author)

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

I am trying to generate the performance metrics. Can you suggest some practices for generating them properly?

@jreback (Contributor) left a review comment

Yeah, please provide an asv for this case, and also run the construction asvs to make sure we don't have any regressions elsewhere.

@Rohith295 (Contributor, Author)

Yeah, please provide an asv for this case, and also run the construction asvs to make sure we don't have any regressions elsewhere.

Alright, I will address the review comments and provide the asv.

@topper-123 (Contributor) commented Jun 23, 2020

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@Rohith295, is the failing test tests/series/methods/test_cov_corr.py::TestSeriesCorr::test_corr? That one fails for me after I apply my proposal if I run the test suite using xdist, but passes if I run the tests sequentially or only run that individual test. This could hint at a problem with the isolation of e.g. fixtures in the pandas test suite.

The Python docs say:

keys and values are iterated over in insertion order.

so values = data.values() should work, AFAICS.
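That guarantee is plain-Python behavior and easy to verify (dict iteration order is guaranteed by the language from Python 3.7, and was an implementation detail of CPython 3.6):

```python
data = {1: 2, 3: 2, 4: 5}

# keys(), values() and items() all iterate in insertion order, so the
# views stay positionally aligned with each other.
assert tuple(data.keys()) == (1, 3, 4)
assert list(data.values()) == [2, 2, 5]
assert list(data.items()) == list(zip(data.keys(), data.values()))
```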

@Rohith295 (Contributor, Author)

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@Rohith295, is the failing test tests/series/methods/test_cov_corr.py::TestSeriesCorr::test_corr? That one fails for me after I apply my proposal if I run the test suite using xdist, but passes if I run the tests sequentially or only run that individual test. This could hint at a problem with the isolation of e.g. fixtures in the pandas test suite.

The Python docs say:

keys and values are iterated over in insertion order.

so values = data.values() should work, AFAICS.

It's not only that one particular test; there were other failing tests as well, though I did not make a note of which ones. Also, as you mentioned, data.values() yields its output in insertion order. I still don't want to rely on that, because it was only implemented from Python 3.6 onward; what if someone is using an older version? I believe our code must be both backward and forward compatible.

@topper-123 (Contributor)

I still don't want to rely on that, because it was only implemented from Python 3.6 onward; what if someone is using an older version?

pandas supports Python 3.6.1 and higher, so that particular concern is not a problem. But OK, if tuple(data.values()) doesn't work currently, I'm fine with that issue being outside the scope of this PR.

@jreback jreback removed this from the 1.1 milestone Jun 25, 2020
@jreback (Contributor) commented Jun 25, 2020

We need an asv & a whatsnew note in order to merge this.

@Rohith295 (Contributor, Author)

I will most likely get it done today. Sorry for the delay, @jreback.

@Rohith295 (Contributor, Author) commented Jun 27, 2020

Guys, I am trying to run asv. If I start asv run, it takes a long time because it executes all the files under the benchmarks directory. Does anyone know how to run just a specific file under the benchmarks directory to capture the asv results? @jreback @topper-123 @charlesdong1991
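For what it's worth, asv can filter benchmarks by a regular expression instead of running the whole suite. Assuming the new benchmark lives in asv_bench/benchmarks/series_methods.py as series_methods.Map, something like the following should work:

```shell
# Run only the benchmarks whose names match the regex
asv run --bench series_methods.Map

# Or compare this branch against master for just those benchmarks
# (-f 1.1 flags changes larger than a 1.1x ratio as significant)
asv continuous -f 1.1 master HEAD -b series_methods.Map
```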

@jreback (Contributor) left a review comment

If you can, add a whatsnew note and an asv that captures this case. You can run just the relevant asvs (if we have any for map) and show the results of this one. Merge master and ping on green.

@Rohith295 (Contributor, Author)

If you can, add a whatsnew note and an asv that captures this case. You can run just the relevant asvs (if we have any for map) and show the results of this one. Merge master and ping on green.

Alright, I am on it.

@Rohith295 (Contributor, Author)

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 5dc5004 <pd.Series.map_performance> for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 5dc5004 <pd.Series.map_performance> into conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pandas commit 04e9e0a <pd.Series.map_performance^2> (round 1/2):
[ 0.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (series_methods.Map.time_map--).
[ 25.00%] · For pandas commit 5dc5004 <pd.Series.map_performance> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (series_methods.Map.time_map--).
[ 50.00%] · For pandas commit 5dc5004 <pd.Series.map_performance> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· series_methods.Map.time_map ok
[ 75.00%] ··· ======== ============ ============ =============
-- a
-------- ---------------------------------------
m object category int
======== ============ ============ =============
dict 3.20±0.1ms 1.95±0.1ms 1.72±0.03ms
Series 2.27±0.1ms 936±100μs 728±90μs
lambda 6.46±0.2ms 1.29±0.2ms 8.01±1ms
======== ============ ============ =============

[ 75.00%] · For pandas commit 04e9e0a <pd.Series.map_performance^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· series_methods.Map.time_map ok
[100.00%] ··· ======== ============ ============= ============
-- a
-------- ---------------------------------------
m object category int
======== ============ ============= ============
dict 3.87±0.6ms 2.44±0.2ms 2.93±0.2ms
Series 3.42±1ms 1.01±0.07ms 1.07±0.5ms
lambda 6.08±0.5ms 1.60±0.1ms 6.84±1ms
======== ============ ============= ============

       before           after         ratio
     [04e9e0af]       [5dc5004e]
     <pd.Series.map_performance^2>    <pd.Series.map_performance>
-     2.44±0.2ms       1.95±0.1ms     0.80  series_methods.Map.time_map('dict', 'category')
-     2.93±0.2ms      1.72±0.03ms     0.59  series_methods.Map.time_map('dict', 'int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@Rohith295 (Contributor, Author)

@jreback I added the asv benchmarking output here. Is this what was expected?

@jreback (Contributor) left a review comment

Do we have sufficient asvs for this?

Can you add a note in the 1.2 perf section?

@Rohith295 (Contributor, Author)

@jreback Yes, we do have sufficient asvs. I also added the comments as requested, as well as the release note under the perf section.

@jreback (Contributor) left a review comment

lgtm. minor comment in whatsnew, ping on green.

@jreback jreback added this to the 1.2 milestone Sep 13, 2020
@jreback jreback merged commit 00353d5 into pandas-dev:master Sep 13, 2020
@jreback (Contributor) commented Sep 13, 2020

thanks @Rohith295

@Rohith295 (Contributor, Author)

Thanks @jreback. I have learned so much from your review comments and also got the chance to understand more about the open source community.

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

Successfully merging this pull request may close these issues:

PERF: pd.Series.map too slow for huge dictionary