
pd.Series.map performance #34948


Merged: 13 commits merged into pandas-dev:master on Sep 13, 2020

Conversation

@Rohith295 (Contributor) commented Jun 23, 2020

There are other places also to refactor to improve the performance, but this current change has greater impact aswell.

@topper-123 (Contributor)

Yeah, zip is surprisingly slow, though I think simply

keys = tuple(data.keys())
values = tuple(data.values())

should be even faster than your approach. Can you compare this to your PR?

Can you also give perf comparisons relative to master?
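The two options under discussion can be compared with a standalone timing sketch (illustrative only, not the pandas source; the helper names are made up):

```python
# Standalone sketch comparing the two ways of splitting a dict into
# keys and values that are discussed above.
from timeit import timeit

def via_zip(d):
    # One pass, but zip(*...) unpacks every (key, value) pair.
    keys, values = zip(*d.items())
    return keys, list(values)

def via_views(d):
    # Reads the dict views directly, skipping the per-pair unpacking.
    return tuple(d.keys()), list(d.values())

data = {i: i * 2 for i in range(100_000)}
assert via_zip(data) == via_views(data)  # same keys, same order, same values

t_zip = timeit(lambda: via_zip(data), number=20)
t_views = timeit(lambda: via_views(data), number=20)
print(f"zip: {t_zip:.3f}s  views: {t_views:.3f}s")
```

On CPython, reading the views directly tends to be faster because it avoids materializing and re-unpacking an intermediate sequence of pairs.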

@topper-123 topper-123 added the Performance Memory or execution speed performance label Jun 23, 2020
@topper-123 topper-123 added this to the 1.1 milestone Jun 23, 2020
@topper-123 topper-123 added Series Series data structure Constructors Series/DataFrame/Index/pd.array Constructors labels Jun 23, 2020
@Rohith295 (Contributor, Author)

Yeah, zip is surprisingly slow, though I think simply

keys = tuple(data.keys())
values = tuple(data.values())

should be even faster than your approach. Can you compare this to your PR?

Can you also give perf comparisons relative to master?

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@charlesdong1991 (Member) commented Jun 23, 2020

What about using a list instead of a tuple for both? Will it respect the order of the keys?

@Rohith295 (Contributor, Author) commented Jun 23, 2020

It kind of expects the keys as a tuple and the values as a list; with that, I can see the tests passing. The values must be in the same order as the keys: for data = {1: 2, 3: 2, 4: 5}, if keys come out as (3, 4, 1) then values must be [2, 5, 2]. But in any case, keys = tuple(data.keys()) and values = list(data.values()) should be fine, I think. Let me validate against the tests.

@charlesdong1991 (Member)

Hmm, why do the keys have to be a tuple? I just saw the else clause below your change in the file, where:

keys, values = [], []

so maybe there is no need for keys to be a tuple?

Sorry for commenting without looking at the codebase in detail, and apologies in advance if it does have to be a tuple.

@Rohith295 (Contributor, Author)

If you look at the code history:

    keys, values = zip(*data.items())
    values = list(values)

zip returns tuples for both the keys and the values, and the second line then converts the values to a list. But I agree that even in the else branch an empty list is assigned for both keys and values.
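The behavior of that original zip line is easy to check in isolation (a minimal demonstration, not the pandas source):

```python
data = {"a": 1, "b": 2, "c": 3}

# zip(*data.items()) transposes the (key, value) pairs; both results are tuples.
keys, values = zip(*data.items())
assert keys == ("a", "b", "c")
assert values == (1, 2, 3)   # a tuple, hence the list(values) conversion

values = list(values)        # the conversion done by the second line above
assert values == [1, 2, 3]
```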

@charlesdong1991 (Member)

Well, I think it's fine to leave both as lists.

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

@Rohith295 (Contributor, Author)

Well, I think it's fine to leave both as lists.

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

Yeah, I will look into it and try to generate the performance comparison accordingly.

@Rohith295 (Contributor, Author)

@charlesdong1991 Any idea how to deal with the failing type validation?

@charlesdong1991 (Member) commented Jun 23, 2020

Hmm, do you mean the CI? It seems you changed a list to a tuple. Can you try using a list for both and see if the error goes away?

@pep8speaks commented Jun 23, 2020

Hello @Rohith295! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-13 15:32:24 UTC

@Rohith295 (Contributor, Author)

Hmm, do you mean the CI? It seems you changed a list to a tuple. Can you try using a list for both and see if the error goes away?

Alright, I will check accordingly.

@Rohith295 (Contributor, Author)

But just follow your own idea to make the changes; you might also need to add a whatsnew entry and an asv benchmark for the performance comparison.

I am trying to generate the performance metrics. Can you suggest some practices for generating them properly?

@jreback (Contributor) left a review comment

Yeah, please provide an asv for this case, and also run the construction asvs to make sure we don't have any regressions elsewhere.

@Rohith295 (Contributor, Author)

Yeah, please provide an asv for this case, and also run the construction asvs to make sure we don't have any regressions elsewhere.

Alright, I will address the review comments and provide the asv.

@topper-123 (Contributor) commented Jun 23, 2020

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@Rohith295, is the failing test tests/series/methods/test_cov_corr.py::TestSeriesCorr::test_corr? That one fails for me after I apply my proposal if I run the test suite using xdist, but passes if I run the tests sequentially or only run that individual test. This could hint at a problem with the isolation of e.g. fixtures in the pandas test suite.

The Python docs say:

keys and values are iterated over in insertion order.

so values = data.values() should work, AFAICS.
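That guarantee is plain-Python behavior and easy to verify (dict iteration order is guaranteed by the language from Python 3.7, and was an implementation detail of CPython 3.6):

```python
data = {1: 2, 3: 2, 4: 5}

# keys(), values() and items() all iterate in insertion order, so the
# views stay positionally aligned with each other.
assert tuple(data.keys()) == (1, 3, 4)
assert list(data.values()) == [2, 2, 5]
assert list(data.items()) == list(zip(data.keys(), data.values()))
```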

@Rohith295 (Contributor, Author)

values = tuple(data.values()) --> this needs to be returned in the order of the keys. If I use it the suggested way, the tests fail.

@Rohith295, is the failing test tests/series/methods/test_cov_corr.py::TestSeriesCorr::test_corr? That one fails for me after I apply my proposal if I run the test suite using xdist, but passes if I run the tests sequentially or only run that individual test. This could hint at a problem with the isolation of e.g. fixtures in the pandas test suite.

The Python docs say:

keys and values are iterated over in insertion order.

so values = data.values() should work, AFAICS.

It's not only that one particular test; there were other failing tests as well, though I did not make a note of which ones. Also, as you mentioned, data.values() yields its output in insertion order. I still don't want to rely on that, because it was only implemented from Python 3.6 onward; what if someone is using an older version? I believe our code must be both backward and forward compatible.

@topper-123 (Contributor)

I still don't want to rely on that, because it was only implemented from Python 3.6 onward; what if someone is using an older version?

pandas supports Python 3.6.1 and higher, so that particular concern is not a problem. But OK, if tuple(data.values()) doesn't work currently, I'm fine with that issue being outside the scope of this PR.

@jreback jreback removed this from the 1.1 milestone Jun 25, 2020
@jreback (Contributor) commented Jun 25, 2020

We need an asv & a whatsnew note in order to merge this.

@Rohith295 (Contributor, Author)

I will most likely get it done today. Sorry for the delay, @jreback.

@Rohith295 (Contributor, Author) commented Jun 27, 2020

Guys, I am trying to run asv. If I start asv run, it takes a long time because it executes all the files under the benchmarks directory. Does anyone know how to run just a specific file under the benchmarks directory to capture the asv results? @jreback @topper-123 @charlesdong1991
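For what it's worth, asv can filter benchmarks by a regular expression instead of running the whole suite. Assuming the new benchmark lives in asv_bench/benchmarks/series_methods.py as series_methods.Map, something like the following should work:

```shell
# Run only the benchmarks whose names match the regex
asv run --bench series_methods.Map

# Or compare this branch against master for just those benchmarks
# (-f 1.1 flags changes larger than a 1.1x ratio as significant)
asv continuous -f 1.1 master HEAD -b series_methods.Map
```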

@jreback (Contributor) left a review comment

If you can, add a whatsnew note and an asv that captures this case. You can run just the relevant asvs (if we have any for map) and show the results of this one. Merge master and ping on green.

@Rohith295 (Contributor, Author)

If you can, add a whatsnew note and an asv that captures this case. You can run just the relevant asvs (if we have any for map) and show the results of this one. Merge master and ping on green.

Alright, I am on it.

@Rohith295 (Contributor, Author)

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 5dc5004 <pd.Series.map_performance> for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 5dc5004 <pd.Series.map_performance> into conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[ 0.00%] · For pandas commit 04e9e0a <pd.Series.map_performance^2> (round 1/2):
[ 0.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (series_methods.Map.time_map--).
[ 25.00%] · For pandas commit 5dc5004 <pd.Series.map_performance> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (series_methods.Map.time_map--).
[ 50.00%] · For pandas commit 5dc5004 <pd.Series.map_performance> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· series_methods.Map.time_map ok
[ 75.00%] ··· ======== ============ ============ =============
-- a
-------- ---------------------------------------
m object category int
======== ============ ============ =============
dict 3.20±0.1ms 1.95±0.1ms 1.72±0.03ms
Series 2.27±0.1ms 936±100μs 728±90μs
lambda 6.46±0.2ms 1.29±0.2ms 8.01±1ms
======== ============ ============ =============

[ 75.00%] · For pandas commit 04e9e0a <pd.Series.map_performance^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· series_methods.Map.time_map ok
[100.00%] ··· ======== ============ ============= ============
-- a
-------- ---------------------------------------
m object category int
======== ============ ============= ============
dict 3.87±0.6ms 2.44±0.2ms 2.93±0.2ms
Series 3.42±1ms 1.01±0.07ms 1.07±0.5ms
lambda 6.08±0.5ms 1.60±0.1ms 6.84±1ms
======== ============ ============= ============

       before           after         ratio
     [04e9e0af]       [5dc5004e]
     <pd.Series.map_performance^2>    <pd.Series.map_performance>
-     2.44±0.2ms       1.95±0.1ms     0.80  series_methods.Map.time_map('dict', 'category')
-     2.93±0.2ms      1.72±0.03ms     0.59  series_methods.Map.time_map('dict', 'int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@Rohith295 (Contributor, Author)

@jreback I added the asv benchmarking output here. Is this what was expected?

@jreback (Contributor) left a review comment

Do we have sufficient asvs for this?

Can you add a note in the 1.2 perf section?

@Rohith295 (Contributor, Author)

@jreback Yes, we do have sufficient asvs. I also added the comments as requested, as well as the release note under the perf section.

@jreback (Contributor) left a review comment

lgtm. minor comment in whatsnew, ping on green.

@jreback jreback added this to the 1.2 milestone Sep 13, 2020
@jreback jreback merged commit 00353d5 into pandas-dev:master Sep 13, 2020
@jreback (Contributor) commented Sep 13, 2020

thanks @Rohith295

@Rohith295 (Contributor, Author)

Thanks @jreback. I have learned so much from your review comments and also got the chance to understand more about the open source community.

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

Successfully merging this pull request may close these issues:

PERF: pd.Series.map too slow for huge dictionary