ENH: Allow parameters method and min_periods in DataFrame.corrwith() #15573

Closed
anthonyho wants to merge 1 commit

Conversation

@anthonyho anthonyho commented Mar 4, 2017

Adds the keyword parameters method and min_periods to DataFrame.corrwith(), which allow correlation methods other than Pearson to be used. See #9490.
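
For readers who want to see what the proposed keywords would mean in practice, here is a minimal sketch. It is not the code in this PR: `corrwith_sketch` is a hypothetical helper that loops over the shared columns, drops incomplete pairs, and computes Spearman as the Pearson correlation of the ranks so that it needs no SciPy. It assumes both frames are entirely numeric.

```python
import numpy as np
import pandas as pd


def corrwith_sketch(left, right, method="pearson", min_periods=1):
    """Hypothetical helper: column-wise correlation of two numeric DataFrames."""
    if method not in ("pearson", "spearman"):
        raise ValueError("this sketch only handles 'pearson' and 'spearman'")
    # Keep only the rows and columns that both frames share.
    left, right = left.align(right, join="inner")
    result = {}
    for col in left.columns:
        # Pairwise-complete observations for this column.
        pair = pd.concat([left[col], right[col]], axis=1, keys=["a", "b"]).dropna()
        if len(pair) < min_periods:
            result[col] = np.nan
            continue
        if method == "spearman":
            # Spearman correlation is the Pearson correlation of the ranks.
            pair = pair.rank()
        result[col] = pair["a"].corr(pair["b"])
    return pd.Series(result)


df1 = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [4.0, 3.0, 2.0, 1.0]})
df2 = pd.DataFrame({"x": [1.0, 4.0, 9.0, 16.0], "y": [1.0, 2.0, 3.0, 4.0]})
pearson = corrwith_sketch(df1, df2)
spearman = corrwith_sketch(df1, df2, method="spearman")
```

Column `x` is monotonic but not linear in the two frames, so the rank-based result is exactly 1 while the Pearson result is slightly lower. A column with fewer than `min_periods` complete pairs yields NaN.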

@@ -157,6 +157,7 @@ objects.
df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
df1.corrwith(df2)
df2.corrwith(df1, axis=1)
df2.corrwith(df1, axis=1, method='kendall')
Contributor

add a versionadded tag (and a small comment here)


correl = num / dom
correl = Series({col: nanops.nancorr(left[col].values,
right[col].values,
Contributor

this is going to be very slow

we need to rework nancorr to do this instead

Author

@anthonyho anthonyho Mar 5, 2017

I think the new implementation (which calls nancorr, which in turn calls the numpy/scipy correlation functions) is actually significantly faster than the current implementation (which manually computes the Pearson correlation using DataFrame.mean(), DataFrame.sum(), and DataFrame.std()).

For example:

Current implementation:

>>> import pandas as pd; import timeit
>>> pd.__version__
u'0.19.2'
>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> timeit.timeit(lambda: iris.corrwith(iris), number=10000)
50.891642808914185
>>> timeit.timeit(lambda: iris.T.corrwith(iris.T), number=10000)
42.0677649974823

New implementation:

>>> import pandas as pd; import timeit
>>> pd.__version__
'0.19.0+539.g0b77680.dirty'
>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> timeit.timeit(lambda: iris.corrwith(iris, method='pearson'), number=10000)
28.622286081314087
>>> timeit.timeit(lambda: iris.T.corrwith(iris.T, method='pearson'), number=10000)
21.898916959762573

I'm pretty new to this, so please let me know if I'm missing anything here.
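
To make the comparison concrete, the "manual" Pearson approach described above can be sketched with whole-frame reductions. This is an illustration under assumptions, not a copy of the pandas source: `vectorized_pearson` is a hypothetical name, the inputs are assumed to be all-numeric, and only the variable names `num` and `dom` are taken from the diff. The per-column loop built on `Series.corr` stands in for the looped approach.

```python
import numpy as np
import pandas as pd


def vectorized_pearson(left, right):
    """Hypothetical helper: Pearson correlation per column via frame-wide reductions."""
    left, right = left.align(right, join="inner")
    # Propagate NaN so each column pair uses only rows present in both frames.
    left = left + right * 0
    right = right + left * 0
    # Demean each column, then form covariance sum and normalizer.
    ldem = left - left.mean()
    rdem = right - right.mean()
    num = (ldem * rdem).sum()
    dom = (left.count() - 1) * left.std() * right.std()
    return num / dom


rng = np.random.default_rng(0)
a = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("abcd"))
b = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("abcd"))
a.iloc[3, 1] = np.nan  # one missing value to exercise the NaN handling

fast = vectorized_pearson(a, b)
slow = pd.Series({col: a[col].corr(b[col]) for col in a.columns})
```

Both routes should agree to floating-point precision. They differ in cost: the vectorized version does a fixed number of frame-wide operations, while the loop does one correlation call per column, which is why the number of columns matters for timing.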

Contributor

look thru the benchmarks and pls add some asv as appropriate

include wide and tall data

on wide data this will be slower
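
A benchmark along the lines requested might look like the sketch below. It follows the usual asv conventions (a class with `params`, `setup`, and `time_*` methods), but the class name, frame sizes, and seed are assumptions chosen for illustration, not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class CorrWith:
    # asv runs every time_* method once per value in params.
    params = ["tall", "wide"]
    param_names = ["shape"]

    def setup(self, shape):
        # Illustrative sizes only; a real benchmark would likely use larger frames.
        nrows, ncols = (1000, 10) if shape == "tall" else (10, 1000)
        rng = np.random.default_rng(1234)
        self.df = pd.DataFrame(rng.standard_normal((nrows, ncols)))
        self.other = pd.DataFrame(rng.standard_normal((nrows, ncols)))

    def time_corrwith_columns(self, shape):
        # Correlate matching columns (the default axis).
        self.df.corrwith(self.other)

    def time_corrwith_rows(self, shape):
        # Correlate matching rows.
        self.df.corrwith(self.other, axis=1)
```

Timing both shapes matters because a per-column loop scales with the number of columns, so a wide frame is where a slowdown relative to the vectorized path would show up.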

@jreback
Contributor

jreback commented Apr 3, 2017

can you update

@jreback
Contributor

jreback commented May 7, 2017

can you rebase, add some benchmarks to asv and show them.

@jreback jreback added the "Numeric Operations" (Arithmetic, Comparison, and Logical operations) and "Enhancement" labels May 7, 2017
@jreback
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Contributor

jreback commented Aug 17, 2017

closing as stale

@jreback jreback closed this Aug 17, 2017
Labels
Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations
Development

Successfully merging this pull request may close these issues.

Allow method keyword for DataFrame.corrwith()
2 participants