ENH: ignore_index for Series corr #49617

Closed
1 of 3 tasks
blazespinnaker opened this issue Nov 10, 2022 · 6 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@blazespinnaker

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Sometimes you can get pretty strange results with simple operations on Series.corr(Series) because of index mismatches. An ignore_index would be very useful.

Feature Description

e.g., s1.corr(s2, ignore_index=True)

Alternative Solutions

None

Additional Context

No response

@blazespinnaker added the Enhancement and Needs Triage labels on Nov 10, 2022
@phofl
Member

phofl commented Nov 16, 2022

Hi, thanks for your report. So the keyword should ignore the index during the operation, not in the result? In that case we need a different name, since ignore_index already has a different meaning in pandas. Can you show an example?

@blazespinnaker
Author

print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))

0    2317.0
1    2293.0
2    1190.0
3     972.0
4    1391.0
Name: r6000, dtype: float64
0.0    2317.0
1.0    2293.0
3.0    1190.0
4.0     972.0
5.0    1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999

Here's an example of the issue. Another approach might be to spit out a warning, though that can get noisy. The benefit of a parameter to not do the automatic alignment is that it serves as a warning. A final approach would just be a note in the documentation.
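To make the discrepancy above easier to reproduce, here is a minimal sketch that rebuilds the two Series from the printed values. The construction is an assumption, since the code that originally produced n1 and n2 is not shown. Spearman correlation is computed as the Pearson correlation of the ranks so that the sketch runs without SciPy, and Series.align is used only to make visible which labels survive an inner join on the index.

```python
import pandas as pd

# Values copied from the printed output above; n2's index skips label 2.
n1 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0], name="r6000")
n2 = pd.Series(
    [2317.0, 2293.0, 1190.0, 972.0, 1391.0],
    index=[0.0, 1.0, 3.0, 4.0, 5.0],
    name="6000",
)

# Keep only the labels present in both Series (inner join on the index).
left, right = n1.align(n2, join="inner")
print(list(left.index))

# Spearman correlation is the Pearson correlation of the ranks.
rho_aligned = left.rank().corr(right.rank())
print(rho_aligned)

# Pairing the raw values by position instead ignores the labels entirely.
rho_positional = pd.Series(n1.to_numpy()).rank().corr(pd.Series(n2.to_numpy()).rank())
print(rho_positional)
```

Paired by position, the two Series hold identical values, which matches the SciPy result of roughly 1.0. Paired by label, only the four shared labels remain, and the values at those labels no longer line up, which gives the 0.8 reported by pandas.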

@jreback
Contributor

jreback commented Nov 17, 2022

all operations in pandas align
if you don't want to align, then use .to_numpy()

-1 on changing anything - this would be a very special case

@blazespinnaker
Author

blazespinnaker commented Nov 17, 2022

I think reset_index() is actually the right answer here. to_numpy() would require scipy.
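Both workarounds mentioned in this thread can be sketched as follows. This is only an illustration under stated assumptions: the sample data is rebuilt from the printed output earlier in the thread, both Series are assumed to have the same length, and rank correlation is computed as the Pearson correlation of ranks to avoid a SciPy dependency. Note that reset_index needs drop=True on a Series, otherwise it returns a DataFrame with the old index as a column.

```python
import pandas as pd

n1 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0])
n2 = pd.Series(
    [2317.0, 2293.0, 1190.0, 972.0, 1391.0],
    index=[0.0, 1.0, 3.0, 4.0, 5.0],
)

# Option 1: drop both indexes so that alignment becomes purely positional.
a = n1.reset_index(drop=True)
b = n2.reset_index(drop=True)
rho_reset = a.rank().corr(b.rank())

# Option 2: take the raw values of n2 and re-wrap them under n1's index.
n2_positional = pd.Series(n2.to_numpy(), index=n1.index)
rho_numpy = n1.rank().corr(n2_positional.rank())

print(rho_reset, rho_numpy)
```

Either way the labels of n2 are discarded on purpose, so this is only appropriate when the rows are known to correspond by position.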

Your point is taken, but I guess what got me was that all operations do appear to sort of align but in rather arbitrary ways, leaving things quite confusing.

eg:

n1 = pd.DataFrame([(1,2),(2,4),(3,5)], columns = ['ind', 'val']).set_index('ind')
n2 = pd.DataFrame([(1,2),(5,4),(3,5)], columns = ['ind', 'val']).set_index('ind')
display(n1+n2)
display(n1['val']+n2['val'])

If it were consistent with corr, then n1+n2 should be a new DF with only the agreed-upon indexes, but instead the missing val is NaN.

If corr followed the same logic consistently, it would give a NaN correlation. When I saw a seemingly valid correlation I assumed everything was correct, which of course was mistaken.
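The difference described here can be seen side by side in a small sketch using the same two frames. It assumes the default Pearson method for corr and is meant as an illustration of the observed behaviour rather than a statement about pandas internals: addition keeps every label from either side and fills the unmatched ones with NaN, while corr only uses the labels that both Series share.

```python
import pandas as pd

n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")

# Addition: labels 2 and 5 exist on one side only, so their sums are NaN.
total = n1["val"] + n2["val"]
print(total)

# Correlation: only labels 1 and 3 are shared, so just two pairs are compared.
shared = n1.index.intersection(n2.index)
r = n1["val"].corr(n2["val"])
print(list(shared), r)
```

With only two shared points the correlation comes out as a perfect 1.0, which looks valid even though a third of each Series was silently left out.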

A short note in the docs describing the assumptions made would at least help a bit.

@blazespinnaker
Author

blazespinnaker commented Nov 17, 2022

Some other examples of inconsistent behavior with data alignment. I appreciate the tradeoffs made and even why they might be optimal, but a little bit of extra documentation here and there would help educate users like me better I think.

#20831
#47554

@MarcoGorelli
Member

MarcoGorelli commented Nov 18, 2022

@blazespinnaker you might be interested in PDEP5, which may (mind you I said may, it's not yet been accepted, and it's still being ironed out) allow you to not need to think about alignment if you don't want to

Closing then, as I don't think there's anything actionable here. Regarding clarifying the docs, PRs to improve them are welcome, feel free to submit one: https://pandas.pydata.org/docs/dev/development/contributing.html
