ENH: ignore_index for Series corr #49617

blazespinnaker · 2022-11-10T15:02:37Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Sometimes you can get pretty strange results with simple operations on Series.corr(Series) because of index mismatches. An ignore_index would be very useful.

Feature Description

eg, s1.corr(s2, ignore_index = True)

Alternative Solutions

None

Additional Context

No response

phofl · 2022-11-16T11:15:12Z

Hi, thanks for your report. So the keyword should ignore the index during the operation, not in the result? In that case we need a different name, since ignore_index already has a different meaning in pandas. Can you show an example?

blazespinnaker · 2022-11-17T00:13:15Z

import scipy.stats

# n1 and n2 are the two Series printed in the output below
print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))

Output:

0    2317.0
1    2293.0
2    1190.0
3     972.0
4    1391.0
Name: r6000, dtype: float64
0.0    2317.0
1.0    2293.0
3.0    1190.0
4.0     972.0
5.0    1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999

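Here's an example of the issue: scipy compares the values by position and reports a correlation of 1.0, while Series.corr aligns on the index first and reports 0.8. Another approach might be to emit a warning, though that can get noisy. The benefit of a parameter that turns off the automatic alignment is that it also serves as a warning. A final approach would just be a note in the documentation.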
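A minimal sketch of the reproduction, for readers who want to run it. The values and index labels are copied from the printed output above; the Series names are omitted. To keep the sketch dependent on pandas only, it computes Spearman as the Pearson correlation of ranks after an explicit inner alignment, which is an assumption about how to mirror the result, not a claim about pandas internals.

```python
import pandas as pd

# Values and index labels copied from the printed output in this comment.
values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]
n1 = pd.Series(values)                                  # labels 0, 1, 2, 3, 4
n2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0])  # label 2 is missing, 5 is extra

# Keep only the labels present in both Series, as label alignment does.
left, right = n1.align(n2, join="inner")
print(pd.DataFrame({"n1": left, "n2": right}))

# Spearman correlation computed as Pearson correlation of the ranks.
by_label = left.rank().corr(right.rank())
print(by_label)  # approximately 0.8, matching the Series.corr result above
```

Only four pairs survive the alignment (labels 0, 1, 3 and 4), and the last two are offset by one position, which is why the result drops from 1.0 to 0.8.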

jreback · 2022-11-17T00:25:08Z

All operations in pandas align.
If you don't want alignment, then use .to_numpy().

-1 on changing anything - this would be a very special case
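A rough sketch of what the .to_numpy() suggestion could look like, assuming NumPy's corrcoef is an acceptable stand-in for the correlation call. The data is the same as in the example earlier in the thread; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]
s1 = pd.Series(values)
s2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0])

# Converting to NumPy arrays discards the index, so values pair up by position.
pearson_by_position = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

# Spearman is the Pearson correlation of the ranks.
spearman_by_position = np.corrcoef(
    s1.rank().to_numpy(), s2.rank().to_numpy()
)[0, 1]

print(pearson_by_position, spearman_by_position)  # both approximately 1.0
```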

blazespinnaker · 2022-11-17T21:02:36Z

I think reset_index() is actually the right answer here. to_numpy() would require scipy.
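A sketch of the reset_index() route, assuming drop=True so that the old labels are discarded and not turned into a column. The helper corr_by_position is hypothetical and not part of pandas; it only wraps the two reset_index calls and checks the lengths.

```python
import pandas as pd


def corr_by_position(a: pd.Series, b: pd.Series) -> float:
    """Hypothetical helper (not a pandas API): Pearson correlation by position."""
    if len(a) != len(b):
        raise ValueError("positional correlation needs Series of equal length")
    # drop=True replaces each index with 0..n-1, so alignment pairs by position.
    return a.reset_index(drop=True).corr(b.reset_index(drop=True))


values = [2317.0, 2293.0, 1190.0, 972.0, 1391.0]
n1 = pd.Series(values)
n2 = pd.Series(values, index=[0.0, 1.0, 3.0, 4.0, 5.0])

pearson = corr_by_position(n1, n2)
# Ranking first gives a Spearman correlation by position.
spearman = corr_by_position(n1.rank(), n2.rank())
print(pearson, spearman)  # both approximately 1.0
```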

Your point is taken, but I guess what got me was that all operations do appear to align, just in rather arbitrary ways, which leaves things quite confusing.

eg:

n1 = pd.DataFrame([(1,2),(2,4),(3,5)], columns = ['ind', 'val']).set_index('ind')
n2 = pd.DataFrame([(1,2),(5,4),(3,5)], columns = ['ind', 'val']).set_index('ind')
display(n1+n2)
display(n1['val']+n2['val'])

If it were consistent with corr, then n1+n2 would be a new DataFrame containing only the index labels present in both, but instead the unmatched labels get NaN.

If corr followed the same logic as addition, it would give a NaN correlation. When I saw a seemingly valid correlation I assumed everything was correct, which of course was a mistake.

A short note in the docs describing the assumptions made would at least help a bit.
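To make the contrast concrete, here is a small sketch using the two frames from this comment. It assumes the default Pearson method; the variable names total, r and n_pairs are illustrative.

```python
import pandas as pd

n1 = pd.DataFrame([(1, 2), (2, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")
n2 = pd.DataFrame([(1, 2), (5, 4), (3, 5)], columns=["ind", "val"]).set_index("ind")

# Addition keeps the union of the labels and fills unmatched ones with NaN.
total = n1["val"] + n2["val"]
print(total)

# corr silently uses only the labels present in both Series (1 and 3 here).
r = n1["val"].corr(n2["val"])
n_pairs = len(n1["val"].align(n2["val"], join="inner")[0])
print(r, n_pairs)
```

The sum has four rows, two of them NaN, while the correlation is computed from just two pairs and still looks like a valid number.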

blazespinnaker · 2022-11-17T21:16:07Z

Some other examples of inconsistent behavior with data alignment. I appreciate the tradeoffs made and even why they might be optimal, but a little bit of extra documentation here and there would help educate users like me better I think.

#20831
#47554

MarcoGorelli · 2022-11-18T19:22:58Z

@blazespinnaker you might be interested in PDEP5, which may (mind you I said may, it's not yet been accepted, and it's still being ironed out) allow you to not need to think about alignment if you don't want to

Closing then as I don't think there's anything actionable here - regarding clarifying docs, PRs to improve them are welcome, feel free to submit one https://pandas.pydata.org/docs/dev/development/contributing.html

blazespinnaker added the Enhancement and Needs Triage labels Nov 10, 2022

MarcoGorelli closed this as completed Nov 18, 2022

blazespinnaker mentioned this issue Nov 21, 2022

PDEP-5: NoRowIndex #49694

Merged
