DOC: better explain the automatic alignment process #49939

MarcoGorelli · 2022-11-28T08:52:55Z

Pandas version checks

I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

Throughout https://pandas.pydata.org/pandas-docs/stable/index.html

Documentation problem

I'd like to mention documentation again, for emphasis. As someone who found the automatic data alignment 'confusing', I think I offer a unique perspective on this problem (at least on this thread). For example, I'm currently dealing with numpy shape. When I looked at the documentation for np.shape, it would have been helpful if it had simply said "rows x columns" rather than just "array dimensions" (same number of characters!). Smirk if you wish, but minor details like these, repeated here and there, can go a long way toward helping folks new to these frameworks.

Originally reported by @blazespinnaker here

Suggested fix for documentation

For this problem, a quick note in different operations such as sum could say

*Note automatic data alignment: as with all pandas operations, automatic data alignment is performed. If sum does not find values with matching indices, then NaN will be used as the total.

An example would be even more awesome.

Note that my text is probably very poor, but ideally it would be written in a way to make alignment less confusing to new users.
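A minimal sketch of the kind of example such a note could carry. The data and variable names are made up for illustration; only the `+` operator and `Series.sum` are used.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on the union of the labels; a label present in only
# one operand yields NaN instead of raising an error.
total = s1 + s2
print(total)

# Reductions skip NaN by default, so the unmatched labels drop out silently.
print(total.sum())
```

Here "a" and "d" each appear in only one operand, so they come back as NaN, and the final sum only covers "b" and "c".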


jbrockmendel · 2022-11-28T23:44:42Z

I wouldn’t object to deprecating automatic alignment

MarcoGorelli · 2022-11-29T13:38:00Z

> I wouldn’t object to deprecating automatic alignment

Really? I hadn't even realised this was on the table. If so, then I love this idea.

Would the idea be:

- if indices are already aligned, proceed as per status quo
- if they're not aligned, then throw an error, advising users to call .align?

Like this, then advanced users who really use the power of indices just need to add an extra .align call, and beginner/intermediate users won't have surprises because of unexpected "magical" alignment under the hood. Similarly to #49946

This would also be similar to @blazespinnaker 's comment #49694 (comment) , except that instead of there being a global option to control this, users would get a loud and clear error. In which case, thanks @blazespinnaker , and I'm sorry for having said that your comment was off-topic (I still think it's better to keep this discussion separate from PDEP0005 though)

Example of where to get to:

>>> ser1 = pd.Series([1,2,3])
>>> ser2 = pd.Series([4, 1, 2], index=[0, 1, 3])
>>> ser1 + ser2
---
ValueError: Operands are not aligned. Do `left, right = left.align(right, axis=0, copy=False)` before operating.
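For illustration only, the proposed behaviour can be approximated today with a small wrapper. `strict_add` below is a hypothetical helper, not a pandas API, and the error text is shortened; the sketch assumes "aligned" means the two indexes are equal.

```python
import pandas as pd


def strict_add(left, right):
    """Hypothetical helper: add two Series only if their indexes already match."""
    if not left.index.equals(right.index):
        raise ValueError("Operands are not aligned. Call .align before operating.")
    return left + right


ser1 = pd.Series([1, 2, 3])
ser2 = pd.Series([4, 1, 2], index=[0, 1, 3])

# Mismatched indexes raise instead of silently producing NaN.
try:
    strict_add(ser1, ser2)
except ValueError as exc:
    print(exc)

# An explicit align (outer join by default) makes the NaN-filling step visible.
left, right = ser1.align(ser2)
result = strict_add(left, right)
print(result)
```

After the explicit `align`, both operands share the union index, so the addition goes through and the missing labels show up as NaN in a step the user asked for.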

MarcoGorelli · 2022-11-30T11:02:34Z

I guess it's time for another @pandas-dev/pandas-core @pandas-dev/pandas-triage tag ... before putting together another PDEP, anyone have any initial thoughts on deprecating automatic alignment?

I like the idea, as it would mean:

- simplifying the codebase
- fewer surprises for users
- advanced users relying on automatic alignment can just add an extra .align call and proceed as before

Dr-Irv · 2022-11-30T14:27:50Z

Related #47554

One thing to consider is that this would mess up doing chaining. For example, right now, you can do (s1 + s2).dropna(), but if you have to align first, I don't see how to do that in a chain.

Automatic alignment also has some advantages: when exploring data, it helps you identify missing data quite easily.

MarcoGorelli · 2022-11-30T14:56:17Z

For chaining, you could do

functools.reduce(lambda lhs, rhs: lhs + rhs, s1.align(s2)).dropna()

This is kinda advanced, perhaps, but I think it's only advanced users that intentionally rely on automated alignment anyway.

For identifying missing data, I think this would be even better - a loud and clear error immediately alerts you of missing data, whereas silent automated alignment introduces NaNs which propagate throughout subsequent operations and can go undetected
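A quick check, on made-up data, that the reduce-based chain gives the same result as the implicit version under current behaviour. The `pipe` variant is an additional assumption about how one might keep a chain going, not something proposed in this thread.

```python
import functools
import operator

import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
s2 = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# Current behaviour: alignment happens implicitly inside `+`.
implicit = (s1 + s2).dropna()

# Explicit alignment first, then add the two aligned Series.
explicit = functools.reduce(lambda lhs, rhs: lhs + rhs, s1.align(s2)).dropna()

# Same idea using .pipe so the expression still starts from s1.
piped = s1.pipe(lambda s: operator.add(*s.align(s2))).dropna()

print(implicit.equals(explicit), implicit.equals(piped))
```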

rhshadrach · 2022-12-01T02:55:52Z

I'm pretty negative on deprecating this. To me, auto-alignment is one of the key features of pandas that makes data wrangling significantly easier. I think having functools.reduce(lambda lhs, rhs: lhs + rhs, s1.align(s2)) as an alternative to s1 + s2 highlights this.

In my use, data at various levels of aggregation for products can be uniquely indexed by (location, product id, time). Various data is specified by a subset of these. For example, product description or size is by (product id, ) alone; a forecast model will have parameters that depend only on (location, product id); price of a product on (location, product id, time); manufacturer on (product id, time). By using these as a multiindex, one can combine various data sources naturally. For example:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]}).set_index(['a', 'b'])
df2 = pd.DataFrame({'a': [1, 2], 'd': [10, 100]}).set_index('a')
result = df1['c'] * df2['d']
print(result)
# a  b
# 1  3     60
#    4     70
# 2  5    800
# dtype: int64

This is natural and pleasant. Without this, each operation needs to first join to a temporary frame.
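To make the comparison concrete, here is a sketch of what the "join to a temporary frame" route might look like for the same data. The `reset_index`/`merge` approach and the name `tmp` are one possible way to do it, chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]}).set_index(["a", "b"])
df2 = pd.DataFrame({"a": [1, 2], "d": [10, 100]}).set_index("a")

# With automatic alignment: the shared level "a" is matched for us.
aligned = df1["c"] * df2["d"]

# Without it: build a temporary frame by merging on "a", then multiply columns.
tmp = df1.reset_index().merge(df2.reset_index(), on="a").set_index(["a", "b"])
joined = tmp["c"] * tmp["d"]

print(aligned.tolist(), joined.tolist())
```

Both routes give the same values; the second needs an intermediate frame for every such operation.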

MarcoGorelli · 2022-12-01T09:20:54Z

That's a nice example, thanks Richard!

An alternative suggested by Brock was to only deprecate auto-alignment in dunder operations. So in your example,

>>> df1['c'] * df2['d']
ValueError: Operands are not aligned, use align before operating or use .mul instead of *

would throw an error, but

df1['c'].mul(df2['d'])

would work as it currently does

This would retain the natural and pleasant functionality for advanced users, whilst ending up with fewer surprises for others
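For reference, under current behaviour the operator and the method are interchangeable, and the method additionally accepts `fill_value`. A small sketch on made-up data:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Today both spellings align on the index and give identical results.
via_operator = s1 * s2
via_method = s1.mul(s2)
print(via_operator.equals(via_method))

# The method form can also substitute a value for labels missing on one side.
filled = s1.mul(s2, fill_value=1)
print(filled)
```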

rhshadrach · 2022-12-04T18:17:17Z

Thanks @MarcoGorelli - while still having a way to use these dunders with alignment does make me less negative to this proposal, I still think there are issues. Please correct any of these if they are wrong!

- There would be no "natural" equivalent to df["a"] + df["b"] when a user wants to perform the operation while method chaining
- There would be an inconsistent behavior between df.__add__ and df.add.
- By only having alignment via .add et al, the user is forced into functional syntax if they want alignment. This can be quite unreadable for even moderately complex formulae.
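As a small illustration of the readability point, here is the same formula written with operators and with methods. The frame and the formula are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0], "d": [7.0, 8.0]})

# Infix operators read like the underlying formula: (a + b) * c - d.
infix = (df["a"] + df["b"]) * df["c"] - df["d"]

# The method-only spelling computes the same thing but hides the structure.
functional = df["a"].add(df["b"]).mul(df["c"]).sub(df["d"])

print(infix.equals(functional))
```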

More generally, it seems to me the main motivation behind this proposal is that it would make pandas easier for new users. If there are other motivations (@jbrockmendel - I'm curious what your motivation is in particular), I think they would be good to identify. While I'm all for making pandas easier for new users, I do not think we should be doing so at the expense of expert usage.

jbrockmendel · 2022-12-05T18:09:54Z

> (@jbrockmendel - I'm curious what your motivation is in particular)

I am not actively advocating the idea. I suggested it as an alternative to the NoIndex-mode given that the motivation seemed to be "automatic alignment is a major pain point".

blazespinnaker · 2022-12-28T13:11:40Z

A place where documentation could be very helpful is in auto alignment and correlation. The results can be very surprising at their confidence even though what you're doing makes no sense at all.
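A sketch of how this can surprise, using made-up data: `Series.corr` pairs values by index label, so two Series that look perfectly anti-correlated by position can come out perfectly correlated.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
# Same values reversed, but the labels are reversed as well.
s2 = pd.Series([4, 3, 2, 1], index=[3, 2, 1, 0])

# Series.corr aligns on the index first, so label 0 is paired with label 0, etc.
by_label = s1.corr(s2)

# Ignoring the index and pairing by position gives the opposite answer.
by_position = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

print(by_label, by_position)
```

If positional pairing is what you want, dropping to NumPy arrays (or resetting the index on both sides) avoids the alignment step.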

MarcoGorelli · 2022-12-28T13:53:43Z

Thanks @rhshadrach , those are some valid points

You're right about what my main motivation is

Maybe we just need to document this more clearly, with a visible note in every operation which aligns redirecting to some page in the user guide

I've updated this to be a docs issue. If anyone would like to work on it, please do comment, happy to help out, it would be good to get better docs on this one

EltarrLok · 2023-04-01T22:37:36Z

take

rsm-23 · 2023-06-30T16:00:08Z

Hey @MarcoGorelli I can take this up. I would need some guidance though :)

MarcoGorelli · 2023-07-02T12:20:36Z

nice, thanks! I think all that's needed are some notes saying that in general, pandas operations align on the index, perhaps starting with .corr (which was part of the motivation for the issue)

rsm-23 · 2023-07-02T14:28:47Z

take

rsm-23 · 2023-07-02T14:52:59Z

@MarcoGorelli once this looks fine, I can create PRs for other functions.

rsm-23 · 2023-07-03T08:37:23Z

@MarcoGorelli please review the PR when you get time.

MarcoGorelli added the Docs and Needs Triage labels Nov 28, 2022

MarcoGorelli mentioned this issue Nov 28, 2022

PDEP-5: NoRowIndex #49694

Merged

MarcoGorelli changed the title ~~DOC: better explain the alignment process~~ API/DOC: better explain (or deprecate?) the automatic alignment process Nov 29, 2022

MarcoGorelli removed the Needs Triage label Dec 28, 2022

MarcoGorelli changed the title ~~API/DOC: better explain (or deprecate?) the automatic alignment process~~ DOC: better explain the automatic alignment process Mar 31, 2023

MarcoGorelli added the good first issue label Mar 31, 2023

github-actions bot assigned EltarrLok Apr 1, 2023

github-actions bot assigned rsm-23 Jul 2, 2023

rsm-23 mentioned this issue Jul 2, 2023

DOC: Added note for corr #53972

Merged


MarcoGorelli closed this as completed in #53972 Jul 5, 2023
