DOC: better explain the automatic alignment process #49939

Closed
1 task done
MarcoGorelli opened this issue Nov 28, 2022 · 17 comments · Fixed by #53972
Comments

@MarcoGorelli
Member

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

Throughout https://pandas.pydata.org/pandas-docs/stable/index.html

Documentation problem

I'd like to mention documentation again, for emphasis. As someone who found the automatic data alignment 'confusing', I think I offer a unique perspective on this problem (at least on this thread). For example, I'm currently dealing with numpy shape. When I looked at the documentation for np.shape, it would have been helpful if it had simply said "rows × columns" rather than just "array dimensions" (same number of characters!). Smirk if you wish, but minor details like these, repeated here and there, can go a long way toward helping folks new to these frameworks.

Originally reported by @blazespinnaker here

Suggested fix for documentation

For this problem, a quick note in different operations such as sum could say

*Note on automatic data alignment: as with all pandas operations, automatic data alignment is performed. If sum does not find values with matching indices, then NaN will be used as the total.

An example would be even more awesome.

Note that my text is probably very poor, but ideally it would be written in a way to make alignment less confusing to new users.
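
A minimal sketch of the kind of example asked for above, assuming two Series whose indexes only partly overlap. The names `ser1` and `ser2` and their values are made up for illustration and are not taken from the pandas docs:

```python
import pandas as pd

# Two Series that share only the labels "b" and "c" (illustrative data).
ser1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by index label, not by position.
total = ser1 + ser2

# Labels present in only one operand ("a" and "d") come out as NaN.
print(total)
```

The result is indexed by the union of both indexes, so a NaN in the output often just means "this label was missing on one side".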

Originally reported by @blazespinnaker here

@MarcoGorelli added the Docs and Needs Triage labels Nov 28, 2022
@jbrockmendel
Member

I wouldn’t object to deprecating automatic alignment

@MarcoGorelli changed the title from "DOC: better explain the alignment process" to "API/DOC: better explain (or deprecate?) the automatic alignment process" Nov 29, 2022
@MarcoGorelli
Member Author

I wouldn’t object to deprecating automatic alignment

Really? I hadn't even realised this was on the table. If so, then I love this idea.

Would the idea be:

  • if indices are already aligned, proceed as per status quo
  • if they're not aligned, then throw an error, advising users to call .align?

This way, advanced users who really use the power of indices just need to add an extra .align call, and beginner/intermediate users won't get surprises from unexpected "magical" alignment under the hood. This is similar to #49946.

This would also be similar to @blazespinnaker 's comment #49694 (comment) , except that instead of there being a global option to control this, users would get a loud and clear error. In which case, thanks @blazespinnaker , and I'm sorry for having said that your comment was off-topic (I still think it's better to keep this discussion separate from PDEP0005 though)

Example of where to get to:

>>> ser1 = pd.Series([1,2,3])
>>> ser2 = pd.Series([4, 1, 2], index=[0, 1, 3])
>>> ser1 + ser2
---
ValueError: Operands are not aligned. Do `left, right = left.align(right, axis=0, copy=False)` before operating.
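
For comparison, here is a small sketch of what the same two Series do today, and what the explicit `.align` step would look like. The `ValueError` above is only a proposal; the code below relies solely on existing behaviour, and the variable names `current`, `left`, `right` and `explicit` are illustrative:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3])
ser2 = pd.Series([4, 1, 2], index=[0, 1, 3])

# Today this aligns silently: labels 2 and 3 exist on one side only,
# so the corresponding results are NaN.
current = ser1 + ser2

# Explicit alignment: both outputs are reindexed to the union of the
# two indexes, with NaN where a label was missing.
left, right = ser1.align(ser2)
explicit = left + right

print(current)
```

Both routes give the same result, so an explicit `.align` call only makes the existing implicit step visible.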

@MarcoGorelli
Member Author

I guess it's time for another @pandas-dev/pandas-core @pandas-dev/pandas-triage tag ... before putting together another PDEP, anyone have any initial thoughts on deprecating automatic alignment?

I like the idea, as it would mean:

  • simplifying the codebase
  • fewer surprises for users
  • advanced users relying on automatic alignment can just add an extra .align call and proceed as before

@Dr-Irv
Contributor

Dr-Irv commented Nov 30, 2022

Related #47554

One thing to consider is that this would mess up doing chaining. For example, right now, you can do (s1 + s2).dropna(), but if you have to align first, I don't see how to do that in a chain.

Automatic alignment also has some advantages: when exploring data, it helps you identify missing data quite easily.

@MarcoGorelli
Member Author

MarcoGorelli commented Nov 30, 2022

For chaining, you could do

functools.reduce(lambda lhs, rhs: lhs + rhs, s1.align(s2)).dropna()

This is kinda advanced, perhaps, but I think it's only advanced users that intentionally rely on automated alignment anyway.

For identifying missing data, I think this would be even better - a loud and clear error immediately alerts you of missing data, whereas silent automated alignment introduces NaNs which propagate throughout subsequent operations and can go undetected
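
To make the "loud and clear error" idea concrete, here is a rough sketch of a user-level helper that refuses to operate on misaligned operands. `strict_add` is a hypothetical function written for this illustration; it is not part of pandas and not a proposed API:

```python
import pandas as pd


def strict_add(left: pd.Series, right: pd.Series) -> pd.Series:
    """Hypothetical helper: add two Series only if their indexes match."""
    if not left.index.equals(right.index):
        raise ValueError("Operands are not aligned; call .align first.")
    return left + right


s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 1, 2], index=[0, 1, 3])

# Misaligned operands fail immediately instead of producing NaN.
try:
    strict_add(s1, s2)
    raised = False
except ValueError:
    raised = True

# After an explicit align, the helper still fits into a method chain.
aligned_left, aligned_right = s1.align(s2)
result = aligned_left.pipe(strict_add, aligned_right).dropna()
print(result)
```

Using `.pipe` is one possible way to keep the chain readable without `functools.reduce`, at the cost of doing the align step beforehand.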

@rhshadrach
Member

rhshadrach commented Dec 1, 2022

I'm pretty negative on deprecating this; to me, auto-alignment is one of the key features of pandas that makes data wrangling significantly easier. I think having functools.reduce(lambda lhs, rhs: lhs + rhs, s1.align(s2)) as an alternative to s1 + s2 highlights this.

In my use, data at various levels of aggregation for products can be uniquely indexed by (location, product id, time). Various data is specified by a subset of these. For example, product description or size is by (product id, ) alone; a forecast model will have parameters that depend only on (location, product id); price of a product on (location, product id, time); manufacturer on (product id, time). By using these as a multiindex, one can combine various data sources naturally. For example:

df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]}).set_index(['a', 'b'])
df2 = pd.DataFrame({'a': [1, 2], 'd': [10, 100]}).set_index('a')
result = df1['c'] * df2['d']
print(result)
# a  b
# 1  3     60
#    4     70
# 2  5    800
# dtype: int64

This is natural and pleasant. Without this, each operation needs to first join to a temporary frame.
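
For contrast, a sketch of what the "join to a temporary frame" route could look like for the same data, assuming the keys are kept as ordinary columns until after the merge. The name `tmp` is illustrative:

```python
import pandas as pd

# Same data as above, but with the keys left as regular columns.
df1 = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2], "d": [10, 100]})

# Join on the shared key "a" into a temporary frame, then restore the index.
tmp = df1.merge(df2, on="a", how="left").set_index(["a", "b"])

# Multiply the two columns of the temporary frame.
result = tmp["c"] * tmp["d"]
print(result)
```

This produces the same products (60, 70, 800), but every such operation needs its own merge step.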

@MarcoGorelli
Member Author

That's a nice example, thanks Richard!

An alternative suggested by Brock was to only deprecate auto-alignment in dunder operations. So in your example,

>>> df1['c'] * df2['d']
ValueError: Operands are not aligned, use align before operating or use .mul instead of *

would throw an error, but

df1['c'].mul(df2['d'])

would work as it currently does

This would retain the natural and pleasant functionality for advanced users, whilst ending up with fewer surprises for others
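
As a point of reference, a short sketch of how the operator and the method form behave today (both align). It uses two simple Series rather than the MultiIndex example, and the variable names are illustrative:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3])
ser2 = pd.Series([4, 1, 2], index=[0, 1, 3])

# Currently the dunder operator and the named method give the same result.
via_operator = ser1 * ser2
via_method = ser1.mul(ser2)

# The method form also accepts fill_value, which stands in for a label
# that is missing on one side before the multiplication is applied.
filled = ser1.mul(ser2, fill_value=1)
print(filled)
```

Under the alternative described above, only the first line would change behaviour; `.mul` would keep aligning as it does now.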

@rhshadrach
Member

Thanks @MarcoGorelli - while still having a way to use these dunders with alignment does make me less negative to this proposal, I still think there are issues. Please correct any of these if they are wrong!

  • There would be no "natural" equivalent to df["a"] + df["b"] when a user wants to do the operation while method chaining
  • There would be an inconsistent behavior between df.__add__ and df.add.
  • By only having alignment via .add et al, the user is forced into functional syntax if they want alignment. This can be quite unreadable for even moderately complex formulae.
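
To illustrate the readability point in the last bullet, here is a small sketch that writes one made-up formula both ways. The Series `a`, `b`, `c` and the formula itself are purely illustrative:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([4.0, 5.0, 6.0])
c = pd.Series([2.0, 2.0, 2.0])

# Operator syntax reads like the underlying formula.
with_operators = (a + b) * c - a / b

# The same computation using only the named methods.
with_methods = a.add(b).mul(c).sub(a.div(b))

print(with_operators.equals(with_methods))
```

Both lines compute the same values; the difference is only in how easy the formula is to read and check.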

More generally, it seems to me the main motivation behind this proposal is that it would make pandas easier for new users. If there are other motivations (@jbrockmendel - I'm curious what your motivation is in particular), I think they would be good to identify. While I'm all for making pandas easier for new users, I do not think we should be doing so at the expense of expert usage.

@jbrockmendel
Member

(@jbrockmendel - I'm curious what your motivation is in particular)

I am not actively advocating the idea. I suggested it as an alternative to the NoIndex-mode given that the motivation seemed to be "automatic alignment is a major pain point".

@blazespinnaker

A place where documentation could be very helpful is in auto alignment and correlation. The results can look surprisingly confident even though what you're doing makes no sense at all.
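
A minimal sketch of how alignment can change a correlation, assuming two Series that hold the same values but carry their index labels in opposite order. The data is made up for illustration:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
# Same values, but the index labels run in the opposite direction.
s2 = pd.Series([1, 2, 3, 4], index=[3, 2, 1, 0])

# Series.corr matches values by index label before computing.
aligned_corr = s1.corr(s2)

# Working on the raw arrays compares values by position instead.
positional_corr = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

print(aligned_corr, positional_corr)
```

The label-aligned correlation is -1 while the positional one is +1, and neither call gives any hint that the two Series were paired up differently.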

@MarcoGorelli
Member Author

MarcoGorelli commented Dec 28, 2022

Thanks @rhshadrach , those are some valid points

You're right about what my main motivation is

Maybe we just need to document this more clearly, with a visible note in every operation which aligns redirecting to some page in the user guide


I've updated this to be a docs issue. If anyone would like to work on it, please do comment; I'm happy to help out. It would be good to get better docs on this one.

@MarcoGorelli removed the Needs Triage label Dec 28, 2022
@MarcoGorelli changed the title from "API/DOC: better explain (or deprecate?) the automatic alignment process" to "DOC: better explain the automatic alignment process" Mar 31, 2023
@EltarrLok

take

@rsm-23
Contributor

rsm-23 commented Jun 30, 2023

Hey @MarcoGorelli I can take this up. I would need some guidance though :)

@MarcoGorelli
Member Author

nice, thanks! I think all that's needed are some notes saying that in general, pandas operations align on the index, perhaps starting with .corr (which was part of the motivation for the issue)

@rsm-23
Contributor

rsm-23 commented Jul 2, 2023

take

@rsm-23
Contributor

rsm-23 commented Jul 2, 2023

@MarcoGorelli once this looks fine, I can create PRs for other functions.

@rsm-23
Contributor

rsm-23 commented Jul 3, 2023

@MarcoGorelli please review the PR when you get time.

7 participants