-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add replace
method to Index
#32542
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oguzhanogreden Thank you so much for taking this!
Overall this looks really good!
Some minor comments :)
Also a question, can any of the new added tests be parameteraize? |
Thanks @MomIsBestFriend, helpful indeed. I'll be waiting for a green flag from the core before I turn to docstrings and type annotations. Mainly since they take a lot of time... 😇 namely (1) figuring out the mechanics of docstrings for different subclasses and (2) correct types for type annotations. |
@oguzhanogreden ATM you can ignore the CI failure of Just be sure to merge master every now and then :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check if your implementation is faster than:
import pandas
def replace(self, *args, **kwargs):
return pandas.Index(self.to_series().replace(*args, **kwargs))
pandas.Index.replace = replace
idx = pandas.Index([1, 2, 3, 4, 5, 6])
idx.replace(3, 9)
Are there cases when converting to Series
is a problem?
Seems like a huge complexity added that we can hopefully avoid.
Regarding the speed point, this seems faster. At the very least, faster more often. Be warned I'm not on top of what's going on # Running on a MacBook Pro from 2017, under some use load
# (browsers with many tabs are open etc.)
import numpy as np
import pandas as pd
idx = pd.Index([1, 2, 3, 4, 5, 6])
%timeit -n 10000 -r 50 idx.replace(3, 9)
# 69.9 µs ± 6.67 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)
def replace(self, *args, **kwargs):
return pd.Index(self.to_series().replace(*args, **kwargs))
pd.Index.replace = replace
idx = pd.Index([1, 2, 3, 4, 5, 6])
%timeit -n 10000 -r 50 idx.replace(3, 9)
# 348 µs ± 102 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)
# ^^^ couldn't get higher precision here, also not a much lower point estimate Happy to move this to |
Regarding the complexity point - agreed. Though I think there's some benefit to having these methods available. I found myself in a situation where I "just wanted to be able to do what I know from other parts of the pandas API". It's up to you folks to judge that against maintenance burden, ultimately. I'd be interested in a contributing to a refactor around these methods. However I can imagine it'll be pretty tricky to abstract things away. Considering the nitty-gritty involved... I'll survey properly whether i Finally, I just added a few |
Just to be clear, my proposal was not that you use that code (with the So, you'll still get the same functionality in pandas |
Thanks. That was clear. I made the point about complexity since it's not clear to me how to weigh the performance difference against growing the code base so much :) I suggest we follow up once I do the following:
|
I think we'd be much better off with
|
@oguzhanogreden do you have time to merge in master and adapt the implementation @jbrockmendel proposed? Would be beneficial to have a shared implementation to reduce code maintenance. |
Sorry, I didn't get to prioritize this for a while. I'll try to get to it in a few days, with the suggested solution. |
): | ||
raise NotImplementedError( | ||
"Replacing values of a CategoricalIndex is not supported." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind me why we need to disable this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise it hits #37899.
edit: That is, it hits that issue after the proposed changes due to inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So some cases will hit the same bug that affects Series.replace and other cases will work fine (assuming we don't raise here)? If so, then we're better off matching Series behavior and allowing the subset of cases that do work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disabled regex=
due to #38447 . Rest is enabled and added tests are added.
pandas/core/shared_docs.py
Outdated
|
||
>>> df = pd.DataFrame({{'A': [True, False, True], | ||
... 'B': [False, True, False]}}) | ||
>>> df.replace({{'a string': 'new value', True: False}}) # raises |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the failing doctests.
L578 doesn't raise an error anymore. The behavior seems to have changed in upstream master
as well, I'll change the docstring and add a test case to fix behavior prospectively.
(I can't replicate the build errors locally.) |
@@ -655,3 +655,24 @@ def _delegate_method(self, name: str, *args, **kwargs): | |||
if is_scalar(res): | |||
return res | |||
return CategoricalIndex(res, name=self.name) | |||
|
|||
def replace( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not adding @doc
here since it will require a lot of conditionals due to regex being disabled. We can add it after replace()
is sorted out.
@oguzhanogreden can you do a pre-cursor PR that moves things and doesn't change anything (except for imports and so on), no functionailty at all. Its almost impossible to tell what you are changing otherwise. |
Now that #38561 is done, what's going here is a bit clearer (@jreback). Failing test is also failing on master, not related to changes here. Let me try to summarise to help with review:
Thanks for your patience here! It took too long in part due to my wrong initial approach (huge PR) and in part due to my absence from last March. Looking forward to getting this out of the door and paying back the investment ;) |
ok can you merge master and will look |
This was looking pretty close but looks like this PR has gotten stale. Going to close but if interested in continuing please ping, merge master, and target this PR for the next release (1.3 as of writing this comment). |
replace
method toIndex
objects #19495black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
ENH:
commit messageAdded a replace method to Index classes, as well as tests.