Skip to content

ENH: Add replace method to Index #32542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 54 commits into from
Closed

ENH: Add replace method to Index #32542

wants to merge 54 commits into from

Conversation

oguzhanogreden
Copy link
Contributor

@oguzhanogreden oguzhanogreden commented Mar 8, 2020

Added a replace method to Index classes, as well as tests.

@oguzhanogreden oguzhanogreden changed the title Index replace ENH: Add replace method to Index Mar 8, 2020
Copy link
Member

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oguzhanogreden Thank you so much for taking this!

Overall this looks really good!


Some minor comments :)

@ShaharNaveh
Copy link
Member

Also a question, can any of the new added tests be parameteraize?

@oguzhanogreden
Copy link
Contributor Author

Thanks @MomIsBestFriend, helpful indeed. I'll be waiting for a green flag from the core before I turn to docstrings and type annotations. Mainly since they take a lot of time... 😇 namely (1) figuring out the mechanics of docstrings for different subclasses and (2) correct types for type annotations.

@ShaharNaveh
Copy link
Member

@oguzhanogreden ATM you can ignore the CI failure of CI/Web And Docs, we haven't yet resolved this issue. so you can ignore it. (for now at least)

Just be sure to merge master every now and then :)

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check if your implementation is faster than:

import pandas

def replace(self, *args, **kwargs):
    return pandas.Index(self.to_series().replace(*args, **kwargs))
pandas.Index.replace = replace

idx = pandas.Index([1, 2, 3, 4, 5, 6])
idx.replace(3, 9)

Are there cases when converting to Series is a problem?

Seems like a huge complexity added that we can hopefully avoid.

@datapythonista datapythonista added Enhancement Index Related to the Index class or subclasses labels Mar 10, 2020
@oguzhanogreden
Copy link
Contributor Author

Regarding the speed point, this seems faster. At the very least, faster more often. Be warned I'm not on top of what's going on %timeit mechanics. I chose the arguments below since they seemed reasonable from a measurement perspective. If there's a gotcha about performance, do let me know:

# Running on a MacBook Pro from 2017, under some use load
# (browsers with many tabs are open etc.)

import numpy as np

import pandas as pd

idx = pd.Index([1, 2, 3, 4, 5, 6])
%timeit -n 10000 -r 50 idx.replace(3, 9)

# 69.9 µs ± 6.67 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)

def replace(self, *args, **kwargs):
    return pd.Index(self.to_series().replace(*args, **kwargs))
pd.Index.replace = replace

idx = pd.Index([1, 2, 3, 4, 5, 6])
%timeit -n 10000 -r 50 idx.replace(3, 9)

# 348 µs ± 102 µs per loop (mean ± std. dev. of 50 runs, 10000 loops each)
# ^^^ couldn't get higher precision here, also not a much lower point estimate

Happy to move this to asv if the output is considerably more reliable.

@oguzhanogreden
Copy link
Contributor Author

Regarding the complexity point - agreed. Though I think there's some benefit to having these methods available. I found myself in a situation where I "just wanted to be able to do what I know from other parts of the pandas API". It's up to you folks to judge that against maintenance burden, ultimately.

I'd be interested in a contributing to a refactor around these methods. However I can imagine it'll be pretty tricky to abstract things away. Considering the nitty-gritty involved...

I'll survey properly whether i .to_series() fails to solve anything in the coming days.

Finally, I just added a few NotImplementedErrors where this method, as it stands, shouldn't return anything. Eventually I'm happy to address those, at a cost of some more complexity, if the overall judgement is these are useful to have upstream.

@datapythonista
Copy link
Member

Just to be clear, my proposal was not that you use that code (with the to_series()) in your own code. But that the implementation in this PR use it instead of all the code you're adding.

So, you'll still get the same functionality in pandas Index.replace(), but instead of adding 300 lines of code here, you'll be adding 30. But seems like the difference in performance is quite big.

@oguzhanogreden
Copy link
Contributor Author

Thanks. That was clear. I made the point about complexity since it's not clear to me how to weigh the performance difference against growing the code base so much :)

I suggest we follow up once I do the following:

  • check if to_series() can handle most of the requirements,
  • try and identify the cause of the performance issue, see if I can address and wrap .to_series() to create a solution,

@jbrockmendel
Copy link
Member

I think we'd be much better off with

def replace(self, whatever):
    result = self.to_series().replace(whatever)
    return Index(result)

@mroeschke
Copy link
Member

@oguzhanogreden do you have time to merge in master and adapt the implementation @jbrockmendel proposed? Would be beneficial to have a shared implementation to reduce code maintenance.

@oguzhanogreden
Copy link
Contributor Author

oguzhanogreden commented May 26, 2020

Sorry, I didn't get to prioritize this for a while. I'll try to get to it in a few days, with the suggested solution.

):
raise NotImplementedError(
"Replacing values of a CategoricalIndex is not supported."
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remind me why we need to disable this?

Copy link
Contributor Author

@oguzhanogreden oguzhanogreden Nov 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise it hits #37899.

edit: That is, it hits that issue after the proposed changes due to inheritance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So some cases will hit the same bug that affects Series.replace and other cases will work fine (assuming we don't raise here)? If so, then we're better off matching Series behavior and allowing the subset of cases that do work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled regex= due to #38447 . Rest is enabled and added tests are added.


>>> df = pd.DataFrame({{'A': [True, False, True],
... 'B': [False, True, False]}})
>>> df.replace({{'a string': 'new value', True: False}}) # raises
Copy link
Contributor Author

@oguzhanogreden oguzhanogreden Nov 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the failing doctests.

L578 doesn't raise an error anymore. The behavior seems to have changed in upstream master as well, I'll change the docstring and add a test case to fix behavior prospectively.

@oguzhanogreden
Copy link
Contributor Author

(I can't replicate the build errors locally.)

@@ -655,3 +655,24 @@ def _delegate_method(self, name: str, *args, **kwargs):
if is_scalar(res):
return res
return CategoricalIndex(res, name=self.name)

def replace(
Copy link
Contributor Author

@oguzhanogreden oguzhanogreden Dec 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not adding @doc here since it will require a lot of conditionals due to regex being disabled. We can add it after replace() is sorted out.

@jreback
Copy link
Contributor

jreback commented Dec 13, 2020

@oguzhanogreden can you do a pre-cursor PR that moves things and doesn't change anything (except for imports and so on), no functionailty at all. Its almost impossible to tell what you are changing otherwise.

@oguzhanogreden
Copy link
Contributor Author

oguzhanogreden commented Jan 16, 2021

Now that #38561 is done, what's going here is a bit clearer (@jreback). Failing test is also failing on master, not related to changes here.

Let me try to summarise to help with review:

  1. The key change is the addition of .replace() method to Index classes.
  2. pd.Categorical does not allow regex replaces (see ENH: regex replace capacities are missing from pd.Categorical.replace() #38447), therefore we're disabling it.
  3. Due to the same reason, I have not added documentation to CategoricalIndex. This helps keep shared docs simple (i.e. without a conditional for regex stuff). Once the issues with categorical are solved, we can just add the decorator and it's good to go. I'll make an issue and reference it where needed, once this is merged.

Thanks for your patience here! It took too long in part due to my wrong initial approach (huge PR) and in part due to my absence from last March. Looking forward to getting this out of the door and paying back the investment ;)

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

ok can you merge master and will look

@mroeschke
Copy link
Member

This was looking pretty close but looks like this PR has gotten stale. Going to close but if interested in continuing please ping, merge master, and target this PR for the next release (1.3 as of writing this comment).

@mroeschke mroeschke closed this Apr 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Index Related to the Index class or subclasses replace replace method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add replace method to Index objects
7 participants