Skip to content

Refactor DataFrame.replace to dispatch on types #5541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nspies opened this issue Nov 18, 2013 · 10 comments
Closed

Refactor DataFrame.replace to dispatch on types #5541

nspies opened this issue Nov 18, 2013 · 10 comments
Assignees
Labels

Comments

@nspies
Copy link
Contributor

nspies commented Nov 18, 2013

I'm having trouble figuring out how DataFrame.replace() is supposed to work. I'm not sure if this is a bug or a documentation issue.

In [1]: import pandas

In [2]: df = pandas.DataFrame({"col1":range(5), "col2":[0.5]*3+[1.0]*2})

In [3]: df
Out[3]: 
   col1  col2
0     0   0.5
1     1   0.5
2     2   0.5
3     3   1.0
4     4   1.0

In [4]: df.replace(1.0, "a")
Out[4]: 
  col1 col2
0    0  0.5
1    a  0.5
2    2  0.5
3    3    a
4    4    a

In [5]: df.replace(1.0, "a").replace(0.5, "b")
Out[5]: 
  col1 col2
0    0    b
1    a    b
2    2    b
3    3    a
4    4    a

So far, so good, everything makes sense. But I would have expected this to accomplish the same as above:


In [6]: df.replace({1.0:"a", 0.5:"b"})
Out[6]: 
  col1 col2
0    b    b
1    a    a
2    2    b
3    3    a
4    4    b

As you can see, I'm getting alternating "b" and "a". From a quick browse of the source code, it seems that the dictionary-replacement option should result in the same outcome as the following (which gives what I would have expected):

In [15]: df.replace([1.0, 0.5], ["a", "b"])
Out[15]: 
  col1 col2
a    0    b
b    a    b
c    2    b
d    3    a
e    4    a

I'm not sure what the to_replace=dict option is supposed to be doing but (at least for pandas v 0.12.0) it isn't doing what I would have expected.

Whether this is a bug or not, the df.replace() method needs better documentation. It's not enough to include a disclaimer that "This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works."

@cpcloud
Copy link
Member

cpcloud commented Nov 18, 2013

What version of pandas are you using? I believe this is fixed in master.

On Monday, November 18, 2013, Noah wrote:

I'm having trouble figuring out how DataFrame.replace() is supposed to
work. I'm not sure if this is a bug or a documentation issue.

In [1]: import pandas

In [2]: df = pandas.DataFrame({"col1":range(5), "col2":[0.5]_3+[1.0]_2})

In [3]: df
Out[3]:
col1 col2
0 0 0.5
1 1 0.5
2 2 0.5
3 3 1.0
4 4 1.0

In [4]: df.replace(1.0, "a")
Out[4]:
col1 col2
0 0 0.5
1 a 0.5
2 2 0.5
3 3 a
4 4 a

In [5]: df.replace(1.0, "a").replace(0.5, "b")
Out[5]:
col1 col2
0 0 b
1 a b
2 2 b
3 3 a
4 4 a

So far, so good, everything makes sense. But I would have expected this to
accomplish the same as above:

In [6]: df.replace({1.0:"a", 0.5:"b"})
Out[6]:
col1 col2
0 b b
1 a a
2 2 b
3 3 a
4 4 b

As you can see, I'm getting alternating "b" and "a". From a quick browse
of the source code, it seems that the dictionary-replacement option should
result in the same outcome as the following (which gives what I would have
expected):

In [15]: df.replace([1.0, 0.5], ["a", "b"])
Out[15]:
col1 col2
a 0 b
b a b
c 2 b
d 3 a
e 4 a

I'm not sure what the to_replace=dict option is supposed to be doing but
(at least for pandas v 0.12.0) it isn't doing what I would have expected.

Whether this is a bug or not, the df.replace() method needs better
documentation. It's not enough to include a disclaimer that "This method
has a lot of options. You are encouraged to experiment and play with this
method to gain intuition about how it works."


Reply to this email directly or view it on GitHubhttps://github.com//issues/5541
.

Best,
Phillip Cloud

@nspies
Copy link
Contributor Author

nspies commented Nov 18, 2013

You are correct. Let me amend the above then to regard simply the documentation of what is clearly a really complicated replace interface. There should at least be examples of the different types of syntax (I couldn't find any anywhere in the docs).

@jankatins
Copy link
Contributor

See also #5338

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

@cpcloud ok...not sure what is to be done here

@nspies would you care to provide a doc update as a PR seeing that their is some confusion

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@cpcloud
Copy link
Member

cpcloud commented Apr 6, 2014

Yes ... the replace API is quite crowded. It's extremely flexible, but that flexibility seems to cause confusion among users (me too) and it's (sometimes) a PITA to debug (though debugging could be made easier by an internal refactor with no API changes).

@jreback maybe we should refactor this API a bit. I think replace tries to do too much. I think it might actually explode very soon. Maybe something like

  • replace: scalar replacement
  • replace_list: replaces anything where to_replace or value is a list
  • replace_mapping: likewise for dicts/Series
  • replace_regex: replaces only regexes, alternatively allow a regex argument in the others, similar to how it works now.

While this does add more names to the namespace, it's less of a strain on the working memory since you can look at some code with this API and say "oh, this is replacement using a [dict, list, etc]" rather than tracing all the way back up the stack to discover that the types weren't what you thought.

@cpcloud
Copy link
Member

cpcloud commented Apr 6, 2014

This is a breaking API change, so it may have to wait until 1.0 or whatever the first breaking API release is.

@jreback
Copy link
Contributor

jreback commented Apr 6, 2014

@cpcloud nice suggestions

maybe could do instead use a an object for replace

e.g.

  • df.replace(...)
  • df.replace.list(...)
  • df.replace.mapping(...)
  • df.replace.regex(...)

in theory this would also allow to dispatch in a backwards compat way, e.g. the __call__ of the replace object could just be current replace with dispatching (and eventually deprecate it to only scalars) (you can do this compat dispatching whether you do it with an object based replace or not I think)

@cpcloud
Copy link
Member

cpcloud commented Apr 6, 2014

Nice. I like the object based idea. Gives time for users to adapt.

@cpcloud cpcloud added API Design and removed Docs labels Apr 6, 2014
@nspies
Copy link
Contributor Author

nspies commented Apr 7, 2014

Yes, I approve of some refactoring here! The current method is too powerful for its own good (although the documentation has improved substantially since I filed this ticket).

I haven't done any contributions to the project as of yet, but I'm happy to do at least the documentation parts if someone refactors the code into bits that I can understand.

On Apr 6, 2014, at 8:33 AM, Phillip Cloud [email protected] wrote:

Nice. I like the object based idea. Gives time for users to adapt.


Reply to this email directly or view it on GitHub.

@cpcloud cpcloud self-assigned this Jun 1, 2014
@cpcloud cpcloud changed the title What does DataFrame.replace() do? Refactor DataFrame.replace to dispatch on types Jun 7, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@jbrockmendel jbrockmendel added the replace replace method label Sep 17, 2020
@mroeschke mroeschke added Enhancement and removed API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 11, 2021
@mroeschke mroeschke removed the Refactor Internal refactoring of code label Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Copy link
Member

This looks to work correctly on main now and I believe we have tests for this so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants