DataFrame.replace(dict) has weird behaviour in some cases #5338

Closed
jankatins opened this issue Oct 26, 2013 · 24 comments · Fixed by #6429
Labels: API Design, Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

@jankatins
Contributor

import pandas as pd
df = pd.DataFrame({"color":[1,2,3,4]})
print df
   color
0      1
1      2
2      3
3      4
print df.replace({"color":{"1":"2","3":"4",}}) # works but shouldn't?
  color
0     2
1     2
2     4
3     4
print df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}}) # strange
  color
0     2
1     4
2     3
3     5
print df.replace({"color":{1:"2",2:"3",3:"4",4:"5"}}) # works by replacing each cell once
  color
0     2
1     3
2     4
3     5

df = pd.DataFrame({"color":["1","2","3","4"]})
print df
  color
0     1
1     2
2     3
3     4
print df.replace({"color":{"1":"2","3":"4",}}) # works
  color
0     2
1     2
2     4
3     4
print df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}}) # works not
  color
0     3
1     3
2     5
3     5
print df.replace({"color":{1:"2",2:"3",3:"4",4:"5"}}) # works as expected: shouldn't replace anything!
  color
0     1
1     2
2     3
3     4

So, my expected behaviour would be:

  • don't replace a cell if the type of the cell does not match the type of the key (as is already the case when a string cell meets an int key)
  • if the value of a cell is replaced, the cell shouldn't be replaced a second time in the same replace call

I found the problem when I tried to map string values to colors and got blown-up color values: a dict like {"3":"#123456","4":"#000000"} would convert "3" into "#123#00000056" instead of "#123456".
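
The blown-up color value can be reproduced without pandas. The following is a minimal sketch, assuming the replacements are applied one after another as substring substitutions; the helper names `replace_sequentially` and `replace_once` are hypothetical and are not pandas APIs.

```python
def replace_sequentially(cell, mapping):
    """Apply each substitution in turn, so later keys also match earlier output."""
    for old, new in mapping.items():
        cell = cell.replace(old, new)
    return cell


def replace_once(cell, mapping):
    """Look the whole cell up exactly once; unmatched cells pass through."""
    return mapping.get(cell, cell)


colors = {"3": "#123456", "4": "#000000"}

# The "4" inside the freshly inserted "#123456" is replaced a second time.
print(replace_sequentially("3", colors))  # #123#00000056
# A single whole-cell lookup gives the value the reporter expected.
print(replace_once("3", colors))          # #123456
```

The second helper corresponds to the second expected-behaviour bullet above: once a cell has been replaced, it is not touched again in the same call.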

Edit: inserted the string cell cases and my expected behaviour, and deleted the initial comments which had these examples

jdavidson pushed a commit to jdavidson/ggplot that referenced this issue Nov 26, 2013
In the case an original value (e.g. 0-9A-F) was part of a color
definition, using df.replace(dict) would do a double replace on the
same cell, resulting in colors like "#12345#ABCDEF", which would
then throw an error during plotting.

Using apply(lambda x: replacements[x]) is probably slower, so should
be replaced when the pandas bug is gone.

See pandas-dev/pandas#5338 for the bugreport
in pandas.
@jreback
Contributor

jreback commented Feb 18, 2014

@cpcloud can you look at this?

@cpcloud
Member

cpcloud commented Feb 18, 2014

yep

cpcloud self-assigned this Feb 18, 2014
@cpcloud
Member

cpcloud commented Feb 19, 2014

I guess I'm mr. replace now

@cpcloud
Member

cpcloud commented Feb 19, 2014

:)

@cpcloud
Member

cpcloud commented Feb 19, 2014

@JanSchulz @jreback As I'm sure you both know

df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}}) 

is not well-defined. For example, replace might replace '1' with '2' first, resulting in [2, 2, 3, 4] then replace '2' with '3' and so on ...

@cpcloud
Member

cpcloud commented Feb 19, 2014

What should pandas do here? I don't think it's easy to detect this kind of "chained" replacement. I suppose you could keep track of what has already been replaced and replace incrementally; not sure if that's worth it tho

@cpcloud
Member

cpcloud commented Feb 19, 2014

With a non-nested dict like

df.replace({'1': '2', '2': '3'})

exactly what I described above is happening

@jreback
Contributor

jreback commented Feb 19, 2014

I think you should raise on a dict whose keys and values overlap

however an ordereddict would be ok

@jreback
Contributor

jreback commented Feb 19, 2014

I think if

set(d.keys()) & set(d.values()) is not empty then u raise
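
The check proposed here can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `check_no_overlap` and the error message are hypothetical, the values are assumed to be hashable, and nothing here is taken from the actual pandas fix.

```python
def check_no_overlap(mapping):
    """Raise ValueError if any replacement value is also a key to be replaced."""
    overlap = set(mapping.keys()) & set(mapping.values())
    if overlap:
        # Sort by repr so that mixed-type overlaps can still be listed.
        listed = ", ".join(sorted(repr(item) for item in overlap))
        raise ValueError(
            "replacement keys and values overlap: " + listed
        )


check_no_overlap({"1": "a", "2": "b"})  # passes silently

try:
    check_no_overlap({"1": "2", "2": "3"})
except ValueError as exc:
    print(exc)  # replacement keys and values overlap: '2'
```

Because the comparison uses ordinary Python equality, a dict such as `{1: "1"}` does not count as overlapping: the int key and the string value are different objects.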

@cpcloud
Member

cpcloud commented Feb 19, 2014

ordereddict would still have the same problem, since you'll just end up replacing each current value with the next replacement

@cpcloud
Member

cpcloud commented Feb 19, 2014

suppose they're iterated over in order. then

1->2 [2, 2, 3, 4]
2->3 [3, 3, 3, 4]
3->4 [4, 4, 4, 4]
4->5 [5, 5, 5, 5]
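
The trace above can be reproduced with plain Python lists. This is a sketch of the in-order iteration being described, not of what pandas does internally; `trace_sequential` is a hypothetical helper.

```python
def trace_sequential(values, steps):
    """Apply (old, new) replacements in order and record the state after each."""
    values = list(values)
    history = []
    for old, new in steps:
        values = [new if v == old else v for v in values]
        history.append(list(values))
    return history


steps = [(1, 2), (2, 3), (3, 4), (4, 5)]
history = trace_sequential([1, 2, 3, 4], steps)

for (old, new), state in zip(steps, history):
    print(f"{old}->{new} {state}")
# 1->2 [2, 2, 3, 4]
# 2->3 [3, 3, 3, 4]
# 3->4 [4, 4, 4, 4]
# 4->5 [5, 5, 5, 5]
```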

@jreback
Contributor

jreback commented Feb 19, 2014

ok then just raise if there is overlap
much simpler

@cpcloud
Member

cpcloud commented Feb 19, 2014

yep good call

@cpcloud
Member

cpcloud commented Feb 19, 2014

Hm okay, it seems to be happening only with strings ... let me investigate some more

@jankatins
Contributor Author

I would find it interesting if the individual replacements were not done one after another on the whole column but all together, so that [1,2,3,4,5] ends up as [2,3,4,5,6].

Right now (and if that raises), I have to figure out the replacement order in Python when I construct a replacement dict (probably via a loop), but I would like to have the code which does that in pandas so that it is fast.
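
The "all together" semantics requested here amount to looking up each original value exactly once. A minimal pure-Python sketch follows; `replace_all_at_once` is a hypothetical name, and a pandas implementation would need a vectorised equivalent to be fast.

```python
def replace_all_at_once(values, mapping):
    """Map every original value through the dict once; misses stay unchanged."""
    return [mapping.get(v, v) for v in values]


shift = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6}

print(replace_all_at_once([1, 2, 3, 4, 5], shift))  # [2, 3, 4, 5, 6]
# String cells are left alone by int keys, since "1" != 1 in Python.
print(replace_all_at_once(["1", "2"], shift))       # ['1', '2']
```

No ordering has to be worked out by the caller, because each lookup sees only the original value and never the result of another replacement.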

@cpcloud
Member

cpcloud commented Feb 19, 2014

Yep. I'll check it out tonight. It should work the way you expect, since it works for ints.

cpcloud added the Bug label Feb 21, 2014
@cpcloud
Member

cpcloud commented Feb 21, 2014

It looks like there's a somewhat subtle bug here. It's only apparent in certain cases (dict replace with a nested dict).

This "works":

In [6]: df = DataFrame({'a': range(4)})

In [7]: df.replace({'a': {0: 1, 1: 2, 2: 3, 3: 4}})
Out[7]:
   a
0  1
1  2
2  3
3  4

[4 rows x 1 columns]

But this is broken (it shouldn't return the same as the previous call to replace):

In [8]: df = DataFrame({'a': permutation(4)})

In [9]: df
Out[9]:
   a
0  1
1  0
2  3
3  2

[4 rows x 1 columns]

In [10]: df.replace({'a': {0: 1, 1: 2, 2: 3, 3: 4}})
Out[10]:
   a
0  1
1  2
2  3
3  4

[4 rows x 1 columns]

So, in fact, it doesn't work except in this trivial case, which inadvertently depends on how the dict is ordered.
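
How much an in-order replacement can depend on iteration order is easy to show with plain lists. This sketch does not reproduce the pandas internals behind the output above; it only assumes replacements are applied one at a time, and `apply_in_order` is a hypothetical helper.

```python
def apply_in_order(values, steps):
    """Apply (old, new) replacements one after another to the whole list."""
    values = list(values)
    for old, new in steps:
        values = [new if v == old else v for v in values]
    return values


mapping = {0: 1, 1: 2, 2: 3, 3: 4}
forward = list(mapping.items())   # 0->1 first, 3->4 last
backward = forward[::-1]          # 3->4 first, 0->1 last

print(apply_in_order(range(4), forward))   # [4, 4, 4, 4]
print(apply_in_order(range(4), backward))  # [1, 2, 3, 4]
```

Only the backward order happens to give the result a user would expect, which is why a result that looks right for one dict ordering says little about the general case.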

I vote for banning this as @jreback suggested above by raising if the keys and values overlap.

@jreback
Contributor

jreback commented Feb 21, 2014

agreed
punt to the user they can always iteratively replace at a higher level

@cpcloud
Member

cpcloud commented Feb 21, 2014

i'm kind of down on these nested dict replaces right now 😡 (probably just debugging anger) Is there a use case for them other than convenience?

@cpcloud
Member

cpcloud commented Feb 21, 2014

@JanSchulz If you want to do this kind of replacement you can do

In [7]: df = DataFrame({'a': permutation(4)})

In [8]: df
Out[8]:
   a
0  3
1  1
2  2
3  0

[4 rows x 1 columns]

In [9]: df['a'] = df.a.replace({0: 1, 1: 2, 3: 4, 2: 3})

In [10]: df
Out[10]:
   a
0  4
1  2
2  3
3  1

[4 rows x 1 columns]

@jankatins
Contributor Author

So given a replacement dict where keys and values overlap (the above example), df.replace({col: replacementdict}) throws an error but df.col.replace(replacementdict) does not? Wouldn't it then be easier to switch to a loop internally than to throw an error in some data-specific cases (read: not obvious and only reproducible with certain data sets)?

@cpcloud
Member

cpcloud commented Feb 21, 2014

That seems like it should be possible. I'll investigate tonight.

@jankatins
Contributor Author

@cpcloud This issue is closed, but your last comment indicates that something is still to be done here.

@auxiliary

Any updates on this? Apparently it shouldn't be closed.

has2k1 pushed a commit to has2k1/plotnine that referenced this issue Apr 25, 2017
In the case an original value (e.g. 0-9A-F) was part of a color
definition, using df.replace(dict) would do a double replace on the
same cell, resulting in colors like "#12345#ABCDEF", which would
then throw an error during plotting.

Using apply(lambda x: replacements[x]) is probably slower, so should
be replaced when the pandas bug is gone.

See pandas-dev/pandas#5338 for the bugreport
in pandas.

4 participants