DataFrame.replace(dict) has weird behaviour in some cases #5338

jankatins · 2013-10-26T11:04:26Z

import pandas as pd
df = pd.DataFrame({"color":[1,2,3,4]})
print df
   color
0      1
1      2
2      3
3      4
print df.replace({"color":{"1":"2","3":"4",}}) # works but shouldn't?
  color
0     2
1     2
2     4
3     4
print df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}}) # strange
  color
0     2
1     4
2     3
3     5
print df.replace({"color":{1:"2",2:"3",3:"4",4:"5"}}) # works by replacing each cell once
  color
0     2
1     3
2     4
3     5

df = pd.DataFrame({"color":["1","2","3","4"]})
print df
  color
0     1
1     2
2     3
3     4
print df.replace({"color":{"1":"2","3":"4",}}) # works
  color
0     2
1     2
2     4
3     4
print df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}}) # does not work
  color
0     3
1     3
2     5
3     5
print df.replace({"color":{1:"2",2:"3",3:"4",4:"5"}}) # works as expected: shouldn't replace anything!
  color
0     1
1     2
2     3
3     4

So, my expected behaviour would be:

- Don't replace a cell if the type of the cell does not match the type of the key (as is the case when a string cell is replaced via an int key).
- If the value of a cell is replaced, the cell shouldn't be replaced a second time in the same replace call.

I found the problem when I tried to map string values to colors and got blown-up color values: a dict like {"3":"#123456","4":"#000000"} would convert "3" into "#123#00000056".
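
The blow-up described above can be reproduced without pandas. The following is a minimal sketch, assuming the replacements are applied one after another as substring substitutions; the names `replacements` and `chained_replace` are made up for illustration and are not pandas internals.

```python
# Hypothetical illustration: apply each replacement in turn as a substring
# substitution, so a later key can match inside an earlier replacement value.
replacements = {"3": "#123456", "4": "#000000"}


def chained_replace(value, mapping):
    for old, new in mapping.items():
        value = value.replace(old, new)
    return value


result = chained_replace("3", replacements)
print(result)  # prints #123#00000056
```

The "4" inside the freshly inserted "#123456" is matched by the second key, which is exactly the double replacement reported here.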

Edit: inserted the string cell cases and my expected behaviour, and deleted the initial comments which had these examples.

In the case an original value (e.g. 0-9A-F) was part of a color definition, using df.replace(dict) would do a double replace on the same cell, resulting in colors like "#12345#ABCDEF", which would then throw an error during plotting. Using apply(lambda x: replacements[x]) is probably slower, so it should be replaced when the pandas bug is gone. See pandas-dev/pandas#5338 for the bug report in pandas.

jreback · 2014-02-18T19:41:30Z

@cpcloud can you look at this?

cpcloud · 2014-02-18T20:12:08Z

yep

cpcloud · 2014-02-19T01:25:43Z

I guess I'm mr. replace now

cpcloud · 2014-02-19T01:25:51Z

:)

cpcloud · 2014-02-19T02:03:33Z

@JanSchulz @jreback As I'm sure you both know

df.replace({"color":{"1":"2","2":"3","3":"4","4":"5"}})

is not well-defined. For example, replace might replace '1' with '2' first, resulting in [2, 2, 3, 4], then replace '2' with '3', and so on ...

cpcloud · 2014-02-19T02:08:28Z

What should pandas do here? I don't think it's easy to detect this kind of "chained" replacement. I suppose you could keep track of what has already been replaced and replace incrementally; not sure if that's worth it, though.

cpcloud · 2014-02-19T02:17:21Z

With a non-nested dict like

df.replace({'1': '2', '2': '3'})

exactly what I described above is happening

jreback · 2014-02-19T02:19:17Z

I think you should raise on a dict whose keys and values overlap

however an ordereddict would be ok

jreback · 2014-02-19T02:20:22Z

I think if

set(d.keys()) & set(d.values()) is not empty then u raise
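
A rough sketch of that check follows. The helper name `raise_on_overlap` and the error message are invented here and are not taken from pandas; it also assumes the dict values are hashable.

```python
def raise_on_overlap(mapping):
    """Raise ValueError if any key of ``mapping`` also appears as a value."""
    overlap = set(mapping.keys()) & set(mapping.values())
    if overlap:
        raise ValueError(
            "replacement keys and values overlap: %r" % sorted(overlap, key=repr)
        )


raise_on_overlap({"1": "a", "2": "b"})  # no overlap, nothing is raised

try:
    raise_on_overlap({"1": "2", "2": "3"})  # "2" is both a key and a value
    overlap_detected = False
except ValueError:
    overlap_detected = True
```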

cpcloud · 2014-02-19T02:22:10Z

ordereddict would still have the same problem, since you'll just end up replacing each current value with the next replacement

cpcloud · 2014-02-19T02:23:27Z

suppose they're iterated over in order. then

1->2 [2, 2, 3, 4]
2->3 [3, 3, 3, 4]
3->4 [4, 4, 4, 4]
4->5 [5, 5, 5, 5]
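
The trace above is easy to reproduce in plain Python. The sketch below is only a model of the two strategies (one whole-column pass per key versus a single lookup per cell); the function names are hypothetical and nothing here is pandas code.

```python
mapping = {1: 2, 2: 3, 3: 4, 4: 5}
column = [1, 2, 3, 4]


def replace_sequentially(values, mapping):
    # One full pass over the column per (old, new) pair, in dict order,
    # so values written by an earlier pass can be rewritten by a later one.
    values = list(values)
    for old, new in mapping.items():
        values = [new if v == old else v for v in values]
    return values


def replace_once(values, mapping):
    # One lookup per cell against its original value.
    return [mapping.get(v, v) for v in values]


sequential = replace_sequentially(column, mapping)  # [5, 5, 5, 5]
single_pass = replace_once(column, mapping)         # [2, 3, 4, 5]
```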

jreback · 2014-02-19T02:24:50Z

ok then just raise if there is overlap
much simpler

cpcloud · 2014-02-19T02:24:58Z

yep good call

cpcloud · 2014-02-19T02:34:29Z

Hm okay it seems to be only happening with strings... let me investigate some more

jankatins · 2014-02-19T19:23:12Z

I would find it interesting if the individual replacements are not done one after another to the whole row but all together so that that [1,2,3,4,5] ends up as [2,3,4,5,6].

Right now (and if that raises),I have to figure out the replacement order in python when I construct a replacement dict (probably via a loop), but I would like to have the code which does that in pandas so that it is fast.
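
Until `replace` behaves that way, one option for getting the all-at-once result is to look each cell up exactly once. The following is a minimal sketch using `Series.map` with a fallback to the original value; the column and mapping are the ones from this report, and its speed relative to `replace` has not been measured here.

```python
import pandas as pd

df = pd.DataFrame({"color": ["1", "2", "3", "4"]})
mapping = {"1": "2", "2": "3", "3": "4", "4": "5"}

# Each cell is looked up once against its original value; cells without a
# matching key (including keys of a different type) are left unchanged.
replaced = df["color"].map(lambda v: mapping.get(v, v))
print(replaced.tolist())  # ['2', '3', '4', '5']
```

Because the lookup is a plain dict access, an int key such as `1` never matches the string cell `"1"`, which also covers the first expected behaviour listed in the report.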

cpcloud · 2014-02-19T19:25:14Z

Yep. I'll check it out tonight. It should work the way you expect, since it works for ints.

cpcloud · 2014-02-21T04:19:31Z

It looks like there's a somewhat subtle bug here. It's only apparent in certain cases (dict replace with a nested dict).

This "works":

In [6]: df = DataFrame({'a': range(4)})

In [7]: df.replace({'a': {0: 1, 1: 2, 2: 3, 3: 4}})
Out[7]:
   a
0  1
1  2
2  3
3  4

[4 rows x 1 columns]

But this is broken (it shouldn't return the same as the previous call to replace):

In [8]: df = DataFrame({'a': permutation(4)})

In [9]: df
Out[9]:
   a
0  1
1  0
2  3
3  2

[4 rows x 1 columns]

In [10]: df.replace({'a': {0: 1, 1: 2, 2: 3, 3: 4}})
Out[10]:
   a
0  1
1  2
2  3
3  4

[4 rows x 1 columns]

So, in fact, it doesn't work except in this trivial case, which inadvertently depends on how the dict is ordered.

I vote for banning this as @jreback suggested above by raising if the keys and values overlap.

jreback · 2014-02-21T04:23:27Z

agreed
punt to the user; they can always iteratively replace at a higher level

cpcloud · 2014-02-21T04:25:53Z

i'm kind of down on these nested dict replaces right now 😡 (probably just debugging anger) Is there a use case for them other than convenience?

cpcloud · 2014-02-21T05:07:10Z

@JanSchulz If you want to do this kind of replacement you can do

In [7]: df = DataFrame({'a': permutation(4)})

In [8]: df
Out[8]:
   a
0  3
1  1
2  2
3  0

[4 rows x 1 columns]

In [9]: df['a'] = df.a.replace({0: 1, 1: 2, 3: 4, 2: 3})

In [10]: df
Out[10]:
   a
0  4
1  2
2  3
3  1

[4 rows x 1 columns]

jankatins · 2014-02-21T19:41:26Z

So given a replacementdict where keys and values overlap (the above example), df.replace({col: replacementdict}) throws an error but df.col.replace(replacementdict) does not? Wouldn't it then be easier to switch to a loop internally than to throw an error in some data-specific cases (read: not obvious and only reproducible with certain data sets)?

cpcloud · 2014-02-21T20:10:30Z

That seems like it should be possible. I'll investigate tonight.

jankatins · 2014-07-10T21:15:32Z

@cpcloud This issue is closed, but your last comment indicates that something is still to be done here.

auxiliary · 2015-11-20T06:44:33Z

Any updates on this? Apparently it shouldn't be closed.

jankatins mentioned this issue Nov 18, 2013

Refactor DataFrame.replace to dispatch on types #5541

Closed

cpcloud self-assigned this Feb 18, 2014

cpcloud added the Bug label Feb 21, 2014

cpcloud mentioned this issue Feb 21, 2014

BUG: punt to user when passing overlapping replacement values in a nested dict #6429

Merged

cpcloud closed this as completed in #6429 Feb 21, 2014

wesm unassigned cpcloud Oct 12, 2016

chris-b1 mentioned this issue Apr 18, 2017

DataFrame.replace() overwrites when values are non-numeric #16051

Closed
