DataFrame.replace() overwrites when values are non-numeric #16051

cx0 · 2017-04-18T19:28:43Z

Code Sample, a copy-pastable example if possible

In [1]: pd.Series([1,2,3]).replace({1 : 2, 2 : 3, 3 : 4})
Out[1]: 
0    2
1    3
2    4
dtype: int64

In [2]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[2]:
0    4
1    4
2    4
dtype: object

Problem description

I'd expect replacement over values in a DataFrame to be non-transitive. Suppose we want to replace a with b, and b with c. When this replacement is applied to an entry containing the value a, the replacement rules are propagated, so c is returned instead of b. The same replacement is not transitive for numeric values, as the example code shows.

I think this default behavior should be mentioned explicitly in the documentation. It would also be nice to have a Boolean option to set the transitivity on/off.
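
For illustration, a single-pass (non-transitive) replacement can be sketched with `Series.map`, looking each value up in the mapping exactly once. The helper name `replace_once` is hypothetical, not a pandas API, and this is a possible workaround rather than a description of how `replace` is implemented.

```python
import pandas as pd


def replace_once(series, mapping):
    """Map each value through `mapping` exactly once (hypothetical helper).

    Values without an entry in `mapping` are left unchanged.
    """
    return series.map(lambda v: mapping.get(v, v))


s = pd.Series(['1', '2', '3'])
result = replace_once(s, {'1': '2', '2': '3', '3': '4'})
print(result.tolist())
```

Because every lookup is made against the original value, a result such as '2' is never fed back into the mapping.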

Expected Output

Out[2]:
0    2
1    3
2    4
dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None


chris-b1 · 2017-04-18T20:06:59Z

Yeah, I'd consider this more a bug than intended behavior. Deep in the replace code, if dtype=object is being replaced on, a recursive path is used. I'm not entirely sure why, but it could probably be changed to do only one pass, like you're expecting.

pandas/pandas/core/internals.py

Line 3271 in 2522efa

if b.dtype == np.object_:
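
The difference between the two behaviors can be sketched in plain Python. This is only an illustration of why applying the rules one after another overwrites earlier results; it does not reproduce the pandas internals, and both function names are hypothetical.

```python
def replace_sequential(values, mapping):
    """Apply each rule in turn to the running result (transitive)."""
    out = list(values)
    for old, new in mapping.items():
        # Later rules see the output of earlier rules.
        out = [new if v == old else v for v in out]
    return out


def replace_single_pass(values, mapping):
    """Look up each original value once (non-transitive)."""
    return [mapping.get(v, v) for v in values]


mapping = {'1': '2', '2': '3', '3': '4'}
sequential = replace_sequential(['1', '2', '3'], mapping)
single = replace_single_pass(['1', '2', '3'], mapping)
print(sequential)  # ['4', '4', '4']
print(single)      # ['2', '3', '4']
```

The sequential result depends on the order in which the rules are applied; the single-pass result does not.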

chris-b1 · 2017-04-18T20:11:20Z

xref #5541, #5338

prcastro · 2017-10-16T11:06:35Z

Today, when replacing in a DataFrame with a nested mapping {col: {key: value}}, an error is raised when keys and values overlap. However, this is actually due to this bug, which happens whether or not the mapping is nested.

Wouldn't it be more consistent to raise the error whenever the keys and values overlap and the values are non-numeric, instead of only raising when the mapping is nested?

This would make DataFrame replacing with a nested mapping {col: {key: value}} usable whenever a loop could be used, while raising the error everywhere it is necessary.
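
A minimal sketch of the overlap check described above, assuming hashable keys and values. The function name `overlapping_values` is hypothetical and is not part of pandas.

```python
def overlapping_values(mapping):
    """Return the values that also appear as keys in `mapping`."""
    return set(mapping) & set(mapping.values())


overlap = overlapping_values({'1': '2', '2': '3', '3': '4'})
no_overlap = overlapping_values({'a': 'x', 'b': 'y'})
print(sorted(overlap))  # ['2', '3']
print(no_overlap)       # set()
```

A non-empty result marks exactly the mappings where a transitive replacement and a single-pass replacement can disagree.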

petroswork · 2017-12-09T19:20:51Z

I don't know if I'm hitting the same bug, but replace corrupts a specific DataFrame I tried it against, running Pandas 0.21 on Python 3.6, Linux. To wit:

In [51]: tmp.iloc[15771]
Out[51]: ''

In [52]: tmp.replace('').iloc[15771]
Out[52]: '[email protected]'

however,

In [55]: tmp.replace([''], [None]).iloc[15771] is None
Out[55]: True

I have no idea how [email protected], which is the value of another row, appeared here. This result is repeated for several rows.

mroeschke · 2019-10-27T18:35:45Z

Looks like this is fixed on master. Could use a test.

In [88]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[88]:
0    2
1    3
2    4
dtype: object
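
A regression test for this case might look like the sketch below. It assumes a pandas version in which the behavior shown above is present, the test name is hypothetical, and it is not necessarily the test that was eventually added to pandas.

```python
import pandas as pd
import pandas.testing as tm


def test_replace_overlapping_string_mapping():
    # Keys and values overlap; each value should be replaced only once.
    ser = pd.Series(['1', '2', '3'])
    result = ser.replace({'1': '2', '2': '3', '3': '4'})
    expected = pd.Series(['2', '3', '4'])
    tm.assert_series_equal(result, expected)
```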

chris-b1 added the Bug and Dtype Conversions labels Apr 18, 2017

chris-b1 added this to the Next Major Release milestone Apr 18, 2017

mroeschke added the good first issue and Needs Tests labels and removed the Bug and Dtype Conversions labels Oct 27, 2019

pv8473h12 mentioned this issue Nov 2, 2019

GH 16051: DataFrame.replace() overwrites when values are non-numeric #29359

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Nov 2, 2019

WillAyd closed this as completed in #29359 Nov 4, 2019
