DataFrame.replace() overwrites when values are non-numeric #16051

Closed
cx0 opened this issue Apr 18, 2017 · 5 comments · Fixed by #29359
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@cx0

cx0 commented Apr 18, 2017

Code Sample, a copy-pastable example if possible

In [1]: pd.Series([1,2,3]).replace({1 : 2, 2 : 3, 3 : 4})
Out[1]: 
0    2
1    3
2    4
dtype: int64

In [2]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[2]:
0    4
1    4
2    4
dtype: object

Problem description

I'd expect replacement over values in a DataFrame to be non-transitive. Suppose we want to replace a with b, and b with c. When this replacement is applied to an entry containing the value a, the rules are chained, so c is returned instead of b. The same replacement is not transitive for numeric values, as the example code shows.

I think this default behavior should be mentioned explicitly in the documentation. It would also be nice to have a Boolean option to set the transitivity on/off.

Expected Output

Out[2]:
0    2
1    3
2    4
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@chris-b1
Contributor

Yeah, I'd consider this more a bug than intended behavior. Deep in the replace code, if dtype=object is being replaced on, a recursive path is used. I'm not entirely sure why, but it could probably be changed to do only one pass, as you're expecting.

if b.dtype == np.object_:
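
To illustrate the difference between a chained replacement and a single pass, here is a minimal pure-Python sketch. The helpers `replace_chained` and `replace_single_pass` are hypothetical names made up for this example, not pandas internals, and the chained version only mimics the reported output rather than reproducing the actual recursive code path.

```python
def replace_chained(values, mapping):
    """Apply each rule in turn to the already-modified values (transitive)."""
    out = list(values)
    for old, new in mapping.items():
        # Later rules see the results of earlier rules, so '1' -> '2' -> '3' -> '4'.
        out = [new if v == old else v for v in out]
    return out


def replace_single_pass(values, mapping):
    """Look up each original value exactly once (non-transitive)."""
    return [mapping.get(v, v) for v in values]


mapping = {'1': '2', '2': '3', '3': '4'}
chained = replace_chained(['1', '2', '3'], mapping)
single = replace_single_pass(['1', '2', '3'], mapping)
print(chained)  # ['4', '4', '4']
print(single)   # ['2', '3', '4']
```

The single-pass variant matches the expected output from the report, because every lookup is made against the original value rather than an intermediate result.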

@chris-b1 chris-b1 added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Apr 18, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Apr 18, 2017
@chris-b1
Contributor

xref #5541, #5338

@prcastro
Contributor

prcastro commented Oct 16, 2017

Today, when calling replace on a DataFrame with a nested mapping {col: {key: value}}, an error is raised when keys and values overlap. However, the underlying problem is this bug, which occurs whether or not the mapping is nested.

Wouldn't it be more consistent to raise the error whenever the keys and values overlap and the values are non-numeric, instead of only raising when the mapping is nested?

This would make DataFrame replacement with a nested mapping {col: {key: value}} usable wherever a loop could be used, while still raising the error everywhere it is necessary.
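
As a rough sketch of the overlap check being discussed, the snippet below finds the labels that appear both as a key and as a value in a flat mapping. The helper name `overlapping_labels` is hypothetical and is not part of pandas; it only shows how such a condition could be detected before deciding whether to raise.

```python
def overlapping_labels(mapping):
    """Return the labels used both as a replacement key and as a replacement value."""
    return set(mapping) & set(mapping.values())


overlap = overlapping_labels({'1': '2', '2': '3', '3': '4'})
no_overlap = overlapping_labels({'a': 'b'})
print(sorted(overlap))  # ['2', '3']
print(no_overlap)       # set()
```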

@petroswork

I don't know if I'm hitting the same bug, but replace corrupts a specific DataFrame I tried it against, running Pandas 0.21 on Python 3.6, Linux. To wit:

In [51]: tmp.iloc[15771]
Out[51]: ''

In [52]: tmp.replace('').iloc[15771]
Out[52]: '[email protected]'

however,

In [55]: tmp.replace([''], [None]).iloc[15771] is None
Out[55]: True

I have no idea how [email protected], which is the value of another row, appeared here. This result is repeated for several rows.

@mroeschke
Member

Looks like this is fixed on master. Could use a test.

In [88]: In [2]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
    ...:
Out[88]:
0    2
1    3
2    4
dtype: object
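
A regression test for this could look roughly like the sketch below. The test name is an assumption made for illustration and is not necessarily what was added to the pandas test suite; it relies only on `Series.replace` and `pd.testing.assert_series_equal`, and assumes a pandas version where the fix is present.

```python
import pandas as pd


def test_replace_dict_is_not_transitive():
    # Hypothetical test name. Keys and values overlap on purpose,
    # mirroring the mapping from the original report.
    ser = pd.Series(['1', '2', '3'])
    result = ser.replace({'1': '2', '2': '3', '3': '4'})
    expected = pd.Series(['2', '3', '4'])
    pd.testing.assert_series_equal(result, expected)


test_replace_dict_is_not_transitive()
```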

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Dtype Conversions Unexpected or buggy dtype conversions labels Oct 27, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Nov 2, 2019

6 participants