Problem with DataFrame.replace using None #26050

Closed
josegcpa opened this issue Apr 11, 2019 · 7 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@josegcpa

josegcpa commented Apr 11, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

ar = np.random.normal(size=[100,10])
df = pd.DataFrame(ar).astype(str)

df.replace('0',None)

Problem description

Every time I run something like this (replacing a specific string with None in a string DataFrame or DataFrame column), the matching cells get filled with the element directly above them in the column. This creates a lot of confusion, since None is, pretty much, THE "pythonic" way of signalling the absence of a variable/whatever. It works fine when replacing with np.nan, but if anything I would expect the two to behave almost identically.

Expected Output

The expected output would be for all strings matching the expression in replace to be replaced with either None or NaN, or at least for a warning message to flag this behaviour.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.6.2
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd
Member

WillAyd commented Apr 11, 2019

Please provide a code sample that can reproduce the issue clearly

WillAyd added the Needs Info label Apr 11, 2019
@josegcpa
Author

josegcpa commented Apr 11, 2019

The code sample does reproduce the issue, but here is a more obvious example:

import pandas as pd
import numpy as np

k = np.array([['0','0','0','0'],['0','1','0','1'],['0','0','0','0']])
pd.DataFrame(k).replace('0',None)

Expected output

      0     1     2     3
0  None  None  None  None
1  None     1  None     1
2  None  None  None  None

Actual output

   0  1  2  3
0  0  0  0  0
1  0  1  0  1
2  0  1  0  1

@WillAyd
Member

WillAyd commented Apr 11, 2019

OK thanks - the second example is much clearer.

It is interesting that this way of calling replace will get you what you want:

pd.DataFrame(k).replace({'0': None})

So something might be off with inference. In any case, investigation and PRs are always welcome.
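For reference, a minimal sketch of the two spellings mentioned so far that avoid the fill fallback: the dict form above and the np.nan replacement from the original report. It reuses the array k from the earlier comment. The names as_none and as_nan are only illustrative, and the dtype of the result may differ between pandas versions, so the check relies on isna() rather than on the literal cell values.

```python
import numpy as np
import pandas as pd

k = np.array([['0', '0', '0', '0'],
              ['0', '1', '0', '1'],
              ['0', '0', '0', '0']])
df = pd.DataFrame(k)

# Dict form: the replacement value is given explicitly per key,
# so there is no ambiguity about whether a value was passed.
as_none = df.replace({'0': None})

# Scalar form with np.nan: also unambiguous, since np.nan is not
# the default of the value parameter.
as_nan = df.replace('0', np.nan)

# Both results should flag the same ten cells as missing.
missing_none = int(as_none.isna().sum().sum())
missing_nan = int(as_nan.isna().sum().sum())
print(missing_none, missing_nan)
```

Whether the missing cells display as None or NaN depends on the column dtype and the pandas version in use.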

WillAyd added the Bug label and removed the Needs Info label Apr 11, 2019
WillAyd added this to the Contributions Welcome milestone Apr 11, 2019
WillAyd added the DataFrame label Apr 11, 2019
@jschendel
Member

jschendel commented Apr 11, 2019

The issue here is that the value parameter of replace defaults to None, so as written the code can't distinguish between a user explicitly passing None and a user passing nothing at all, in which case the default None is what gets used. The code then falls back to the method parameter, which causes the unexpected behavior you're seeing; e.g. df.replace('0', None, method='bfill') fills in the opposite direction.

The fix here is to use a sentinel object as the default value instead of None. The implementation will be similar to what I did in #25069.
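To illustrate the sentinel idea in isolation, here is a toy sketch in plain Python. It is not the pandas implementation: replace_items and _MISSING are hypothetical names invented for this example, and the pad-style fallback only imitates the behavior described above.

```python
# Hypothetical sentinel: a unique object that no caller can pass by accident.
_MISSING = object()


def replace_items(items, to_replace, value=_MISSING):
    """Toy stand-in for a replace() that falls back to forward filling."""
    result = []
    for item in items:
        if item != to_replace:
            result.append(item)
        elif value is _MISSING:
            # No value supplied: reuse the previous output element,
            # or keep the item when there is nothing above it.
            result.append(result[-1] if result else item)
        else:
            # A value was supplied explicitly, even if it is None.
            result.append(value)
    return result


filled = replace_items(['0', '1', '0'], '0')
explicit_none = replace_items(['0', '1', '0'], '0', None)
print(filled)
print(explicit_none)
```

Because the default is a private object rather than None, an explicit None takes the replacement branch instead of the fill branch.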

None is, pretty much, THE "pythonic" way of signalling the absence of a variable/whatever.

A bit of a quibble, but I'd disagree with this to a certain extent. This might be true in the context of the Python language itself, but None is not idiomatic in the numpy/pandas realm, or in scientific computing in general. The major issue with None is that it does not behave like an IEEE 754 NaN: None == None returns True, whereas the IEEE specification dictates that a NaN compares unequal to itself, which is the case for np.nan == np.nan (it returns False).

My advice is to always use np.nan when operating in the numpy/pandas space, unless you have a very specific reason to use None. We generally try to accept None for the sake of compatibility, but this can lead to some weird inconsistencies and could result in unexpected behavior, e.g. see #20442.
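The comparison behavior described above can be checked directly. This is a small sketch; the variable names are illustrative, and the Series is given an explicit object dtype so that both None and np.nan are stored as-is.

```python
import numpy as np
import pandas as pd

# None compares equal to itself.
none_equal = (None == None)  # noqa: E711 (deliberate equality check)

# NaN compares unequal to itself, following IEEE 754.
nan_equal = (np.nan == np.nan)

# pandas treats both as missing when asked via isna().
s = pd.Series(['a', None, np.nan], dtype=object)
missing_flags = s.isna().tolist()

print(none_equal, nan_equal, missing_flags)
```

So isna() is a safer test for missing values than an equality comparison against either None or np.nan.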

RuchitaD1 pushed a commit to RuchitaD1/pandas that referenced this issue Aug 18, 2019
mroeschke added the replace label Apr 20, 2020
@simonjayhawkins
Member

Closing as a duplicate of #19998. Ping to reopen if I'm missing something.

@NeoWang9999

@simonjayhawkins why was this closed? The problem still remains.

@simonjayhawkins
Member

@simonjayhawkins why was this closed? The problem still remains.

see #26050 (comment)

simonjayhawkins added the Duplicate Report label and removed the Bug, DataFrame, and replace labels Sep 3, 2020
simonjayhawkins removed this from the Contributions Welcome milestone Sep 3, 2020