BUG: to_numeric does not validate the errors keyword #26394

Closed
sorenwacker opened this issue May 14, 2019 · 12 comments · Fixed by #26466

Labels: Error Reporting (incorrect or improved errors from pandas), good first issue, Numeric Operations (arithmetic, comparison, and logical operations)

sorenwacker commented May 14, 2019

From the discussion below: pd.to_numeric does not validate the value passed to the errors keyword, so any unrecognized value is silently treated as errors='coerce'.


Original report:

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'Strings': ['fire', 'hose'], 'Numbers': ['3838.2', '99']})
print(df.apply(pd.to_numeric, args={'errors': 'ignore'}).to_string())

Problem description

The code above should return:

  Strings  Numbers
0  'fire'   3838.2
1  'hose'     99.0

Instead it returns:

   Strings  Numbers
0      NaN   3838.2
1      NaN     99.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-48-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.4.1
pip: 19.1
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: 0.10.0
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.7
lxml.etree: 4.3.3
bs4: 4.6.3
html5lib: 0.9999999
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

gfyoung added the Bug and Numeric Operations labels May 14, 2019

gfyoung (Member) commented May 14, 2019

How odd! Investigation and PR are welcome!

jorisvandenbossche (Member) commented May 14, 2019

The odd thing is that it actually works correctly on a single column:

In [8]: df = pd.DataFrame({'Strings': ['fire', 'hose'], 'Numbers': ['3838.2', '99']}) 

In [9]: df.apply(pd.to_numeric, args={'errors': 'ignore'}) 
Out[9]: 
   Strings  Numbers
0      NaN   3838.2
1      NaN     99.0

In [10]: pd.to_numeric(df['Strings'], errors='ignore') 
Out[10]: 
0    fire
1    hose
Name: Strings, dtype: object

In [11]: pd.to_numeric(df.iloc[0, :], errors='ignore')
Out[11]: 
Strings      fire
Numbers    3838.2
Name: 0, dtype: object

jorisvandenbossche (Member) commented

Ah, but the args keyword of apply expects (a tuple of) positional arguments, so the dict is not seen as a keyword argument for to_numeric, but just as a dict passed to the second argument of to_numeric.

With proper passing through, this works:

In [24]: df.apply(pd.to_numeric, errors='ignore')  # or .., **{'errors': 'ignore'})
Out[24]: 
  Strings  Numbers
0    fire   3838.2
1    hose     99.0
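The difference between the two call styles can be seen without pandas. The sketch below uses a hypothetical stand-in, `apply_like`, which forwards `args` positionally and any extra keywords by name, roughly as the comment above describes for apply. `fake_to_numeric` is also made up and only reports what its `errors` parameter received.

```python
def fake_to_numeric(values, errors="raise"):
    """Hypothetical stand-in: return the value bound to `errors`."""
    return errors


def apply_like(func, column, args=(), **kwds):
    """Hypothetical stand-in for apply: positional extras come from `args`,
    keyword extras from `kwds`."""
    return func(column, *args, **kwds)


column = ["fire", "hose"]

# Keyword passed by name: reaches `errors` as intended.
by_keyword = apply_like(fake_to_numeric, column, errors="ignore")

# Tuple of positional arguments: 'ignore' lands in the second parameter.
by_tuple = apply_like(fake_to_numeric, column, args=("ignore",))

print(by_keyword, by_tuple)
```

Both calls deliver 'ignore' to `errors`; passing the keyword by name is the less error-prone of the two because it does not depend on parameter order.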

jorisvandenbossche added this to the No action milestone May 14, 2019

jorisvandenbossche (Member) commented

The second argument of to_numeric is errors, so this was actually happening:

In [26]: pd.to_numeric(df['Strings'], errors={'errors': 'ignore'})                                                                                            
Out[26]: 
0   NaN
1   NaN
Name: Strings, dtype: float64

So this is actually a bug, as to_numeric should validate the value passed to the errors keyword (currently the code has something like coerce_numeric = errors not in ('ignore', 'raise'), which incorrectly gives True in this case).
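To see why the quoted expression misfires, here is that check evaluated on its own for a few inputs. The expression is taken from the comment above, which describes it only as "something like" the real code; the function name `coerce_flag` and the sample values are assumptions for illustration.

```python
def coerce_flag(errors):
    # The check quoted above: anything that is not 'ignore' or 'raise'
    # is treated as a request to coerce.
    return errors not in ("ignore", "raise")


results = {
    "coerce": coerce_flag("coerce"),            # True, as intended
    "ignore": coerce_flag("ignore"),            # False, as intended
    "typo": coerce_flag("ignroe"),              # True: a typo silently coerces
    "dict": coerce_flag({"errors": "ignore"}),  # True: a dict silently coerces
}
print(results)
```

Because the check only excludes two known values, every other input, valid or not, falls through to coercion instead of raising an error.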

@soerendip interested in doing a PR to fix that validation?

sorenwacker (Author) commented

I am, but I am also really, really, really busy right now. So, I can probably work on this in approx 3 weeks.

gfyoung removed this from the No action milestone May 14, 2019
gfyoung added the Error Reporting label May 14, 2019

gfyoung (Member) commented May 14, 2019

In light of #26394 (comment), reopening this issue.

gfyoung reopened this May 14, 2019
jorisvandenbossche changed the title from "pd.to_numeric(..., errors='ignore') returns NaN instead of the input." to "BUG: to_numeric does not validate the errors keyword" May 15, 2019
jorisvandenbossche modified the milestones: 0.25.0, Contributions Welcome May 15, 2019

SummerGram commented

Anyone doing it?

gfyoung (Member) commented May 15, 2019

@SummerGram : Go for it!

SummerGram commented May 16, 2019

@jorisvandenbossche @gfyoung

It works if args is changed to a tuple, as below:

df = pd.DataFrame({'Strings': ['fire', 'hose'], 'Numbers': ['3838.2', '99']})
print(df.apply(pd.to_numeric, args=('ignore',)).to_string())

In this case:

df = pd.DataFrame({'Strings': ['fire', 'hose'], 'Numbers': ['3838.2', '99']})
print(df.apply(pd.to_numeric, args={'errors': 'ignore'}).to_string())

'ignore' is not passed to the errors parameter of to_numeric.

According to the documentation, args is supposed to be a tuple only. Any suggestions?

jorisvandenbossche (Member) commented

@SummerGram it is not about the args keyword of apply (that one is fine); this is only about the errors keyword in to_numeric.
Inside to_numeric, we should validate that the value passed to errors is one of the three allowed values ('ignore', 'raise', 'coerce').
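A minimal sketch of such a check, written as a standalone helper. The name `validate_errors` and the message text are assumptions for illustration, not the code that was merged into pandas.

```python
ALLOWED_ERRORS = ("ignore", "raise", "coerce")


def validate_errors(errors):
    """Hypothetical helper: reject anything outside the three allowed values."""
    # Reject non-string input (e.g. a dict) as well as unknown strings.
    if not isinstance(errors, str) or errors not in ALLOWED_ERRORS:
        raise ValueError(
            f"invalid error value specified: {errors!r}; "
            f"expected one of {ALLOWED_ERRORS}"
        )
    return errors


validate_errors("coerce")  # an allowed value passes through unchanged

try:
    validate_errors({"errors": "ignore"})
except ValueError as exc:
    rejected = str(exc)
    print(rejected)
```

Raising early turns the silent NaN result from the original report into an explicit error at the call site.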

SummerGram commented May 17, 2019

@jorisvandenbossche the value of the errors variable received inside to_numeric is shown below:

In[]: print(df.apply(numeric.to_numeric, args={'errors': 'ignore'}).to_string())
Out[]: 
errors
In[]: print(df.apply(numeric.to_numeric, args={'testing': 'ignore'}).to_string())
Out[]: 
testing
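The output above is consistent with plain Python unpacking: when a dict is star-unpacked into positional arguments, its keys are passed, not its values. A minimal sketch, with `show_errors` as a hypothetical function standing in for to_numeric, and assuming apply forwards `args` with `*args`:

```python
def show_errors(values, errors="raise"):
    """Hypothetical stand-in for to_numeric: return what `errors` received."""
    return errors


column = ["fire", "hose"]

# Star-unpacking a dict iterates over its keys, so the key string
# becomes the second positional argument.
received_a = show_errors(column, *{"errors": "ignore"})
received_b = show_errors(column, *{"testing": "ignore"})

print(received_a)  # errors
print(received_b)  # testing
```

In both cases the string that reaches `errors` is neither 'ignore' nor 'raise', which is why the unvalidated check ends up coercing.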

sumanau7 (Contributor) commented

@gfyoung Hi, I have added a defensive check for the errors argument, similar to the one you added for the downcast argument. Could you please review the PR and share your thoughts?

jorisvandenbossche modified the milestones: Contributions Welcome, 0.25.0 May 20, 2019
5 participants