to_numeric not correctly converting decimal type to float #21551

MikeWoodward · 2018-06-20T02:08:45Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import decimal 

# Set up date
data = {'line': [1,2,3,4],
        'decimal': [12345678.1, 12345678.01, 12345678.001, 12345678.0001],
        'original': [12345678.1, 12345678.01, 12345678.001, 12345678.0001]}
df = pd.DataFrame(data)

df.to_csv('df.csv', index=False)

# Convert 'decimal' to decimal type
df = pd.read_csv('df.csv', converters={'decimal': decimal.Decimal})

df['float_n'] = pd.to_numeric(df['decimal'], downcast='float', errors='coerce')
df['float_a'] = df['decimal'].astype(float)

for index, row in df.iterrows():
    print(row['float_n'], row['float_a'])

Problem description

I'm using to_numeric to convert the results of a database query from a decimal type to a float. I've found that to_numeric rounds the result in some cases giving an incorrect answer. Converting the data using astype correctly converts the data.

I can't reproduce my example here because it uses a database call and it's a work project, but I have created sample code that shows the same behavior.

Here's the output from my example:

12345678.0 12345678.1
12345678.0 12345678.01
12345678.0 12345678.001
12345678.0 12345678.0001

The first column is the result of the to_numeric conversion, the second column is the result of the astype(float) conversion. These columns should contain the same values, but they don't. to_numeric is rounding the results and astype isn't.

Expected Output

12345678.1 12345678.1
12345678.01 12345678.01
12345678.001 12345678.001
12345678.0001 12345678.0001

In other words, to_numeric and astype should give the same result - to_numeric should not round the data.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2018-06-20T15:59:54Z

You included downcast='float' in the to_numeric call, which makes that the resulting dtype is float32, which cannot hold the full precision.

You get a similar effect if doing the same operation on the original floats (not the decimals):

In [7]:  pd.to_numeric(df['original'], downcast='float', errors='coerce')
Out[7]: 
0    12345678.0
1    12345678.0
2    12345678.0
3    12345678.0
Name: original, dtype: float32

MikeWoodward · 2018-06-21T13:02:13Z

I see the point, but I have two comments:

astype and to_numeric 'should' behave in the same way for the same arguments. I understand that astype(float) is giving float64 and to_numeric(downcast='float') is giving me float32, but it does seem inconsistent and not obvious.
the documentation could be clearer on this point.

Maybe to_numeric could have more specific options for conversion, like float64? I like the 'coerce' error argument option in to_numeric so I don't want to move to astype.

jorisvandenbossche · 2018-06-21T14:38:41Z

astype and to_numeric 'should' behave in the same way for the same arguments

Yes, but you were not using the same arguments (as you noted yourself)? So then we should not expect it to be the same?

the documentation could be clearer on this point.

Sure! It can indeed be much clearer that the default dtype is float64 or int64 depending on your data, and downcast only should be used to differ from that default.
A PR is certainly welcome for that.

Maybe to_numeric could have more specific options for conversion, like float64?

Is it that you want to be sure it is float64 and not int64 depending on your data? (or what is exactly the use case?)
As once you have float-like data, the output will always be float64.

gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations IO CSV read_csv, to_csv Docs labels Jun 21, 2018

uds5501 mentioned this issue Jun 22, 2018

DOC: Adding clarification on return dtype of to_numeric #21585

Merged

4 tasks

gfyoung removed the IO CSV read_csv, to_csv label Jun 22, 2018

jreback added this to the 0.24.0 milestone Jun 22, 2018

jreback closed this as completed in #21585 Jun 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

to_numeric not correctly converting decimal type to float #21551

to_numeric not correctly converting decimal type to float #21551

MikeWoodward commented Jun 20, 2018

INSTALLED VERSIONS

jorisvandenbossche commented Jun 20, 2018

MikeWoodward commented Jun 21, 2018 •

edited

Loading

jorisvandenbossche commented Jun 21, 2018

to_numeric not correctly converting decimal type to float #21551

to_numeric not correctly converting decimal type to float #21551

Comments

MikeWoodward commented Jun 20, 2018

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jorisvandenbossche commented Jun 20, 2018

MikeWoodward commented Jun 21, 2018 • edited Loading

jorisvandenbossche commented Jun 21, 2018

Output of `pd.show_versions()`

MikeWoodward commented Jun 21, 2018 •

edited

Loading