Skip to content

to_numeric not correctly converting decimal type to float #21551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MikeWoodward opened this issue Jun 20, 2018 · 3 comments · Fixed by #21585
Closed

to_numeric not correctly converting decimal type to float #21551

MikeWoodward opened this issue Jun 20, 2018 · 3 comments · Fixed by #21585
Labels
Docs Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@MikeWoodward
Copy link

Code Sample, a copy-pastable example if possible

import pandas as pd
import decimal 

# Set up date
data = {'line': [1,2,3,4],
        'decimal': [12345678.1, 12345678.01, 12345678.001, 12345678.0001],
        'original': [12345678.1, 12345678.01, 12345678.001, 12345678.0001]}
df = pd.DataFrame(data)

df.to_csv('df.csv', index=False)

# Convert 'decimal' to decimal type
df = pd.read_csv('df.csv', converters={'decimal': decimal.Decimal})

df['float_n'] = pd.to_numeric(df['decimal'], downcast='float', errors='coerce')
df['float_a'] = df['decimal'].astype(float)

for index, row in df.iterrows():
    print(row['float_n'], row['float_a'])

Problem description

I'm using to_numeric to convert the results of a database query from a decimal type to a float. I've found that to_numeric rounds the result in some cases giving an incorrect answer. Converting the data using astype correctly converts the data.

I can't reproduce my example here because it uses a database call and it's a work project, but I have created sample code that shows the same behavior.

Here's the output from my example:

12345678.0 12345678.1
12345678.0 12345678.01
12345678.0 12345678.001
12345678.0 12345678.0001

The first column is the result of the to_numeric conversion, the second column is the result of the astype(float) conversion. These columns should contain the same values, but they don't. to_numeric is rounding the results and astype isn't.

Expected Output

12345678.1 12345678.1
12345678.01 12345678.01
12345678.001 12345678.001
12345678.0001 12345678.0001

In other words, to_numeric and astype should give the same result - to_numeric should not round the data.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Copy link
Member

You included downcast='float' in the to_numeric call, which makes that the resulting dtype is float32, which cannot hold the full precision.

You get a similar effect if doing the same operation on the original floats (not the decimals):

In [7]:  pd.to_numeric(df['original'], downcast='float', errors='coerce')
Out[7]: 
0    12345678.0
1    12345678.0
2    12345678.0
3    12345678.0
Name: original, dtype: float32

@MikeWoodward
Copy link
Author

MikeWoodward commented Jun 21, 2018

I see the point, but I have two comments:

  • astype and to_numeric 'should' behave in the same way for the same arguments. I understand that astype(float) is giving float64 and to_numeric(downcast='float') is giving me float32, but it does seem inconsistent and not obvious.
  • the documentation could be clearer on this point.

Maybe to_numeric could have more specific options for conversion, like float64? I like the 'coerce' error argument option in to_numeric so I don't want to move to astype.

@jorisvandenbossche
Copy link
Member

astype and to_numeric 'should' behave in the same way for the same arguments

Yes, but you were not using the same arguments (as you noted yourself)? So then we should not expect it to be the same?

the documentation could be clearer on this point.

Sure! It can indeed be much clearer that the default dtype is float64 or int64 depending on your data, and downcast only should be used to differ from that default.
A PR is certainly welcome for that.

Maybe to_numeric could have more specific options for conversion, like float64?

Is it that you want to be sure it is float64 and not int64 depending on your data? (or what is exactly the use case?)
As once you have float-like data, the output will always be float64.

@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations IO CSV read_csv, to_csv Docs labels Jun 21, 2018
@gfyoung gfyoung removed the IO CSV read_csv, to_csv label Jun 22, 2018
@jreback jreback added this to the 0.24.0 milestone Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants