ENH: Support high precision converters in to_numeric #19463

Closed
xmduhan opened this issue Jan 30, 2018 · 8 comments
Labels
Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@xmduhan
xmduhan commented Jan 30, 2018

import pandas as pd
from pandas.compat import StringIO
data = StringIO("""
a, b
1, 1
1.7976931348623157e+308, 1
""")
pd.read_csv(data, dtype={'a': float, 'b': float}, engine='python')

This works!

import pandas as pd
from pandas.compat import StringIO
data = StringIO("""
a, b
1, 1
1.7976931348623157e+308, 1
""")
pd.read_csv(data, dtype={'a': float, 'b': float}, engine='c')

This fails!
Message: ValueError: cannot safely convert passed user dtype of float64 for object dtyped data in column 0

Full error stack:

ValueError                                Traceback (most recent call last)
<ipython-input-4-f830b6441439> in <module>()
      5 1.7976931348623157e+308, 1
      6 """)
----> 7 pd.read_csv(data, dtype={'a': float, 'b': float}, engine='c')

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    453 
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 
   1071         if self.options.get('as_recarray'):

/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.pyc in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

ValueError: cannot safely convert passed user dtype of float64 for object dtyped data in column 0

### pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-139-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.2
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.6.0

@chris-b1
Contributor

For reasons not entirely clear to me, our xstrtod implementation can't parse the maximum 64 bit float. xref #17154, #19361

A workaround is to use the 'high precision' float parser - some discussion in #17154 about making that the default.

data = StringIO("""
a, b
1, 1
1.7976931348623157e+308, 1
""")
pd.read_csv(data, dtype={'a': float, 'b': float}, engine='c', float_precision='high')
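As a rough sketch of the workaround above, the snippet below parses the same value with the C engine under two `float_precision` settings and reports the resulting dtype. The helper name `read_column_a` and the cleaned-up CSV text (no spaces after the commas) are assumptions made for illustration, not part of the original report.

```python
from io import StringIO

import pandas as pd

# Same values as in the report, without the spaces after the commas.
CSV_TEXT = "a,b\n1,1\n1.7976931348623157e+308,1\n"


def read_column_a(float_precision):
    """Parse CSV_TEXT with the C engine and return column 'a'.

    Hypothetical helper for illustration only. No dtype is forced, so
    the column dtype reflects whatever the chosen float parser produced.
    """
    frame = pd.read_csv(
        StringIO(CSV_TEXT), engine="c", float_precision=float_precision
    )
    return frame["a"]


if __name__ == "__main__":
    for mode in ("high", "round_trip"):
        column = read_column_a(mode)
        print(mode, column.dtype, repr(column.iloc[1]))
```

Leaving `dtype` unset keeps the comparison non-fatal: the script prints what each parser made of the column rather than stopping at the first `ValueError`.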

@chris-b1 added the Bug, IO Data (IO issues that don't fit into a more specific label), and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Jan 30, 2018
@chris-b1
Contributor

Also applies to pd.to_numeric

pd.to_numeric(['1.7976931348623157e+308'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "1.7976931348623157e+308"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-29-15b8074a875c> in <module>()
----> 1 pd.to_numeric(['1.7976931348623157e+308'])

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in ('ignore', 'raise') else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134 
    135     except Exception:

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "1.7976931348623157e+308" at position 0

@jreback
Contributor

jreback commented Jan 30, 2018

This is not a bug; rather, it is the exact reason for the high precision option. There is a similar issue IIRC; see if you can find it. cc @gfyoung

@gfyoung
Member

gfyoung commented Jan 31, 2018

@jreback : Agreed. This is intended behavior.

@chris-b1 : However, your example with pd.to_numeric strikes me as interesting...I feel like we would want to support conversion to numeric with high-precision (if specified)? @jreback : Thoughts?

@jreback
Contributor

jreback commented Jan 31, 2018

yeah i think to_numeric should use the high precision converters (though not sure if this goes thru the same path)

@gfyoung
Member

gfyoung commented Jan 31, 2018

though not sure if this goes thru the same path

AFAICT, the xstrtod functionality is just for the parsers.py file. Though if you want to integrate with it, you would probably have to add several more parameters to the to_numeric interface for it to provide similar functionality as in read_csv (e.g. decimal and sci, the floating point options used in tokenizer.h), OR perhaps we just hard-code values for those parameters.
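To make the interface question concrete, here is a minimal sketch of a wrapper that takes a `decimal` option and delegates the parsing to Python's built-in `float()`, which rounds correctly. The function `to_numeric_precise` and its signature are hypothetical; this is not pandas API and not how pandas implements `to_numeric`.

```python
import pandas as pd


def to_numeric_precise(values, decimal="."):
    """Hypothetical helper: parse strings to float64 via Python's float().

    `decimal` mimics the read_csv option of the same name; it is an
    assumption for illustration, not an existing to_numeric parameter.
    """
    parsed = []
    for value in values:
        text = str(value).strip()
        if decimal != ".":
            # Normalize a custom decimal separator before parsing.
            text = text.replace(decimal, ".")
        parsed.append(float(text))
    return pd.Series(parsed, dtype="float64")
```

For example, `to_numeric_precise(["1.7976931348623157e+308"])` yields the largest finite float64, and `to_numeric_precise(["1,5"], decimal=",")` yields 1.5. The per-element Python loop is much slower than a C tokenizer, so this only illustrates the shape of the interface.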

@mroeschke changed the title from "c engine can't load data which python engine do!" to "ENH: Support high precision converters in to_numeric" May 2, 2020
@mroeschke
Member

Since the csv behavior is expected, repurposing this issue for the to_numeric enhancement

@mroeschke mroeschke added Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations and removed Numeric Operations Arithmetic, Comparison, and Logical operations IO Data IO issues that don't fit into a more specific label Usage Question labels May 2, 2020
@Dr-Irv
Contributor

Dr-Irv commented Sep 8, 2020

to_numeric now uses the high precision converter by default; see #36149
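A quick check, assuming a pandas release that includes that change: the call that failed earlier in this thread should now return a finite float64 value instead of raising.

```python
import numpy as np
import pandas as pd

# Assumes a pandas version that already contains the change referenced above;
# on older versions this call raises ValueError, as shown earlier in the thread.
result = pd.to_numeric(["1.7976931348623157e+308"])

# A list input comes back as a NumPy array.
print(result.dtype, np.isfinite(result[0]))
```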

@Dr-Irv Dr-Irv closed this as completed Sep 8, 2020

6 participants