In case of dtype issues, read_csv doesn't give an error as useful as pd.to_numeric does #15898

Closed
p-himik opened this issue Apr 5, 2017 · 7 comments · Fixed by #41607
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@p-himik

p-himik commented Apr 5, 2017

A follow-up to #13237. Examples copied from there:
Here's what to_numeric shows:

In [137]: pd.to_numeric(o)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:55708)()

ValueError: Unable to parse string "52721156299871854681072370812488856336599863860274272781560003496046130816295143841376557767523688876482371118681808060318819187029855011862637267061548684520491431327693943042785705302632892888308342787190965977140539558800921069356177102235514666302335984730634641934384020650987601849970936578094137344.00000"

And here's what read_csv shows (the data is at ftp://ftp.sanger.ac.uk/pub/consortia/ibdgenetics/iibdgc-trans-ancestry-summary-stats.tar):

In [138]: d = pd.read_csv('EUR.UC.gwas.assoc', delim_whitespace=True, usecols=['OR'], dtype={'OR': np.float64})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:14411)()

TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

[... long stacktrace ...]

pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:14632)()

ValueError: cannot safely convert passed user dtype of float64 for object dtyped data in column 8

@jreback
I've finally started looking into this, and it seems I can't implement it cleanly without changing NumPy: in the end, it is NumPy that gives no row or value information, although pandas conditionally replaces the exception with its own.

I can write an ad-hoc implementation for numeric conversion using pd.to_numeric though, and use its row/value information in case it raises an exception. What do you think?
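
For illustration, the ad-hoc approach proposed above could be approximated outside the parser roughly as follows. This is only a sketch under assumptions: `read_float_column` is a hypothetical helper, not a pandas function, and the sample data is made up. It reads the column as text first, so `read_csv` cannot fail on the dtype, and then lets `pd.to_numeric` report the offending value.

```python
from io import StringIO

import pandas as pd


def read_float_column(buffer, column):
    """Read one column as text, then convert it with pd.to_numeric.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Reading as strings means read_csv itself cannot fail on the dtype.
    raw = pd.read_csv(buffer, usecols=[column], dtype={column: str})[column]
    try:
        return pd.to_numeric(raw)
    except ValueError as err:
        # to_numeric's message names the string it could not parse.
        raise ValueError(f"column {column!r}: {err}") from err


# Made-up sample data with one unparseable value.
data = StringIO("OR\n1.5\nnot-a-number\n2.5\n")
try:
    read_float_column(data, "OR")
except ValueError as err:
    message = str(err)
    print(message)
```

The cost of this design is a second conversion pass over the column, which is why it only makes sense as a diagnostic fallback.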

@jreback
Contributor

jreback commented Apr 5, 2017

@p-himik yeah, I suppose it's possible to try/except the .astype (as this would happen only on an error); you have to be careful to only capture the last one though (as the type conversions go through a series of code paths that try various type conversions).

Then you could run lib.maybe_convert_numeric (which has the nice error message; this is what to_numeric calls). Also happy to have improvements in the other error messages there (to add the position).
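
The try/except pattern described above might look roughly like this when written against public APIs. It is a sketch, not the parser change itself: `astype_float_with_position` is a hypothetical name, and `pd.to_numeric` stands in for the private `lib.maybe_convert_numeric`, which the comment says it calls.

```python
import pandas as pd


def astype_float_with_position(values: pd.Series) -> pd.Series:
    """Cast to float64; on failure, re-raise with to_numeric's message.

    Hypothetical helper for illustration only; not part of pandas.
    """
    try:
        return values.astype("float64")
    except (TypeError, ValueError) as first_error:
        # The extra pass runs only on the error path, so the normal
        # case pays nothing for the better message.
        try:
            pd.to_numeric(values)
        except ValueError as detailed:
            raise ValueError(str(detailed)) from first_error
        # to_numeric unexpectedly succeeded: keep the original error.
        raise


try:
    astype_float_with_position(pd.Series(["1.0", "oops"], dtype=object))
except ValueError as err:
    print(err)
```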

jreback added the Error Reporting (Incorrect or improved errors from pandas), Difficulty Intermediate, and IO CSV (read_csv, to_csv) labels on Apr 5, 2017
jreback added this to the Next Major Release milestone on Apr 5, 2017
@chris-b1
Contributor

chris-b1 commented Apr 7, 2017

This is actually kind of complex, as we're only using numpy for secondary type conversion (e.g. float64 -> float32); the actual string-to-number parsing is in the low-level csv parsing code.

I do think it would be possible, and nice, to catch the case where the user dtype is float and float parsing fails. Basically, we would need to bubble up the error from this line, conditional on being passed a float dtype.

@simonjayhawkins
Member

Does anyone know whether this issue is stale or has been sorted? I've not tried to reproduce the code sample yet.

@chris-b1
Contributor

chris-b1 commented Sep 6, 2018

I'd encourage you to try the sample, or make a simpler reproducing case, but as far as I know this is still an issue.

@simonjayhawkins
Member

Thanks @chris-b1, I'm unsure how to reproduce the sample code, so I guess coming up with a simpler case is the way forward.

@simonjayhawkins
Member

A simpler code sample:

>>> import pandas as pd
>>>
>>> pd.to_numeric('1.7976931348623157e+308')
Traceback (most recent call last):
  File "pandas/_libs/src\inference.pyx", line 1152, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1.7976931348623157e+308"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\Anaconda3\lib\site-packages\pandas\core\tools\numeric.py", line 133, in to_numeric
    coerce_numeric=coerce_numeric)
  File "pandas/_libs/src\inference.pyx", line 1185, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1.7976931348623157e+308" at position 0
>>> from io import StringIO
>>>
>>> import numpy as np
>>> import pandas as pd
>>>
>>> csv_str = StringIO(('1.7976931348623157e+308, 1.7976931348623157e+308 '))
>>> df = pd.read_csv(csv_str, engine='c', names=["a", "b"], dtype={
...                  "a": np.str, "b": np.float64})
Traceback (most recent call last):
  File "pandas\_libs\parsers.pyx", line 1156, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\simon\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\simon\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "C:\Users\simon\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "C:\Users\simon\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1164, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: cannot safely convert passed user dtype of float64 for object dtyped data in column 1

@mroeschke
Member

This looks to work on master now. Could use a test

In [5]: >>> import pandas as pd
   ...: >>>
   ...: >>> pd.to_numeric('1.7976931348623157e+308')
Out[5]: 1.7976931348623157e+308

In [6]: >>> from io import StringIO
   ...: >>>
   ...: >>> import numpy as np
   ...: >>> import pandas as pd
   ...: >>>
   ...: >>> csv_str = StringIO(('1.7976931348623157e+308, 1.7976931348623157e+308 '))
   ...: >>> df = pd.read_csv(csv_str, engine='c', names=["a", "b"], dtype={
   ...: ...                  "a": np.str, "b": np.float64})
<ipython-input-6-0fb2b8c2cf42>:8: DeprecationWarning: `np.str` is a deprecated alias for the builtin `str`. To silence this warning, use `str` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.str_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  "a": np.str, "b": np.float64})

In [7]: df
Out[7]:
                         a              b
0  1.7976931348623157e+308  1.797693e+308
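
A regression test along the lines requested above could be sketched as follows. The test name and the exact assertions are assumptions made for illustration, not the test that was eventually merged. The builtin `str` replaces the deprecated `np.str` alias, as the warning in the session recommends.

```python
from io import StringIO

import numpy as np
import pandas as pd


def test_read_csv_max_float_with_float64_dtype():
    # Hypothetical test name; the input is the simplified reproduction
    # from this thread.
    csv_str = StringIO("1.7976931348623157e+308, 1.7976931348623157e+308 ")
    df = pd.read_csv(
        csv_str,
        engine="c",
        names=["a", "b"],
        dtype={"a": str, "b": np.float64},
    )
    # Column "a" is kept as text; column "b" must come back as float64
    # instead of raising the "cannot safely convert" ValueError.
    assert df.loc[0, "a"] == "1.7976931348623157e+308"
    assert df["b"].dtype == np.float64
    assert df.loc[0, "b"] > 1e308
```

The value check is deliberately loose (`> 1e308`) because this sketch does not assume that the default float parser rounds the largest float64 exactly.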

mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Error Reporting (Incorrect or improved errors from pandas) and IO CSV (read_csv, to_csv) labels on May 8, 2021
mroeschke mentioned this issue on May 21, 2021
10 tasks
jreback modified the milestones: Contributions Welcome, 1.3 on May 21, 2021
6 participants