BUG: integer overflow in csv_reader #47167

Open
3 tasks done
SandroCasagrande opened this issue May 30, 2022 · 2 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@SandroCasagrande
Contributor

SandroCasagrande commented May 30, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# content of pandas/tests/io/parser/test_example.py
from io import StringIO

import numpy as np
import pytest
from pandas._libs.parsers import TextReader
from pandas.api.types import is_extension_array_dtype

import pandas._testing as tm
from pandas import array
from pandas.io.parsers.c_parser_wrapper import ensure_dtype_objs


@pytest.mark.parametrize(
    "dtype", [
        "uint64", "int64", "uint32", "int32", "uint16", "int16", "uint8", "int8",
        "UInt64","Int64", "UInt32", "Int32", "UInt16", "Int16", "UInt8", "Int8"
    ]
)
def test_integer_overflow_with_user_dtype(dtype):
    dtype = ensure_dtype_objs(dtype)
    is_ext_dtype = is_extension_array_dtype(dtype)
    maxint = np.iinfo(dtype.type if is_ext_dtype else dtype).max

    reader = TextReader(StringIO(f"{maxint}"), header=None, dtype=dtype)
    result = reader.read()
    if is_ext_dtype:
        expected = array([maxint], dtype=dtype)
        tm.assert_extension_array_equal(result[0], expected)
    else:
        expected = np.array([maxint], dtype=dtype)
        tm.assert_numpy_array_equal(result[0], expected)

    reader = TextReader(StringIO(f"{maxint + 1}"), header=None, dtype=dtype)
    with pytest.raises(Exception):
        result = reader.read()
        print(result, end=" ")

$ pytest pandas/tests/io/parser/test_example.py -sv
========================================================================================================= test session starts ==========================================================================================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=None, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('/home/pandas/.hypothesis/examples')
rootdir: /home/pandas, configfile: pyproject.toml
plugins: cython-0.2.0, xdist-2.5.0, cov-3.0.0, asyncio-0.18.3, forked-1.4.0, hypothesis-6.46.9, instafail-0.4.1
asyncio: mode=strict
collected 16 items                                                                                                                                                                                                                     

pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int64] {0: array([9223372036854775808], dtype=uint64)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint32] {0: array([0], dtype=uint32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int32] {0: array([-2147483648], dtype=int32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint16] {0: array([0], dtype=uint16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int16] {0: array([-32768], dtype=int16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint8] {0: array([0], dtype=uint8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int8] {0: array([-128], dtype=int8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt8] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int8] PASSED

...

Issue Description

As the example shows, all invocations with the extension dtype variants (UInt64, etc.) and with the non-extension dtype uint64 parse the maximum value correctly and raise an exception at max + 1. More specifically, we get an OverflowError for uint64, a ValueError for UInt64, and TypeErrors for all other extension dtypes, so I simply checked for any exception in the example. This is the safe and, IMHO, expected behavior.

The issue arises when parsing an integer value with a user-defined dtype, i.e. TextReader(..., dtype=...) with a dtype other than None, and only for non-extension dtypes:

  1. Requesting int64, we obtain maxint64 + 1 as a uint64. This is at least safe, but it is not expected and differs from the behavior of Int64.
  2. For all other non-extension dtypes, a silent overflow occurs.

The second problem comes from

result = result.astype(dtype)

where the default casting="unsafe" parameter is used. Furthermore, for int64, we do not reach this line and just return with the result from _try_uint64.
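
To make the silent wrap-around concrete, here is a minimal NumPy-only sketch (it does not touch the parser). It assumes the parsed value is held in an int64 array before the cast, which is an illustration and not necessarily what the parser does internally. It also shows that simply switching to casting="safe" would not be a fix, because that rule looks at the dtypes, not at the values.

```python
import numpy as np

# One past the int16 maximum, held in a wider dtype (assumed for illustration).
values = np.array([np.iinfo(np.int16).max + 1], dtype=np.int64)

# Default casting="unsafe": the value wraps around without any warning.
wrapped = values.astype(np.int16)
print(wrapped)  # [-32768]

# casting="safe" rejects every int64 -> int16 cast, whatever the values are,
# so it would also refuse inputs that fit perfectly well.
try:
    values.astype(np.int16, casting="safe")
    safe_cast_raised = False
except TypeError:
    safe_cast_raised = True
print(safe_cast_raised)
```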

Expected Behavior

Non-extension integer dtypes should behave the same as the extension dtypes, i.e. return exactly the requested dtype (if specified by the user) and raise when this dtype is insufficient to hold the parsed value.
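
As a rough illustration of that contract, the sketch below parses a single integer and raises when it does not fit the requested dtype. The helper name parse_int_checked is hypothetical and not part of pandas, and the choice of OverflowError is an assumption for the example; the range check relies only on np.iinfo and Python's arbitrary-precision int.

```python
import numpy as np


def parse_int_checked(text, dtype):
    """Hypothetical helper (not a pandas API): parse one integer strictly."""
    dtype = np.dtype(dtype)
    info = np.iinfo(dtype)
    # Python ints have arbitrary precision, so this step can never wrap.
    value = int(text)
    if not info.min <= value <= info.max:
        raise OverflowError(f"{value} is out of bounds for {dtype}")
    # The result always has exactly the requested dtype.
    return np.array([value], dtype=dtype)


print(parse_int_checked("32767", "int16"))
```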

Installed Versions

1.5.0.dev0+839.gc355145c7f

@SandroCasagrande added the Bug and Needs Triage labels May 30, 2022
@SandroCasagrande
Contributor Author

A possible solution is to include a check after the cast like for the extension dtypes:

casted = values.astype(dtype, copy=copy)
if (casted == values).all():
    return casted
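
A standalone sketch of that round-trip check, for plain NumPy arrays, could look like the following. The function name checked_astype and the ValueError message are hypothetical; the actual PR may differ.

```python
import numpy as np


def checked_astype(values, dtype):
    """Hypothetical helper: cast, then verify that no value changed."""
    casted = values.astype(dtype)
    # If any element wrapped around, it no longer compares equal to the input.
    if (casted == values).all():
        return casted
    raise ValueError(
        f"values cannot be losslessly converted to {np.dtype(dtype)}"
    )


print(checked_astype(np.array([127], dtype=np.int64), np.int8))  # [127]
```

Compared with a range check before the cast, this costs one extra comparison pass over the data but needs no per-dtype bounds logic.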

A PR follows ...

@topper-123
Contributor

This fails a bit differently now:

========================================================================================================= short test summary info =========================================================================================================
FAILED ../startup.py::test_integer_overflow_with_user_dtype[int64] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[uint32] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[int32] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[uint16] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[int16] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[uint8] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[int8] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[Int64] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[UInt32] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[Int32] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[UInt16] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[Int16] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[UInt8] - Failed: DID NOT RAISE <class 'Exception'>
FAILED ../startup.py::test_integer_overflow_with_user_dtype[Int8] - Failed: DID NOT RAISE <class 'Exception'>
====================================================================================================== 14 failed, 2 passed in 0.58s =======================================================================================================

However, TextReader isn't part of the public API, so we give no guarantees about how it works, and its behavior is not in itself the bug. The bug appears when using read_csv, e.g.

>>> from io import StringIO
>>> import numpy as np
>>> import pandas as pd
>>> dtype = np.dtype(np.int16)
>>> maxint = np.iinfo(dtype).max
>>> text = f"{maxint + 1}"
>>> pd.DataFrame([maxint + 1], dtype=dtype)  # fails
ValueError: Values are too large to be losslessly converted to int16. To cast anyway, use pd.Series(values).astype(int16)
>>> pd.read_csv(StringIO(text), header=None, dtype=dtype)  # overflows
       0
0 -32768

I.e. read_csv behaves differently from the DataFrame constructor, and IMO it should behave the same way. So I'll change the title of the issue to reflect that (the solution to both issues may be the same).
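
Until read_csv itself raises, a user-level workaround is possible. The sketch below is only an illustration under stated assumptions: the helper read_csv_checked_int is hypothetical, it assumes every column holds integers without missing values, and it lets read_csv infer the dtype first and only narrows after an explicit range check.

```python
from io import StringIO

import numpy as np
import pandas as pd


def read_csv_checked_int(text, dtype):
    """Hypothetical workaround: infer the dtype, range-check, then narrow."""
    dtype = np.dtype(dtype)
    info = np.iinfo(dtype)
    # No dtype argument: pandas picks a type wide enough for the data.
    df = pd.read_csv(StringIO(text), header=None)
    for col in df.columns:
        # Convert to Python ints so the comparison itself cannot overflow.
        lo, hi = int(df[col].min()), int(df[col].max())
        if lo < info.min or hi > info.max:
            raise ValueError(
                f"column {col} has values outside the range of {dtype}"
            )
    return df.astype(dtype)


print(read_csv_checked_int("32767", np.int16).dtypes)
```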

@topper-123 changed the title from "BUG: integer overflow in TextReader" to "BUG: integer overflow in csv_reader" May 8, 2023
@topper-123 added the IO CSV label and removed the Needs Triage label May 8, 2023
2 participants