BUG: Raise on parse int overflow #47167 #47168


Closed
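
For context, the user-visible change this PR makes: with a user-specified integer dtype that is too small for a parsed value, read_csv could previously wrap the value silently instead of raising. A stdlib-only sketch of what that silent wrap looks like (pandas and numpy not required; the real parser works on ndarrays):

```python
# "Silent overflow" illustrated without pandas: a value that does not fit
# uint8 wraps modulo 256 under an unchecked cast. This PR makes read_csv
# detect the non-equivalent cast and raise instead of returning the wrap.
val = 300
wrapped = val & 0xFF   # unchecked cast to uint8 keeps only the low byte
print(wrapped)         # 44
assert wrapped != val  # the lossy result the parser now refuses to return
```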
Commits
50 commits
8d0efca
TST: integer overflow on parsing with insufficient user dtype
SandroCasagrande May 30, 2022
27629f0
BUG: raise on integer overflow when parsing with insufficient user dtype
SandroCasagrande May 30, 2022
bfb0b89
Fixes from pre-commit [automated commit]
SandroCasagrande May 30, 2022
661c853
DOC: added entry in whatsnew
SandroCasagrande May 30, 2022
ccb6f61
Introduce empty match in pytest.raises for flake8
SandroCasagrande May 30, 2022
a3b458a
Changed import location of is_extension_array_dtype for type check
SandroCasagrande May 30, 2022
994a634
PERF: avoid try parse as int64 if user specified uint64
SandroCasagrande May 31, 2022
567fd58
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande May 31, 2022
43bcb22
TST: simple asv for uint8 parsing
SandroCasagrande May 31, 2022
61a36a5
BUG: stringio rewind in asv ReadCSVIndexCol
SandroCasagrande May 31, 2022
927ddad
CLN: simplified conditional logic for int parsing
SandroCasagrande Jun 6, 2022
d992af5
Merge remote-tracking branch 'upstream/main' into raise-on-parse-int-…
SandroCasagrande Jul 22, 2022
498b93d
TST: reduced repetition by using any_int_dtype in test
SandroCasagrande Jul 23, 2022
9e1fbbc
TST: added tests for read_csv with both engines c and python
SandroCasagrande Jul 23, 2022
ef91ab5
BUG: raise on integer overflow when parsing with insufficient user dtype
SandroCasagrande Jul 23, 2022
270eb90
TST: added/modified tests to raise on lossy float conversion due to s…
SandroCasagrande Jul 24, 2022
dd5cd0e
DOC: minor correction in test docstring
SandroCasagrande Jul 24, 2022
64047b4
DOC: explained changes in whatsnew in terms of public api
SandroCasagrande Jul 24, 2022
d812c32
TST: added missing skip_pyarrow mark
SandroCasagrande Jul 25, 2022
b1f83b9
TST: specified exceptions in pytest.raises
SandroCasagrande Jul 25, 2022
8f4c947
TST: replaced loop cases with parametrized tests
SandroCasagrande Jul 25, 2022
a935ac9
Merge remote-tracking branch 'upstream/main' into raise-on-parse-int-…
SandroCasagrande Jul 26, 2022
0c9f4e8
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Jul 27, 2022
8353cba
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Jul 28, 2022
60b3018
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 4, 2022
3e5f929
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Aug 9, 2022
ba40923
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 12, 2022
3d72cf2
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 13, 2022
3f39a5b
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 16, 2022
7cf208f
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 23, 2022
520cae3
CLN: moved na-check into else branch
SandroCasagrande Aug 23, 2022
88d8650
CLN: re-use maybe_cast_to_integer_array for checked cast in python pa…
SandroCasagrande Aug 25, 2022
d96d6b0
TST: specified expected exception
SandroCasagrande Aug 25, 2022
3508a9f
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 25, 2022
2c16f74
TST: fixed int-overflow test
SandroCasagrande Aug 25, 2022
485dcfc
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Aug 26, 2022
5896e01
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Sep 5, 2022
b276196
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Sep 6, 2022
a1a6764
CLN: create asv input without overflow to prevent potential warnings
SandroCasagrande Sep 6, 2022
c9e8a92
DOC: fixed wording in whatsnew
SandroCasagrande Sep 6, 2022
39b5c91
TST: split float to int coercion test into two separate tests
SandroCasagrande Sep 6, 2022
92fab59
TST: improved comment and referenced issue
SandroCasagrande Sep 6, 2022
f653e96
TST: avoid conditional raise
SandroCasagrande Sep 6, 2022
8b406ab
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Sep 14, 2022
8d782fb
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Sep 23, 2022
09e6773
Merge branch 'pandas-dev:main' into raise-on-parse-int-overflow
SandroCasagrande Oct 10, 2022
a545602
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Oct 18, 2022
0439322
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Nov 2, 2022
b392f32
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Dec 15, 2022
fca428c
Merge branch 'main' into raise-on-parse-int-overflow
SandroCasagrande Dec 15, 2022
11 changes: 10 additions & 1 deletion asv_bench/benchmarks/io/csv.py
@@ -285,6 +285,15 @@ def time_read_uint64_na_values(self):
)


class ReadUint8Integers(StringIORewind):
def setup(self):
arr = np.tile(np.arange(256, dtype="uint8"), 50)
self.data1 = StringIO("\n".join(arr.astype(str).tolist()))

def time_read_uint8(self):
read_csv(self.data(self.data1), header=None, names=["foo"], dtype="uint8")


class ReadCSVThousands(BaseIO):

fname = "__test__.csv"
@@ -567,7 +576,7 @@ def setup(self):
self.StringIO_input = StringIO(data)

def time_read_csv_index_col(self):
read_csv(self.StringIO_input, index_col="a")
read_csv(self.data(self.StringIO_input), index_col="a")


from ..pandas_vb_common import setup # noqa: F401 isort:skip
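
The `self.data(...)` fix in `time_read_csv_index_col` above exists because a StringIO is exhausted after one read; without rewinding, every asv iteration after the first would parse an empty stream. A minimal sketch of the rewind helper (assumed to behave like `StringIORewind.data`):

```python
from io import StringIO

def rewind(buf):
    # assumed equivalent of StringIORewind.data: reset the cursor so each
    # benchmark iteration reads the full input again
    buf.seek(0)
    return buf

buf = StringIO("a\n1\n2\n")
first = rewind(buf).read()
second = rewind(buf).read()   # without the rewind this would be ""
assert first == second
```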
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -1136,6 +1136,8 @@ I/O
- Bug in :func:`read_parquet` with ``use_nullable_dtypes=True`` where ``float64`` dtype was returned instead of nullable ``Float64`` dtype (:issue:`45694`)
- Bug in :meth:`DataFrame.to_json` where ``PeriodDtype`` would not make the serialization roundtrip when read back with :meth:`read_json` (:issue:`44720`)
- Bug in :func:`read_xml` when reading XML files with Chinese character tags and would raise ``XMLSyntaxError`` (:issue:`47902`)
- Bug in :func:`read_csv` with specified numpy integer ``dtype`` can cause silent overflow or unexpected return dtype (:issue:`47167`)
- Bug in :func:`read_csv` with specified numpy integer ``dtype`` and ``engine="python"`` can cause silent lossy float coercion (:issue:`47167`)

Period
^^^^^^
35 changes: 24 additions & 11 deletions pandas/_libs/parsers.pyx
@@ -1189,19 +1189,32 @@ cdef class TextReader:
return result, na_count

elif is_integer_dtype(dtype):
try:
result, na_count = _try_int64(self.parser, i, start,
end, na_filter, na_hashset)
if user_dtype and na_count is not None:
if na_count > 0:
raise ValueError(f"Integer column has NA values in column {i}")
except OverflowError:
result = _try_uint64(self.parser, i, start, end,
na_filter, na_hashset)
if user_dtype and dtype == "uint64":
result = _try_uint64(self.parser, i, start,
end, na_filter, na_hashset)
na_count = 0
else:
try:
Contributor (review comment): the pattern here is to _try_dtype, then if that fails try another one. Can you just do that here (rather than all of this if/then logic)? E.g. add a _try_uint64 if needed.

Contributor Author: I tried to stay with the pattern _try_int64 -> if fail -> _try_uint64, but shortcut the two cases with user_dtype and dtype in ["int64", "uint64"], where we can fail after the specific _try_dtype. I added another _try_uint64 instead of the do_try_uint64. Is that okay?

Member: Hm, I agree with @jreback, this is hard to read.

Contributor Author (SandroCasagrande, Sep 6, 2022): I also agree and added a _try_uint64 in commit 927ddad like @jreback suggested. Were you also referring to the former version with do_try_uint64, @phofl? The latest version is https://github.com/SandroCasagrande/pandas/blob/5896e017ca2960e0d535c8c0a0b9db978377bc91/pandas/_libs/parsers.pyx#L1182-L1198. Sorry if I did not make clear that I had pushed changes and that the newest version should be reviewed again. Can you please have a look, @jreback or @phofl?


result, na_count = _try_int64(self.parser, i, start,
end, na_filter, na_hashset)
except OverflowError as err:
if user_dtype and dtype == "int64":
raise err
result = _try_uint64(self.parser, i, start,
end, na_filter, na_hashset)
na_count = 0
else:
if user_dtype and (na_count is not None) and (na_count > 0):
raise ValueError(f"Integer column has NA values in column {i}")

if result is not None and dtype != "int64":
result = result.astype(dtype)
if result is not None and dtype not in ("int64", "uint64"):
casted = result.astype(dtype)
if (casted == result).all():
Member (review comment): This looks expensive. Can you run asvs?

Contributor Author: Good point, thanks. I thought the same thing when I saw exactly this check being applied for parsing (and, more generally, setting) with the nullable integer extension dtypes:

    casted = values.astype(dtype, copy=copy)
    if (casted == values).all():
        return casted

However, I could not come up with a better version (one that does not repeat code or use fused types(?) in _try_int64). And finally, the impact is negligible (see below).

Just to clarify: the additional check is never performed when parsing with automatically inferred dtype (since TextReader.dtype_cast_order contains only int64 and no other integer dtypes), nor with user-specified dtype int64. I just committed another change that prevents unnecessarily running into this check when parsing with user-specified dtype uint64.
The check then only affects parsing with a user-specified integer dtype of size < 64 bit. As far as I can see, none of the existing asvs covers this.

Just to be sure I ran some existing asvs that seemed most relevant and found no significant change:

       before           after         ratio
     [c355145c]       [994a6345]
     <raise-on-parse-int-overflow~7>       <raise-on-parse-int-overflow>
       6.97±0.3ms       7.46±0.7ms     1.07  io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
       3.43±0.1ms      3.39±0.04ms     0.99  io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
       6.93±0.3ms       7.02±0.1ms     1.01  io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
       3.52±0.3ms      3.28±0.08ms     0.93  io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
       17.2±0.2ms      17.3±0.06ms     1.00  io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
       3.33±0.2ms       3.35±0.5ms     1.01  io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
      1.74±0.01ms      1.74±0.09ms     1.00  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
      2.82±0.03ms      2.81±0.03ms     0.99  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
      1.77±0.08ms      1.79±0.02ms     1.01  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
      2.85±0.02ms       2.87±0.1ms     1.01  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
       25.8±0.8ms       26.3±0.8ms     1.02  io.csv.ReadCSVCategorical.time_convert_direct('c')
          179±3ms          180±4ms     1.01  io.csv.ReadCSVCategorical.time_convert_direct('python')
         44.0±1ms         44.4±1ms     1.01  io.csv.ReadCSVCategorical.time_convert_post('c')
          169±4ms          168±3ms     0.99  io.csv.ReadCSVCategorical.time_convert_post('python')
         21.8±1ms         21.5±1ms     0.99  io.csv.ReadCSVComment.time_comment('c')
       22.5±0.5ms       21.8±0.7ms     0.97  io.csv.ReadCSVComment.time_comment('python')
       24.9±0.9ms         25.1±2ms     1.01  io.csv.ReadCSVConcatDatetime.time_read_csv
         13.1±1ms       12.0±0.3ms     0.92  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
       9.37±0.3ms       9.38±0.1ms     1.00  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
       14.7±0.4ms         15.0±1ms     1.02  io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
         86.1±2ms         87.7±2ms     1.02  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'custom')
      1.55±0.02ms      1.57±0.02ms     1.01  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'iso8601')
      1.47±0.01ms      1.51±0.04ms     1.03  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(False, 'ymd')
      4.39±0.06ms       4.45±0.1ms     1.01  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'custom')
      1.80±0.05ms      1.86±0.06ms     1.03  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'iso8601')
      2.00±0.01ms      2.00±0.02ms     1.00  io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(True, 'ymd')
       19.4±0.3ms         19.8±1ms     1.02  io.csv.ReadCSVEngine.time_read_bytescsv('c')
       6.25±0.4ms       6.45±0.6ms     1.03  io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
          303±2ms          301±3ms     0.99  io.csv.ReadCSVEngine.time_read_bytescsv('python')
       20.0±0.6ms       19.3±0.5ms     0.96  io.csv.ReadCSVEngine.time_read_stringcsv('c')
       7.43±0.5ms       6.93±0.1ms     0.93  io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
          264±2ms          262±6ms     0.99  io.csv.ReadCSVEngine.time_read_stringcsv('python')
      1.33±0.02ms      1.34±0.02ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
      2.23±0.02ms      2.22±0.02ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
      1.35±0.06ms      1.36±0.03ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
      1.39±0.01ms      1.41±0.01ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
      1.41±0.01ms      1.44±0.04ms     1.02  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
       1.47±0.2ms      1.43±0.02ms     0.98  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
      1.30±0.02ms      1.33±0.04ms     1.03  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
      2.21±0.06ms       2.25±0.1ms     1.02  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
      1.35±0.07ms      1.33±0.03ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
      1.42±0.03ms      1.41±0.01ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
      1.41±0.01ms       1.47±0.2ms     1.05  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
      1.40±0.01ms      1.41±0.01ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
       3.43±0.3ms       3.40±0.2ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
       3.45±0.2ms      3.34±0.03ms     0.97  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
      3.41±0.03ms       3.42±0.2ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
      2.83±0.03ms      2.82±0.02ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
      2.81±0.01ms      2.82±0.03ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
      2.83±0.09ms      2.80±0.02ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
      3.35±0.03ms      3.32±0.03ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
       3.52±0.2ms       3.55±0.2ms     1.01  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
       3.50±0.1ms       3.76±0.2ms     1.08  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
      2.80±0.02ms      2.79±0.02ms     0.99  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
      2.84±0.07ms      2.83±0.07ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
      2.81±0.03ms      2.81±0.04ms     1.00  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
      9.98±0.07ms       9.88±0.3ms     0.99  io.csv.ReadCSVIndexCol.time_read_csv_index_col
       69.0±0.4ms       74.1±0.3ms     1.07  io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
                0                0      n/a  io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
                0                0      n/a  io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
      1.45±0.01ms      1.56±0.06ms     1.07  io.csv.ReadCSVParseDates.time_baseline('c')
      1.58±0.03ms      1.57±0.02ms     0.99  io.csv.ReadCSVParseDates.time_baseline('python')
      1.78±0.01ms      1.81±0.06ms     1.01  io.csv.ReadCSVParseDates.time_multiple_date('c')
      1.99±0.03ms       2.30±0.3ms    ~1.15  io.csv.ReadCSVParseDates.time_multiple_date('python')
      2.99±0.03ms      2.93±0.02ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
       13.3±0.1ms       13.2±0.2ms     0.99  io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
       7.70±0.2ms      7.54±0.07ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
       40.0±0.8ms         39.9±2ms     1.00  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
      3.41±0.05ms      3.42±0.03ms     1.00  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
       14.2±0.8ms       13.9±0.6ms     0.98  io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
       10.3±0.1ms       11.0±0.2ms     1.06  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
       8.88±0.6ms       8.13±0.3ms     0.92  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
         46.3±1ms         46.4±2ms     1.00  io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
       14.8±0.4ms       15.0±0.3ms     1.02  io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
       8.15±0.6ms       8.34±0.3ms     1.02  io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
         68.0±1ms         64.1±2ms     0.94  io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
       14.1±0.3ms       14.5±0.3ms     1.03  io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
          166±4ms          161±3ms     0.97  io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
       11.4±0.2ms       11.6±0.2ms     1.02  io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
       58.4±0.9ms         58.0±1ms     0.99  io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
       13.4±0.2ms       13.8±0.2ms     1.02  io.csv.ReadCSVThousands.time_thousands('|', ',', 'c')
          168±3ms          164±6ms     0.98  io.csv.ReadCSVThousands.time_thousands('|', ',', 'python')
       11.3±0.2ms       11.8±0.4ms     1.04  io.csv.ReadCSVThousands.time_thousands('|', None, 'c')
         59.2±1ms       57.7±0.9ms     0.97  io.csv.ReadCSVThousands.time_thousands('|', None, 'python')
       3.38±0.6ms       3.55±0.2ms     1.05  io.csv.ReadUint64Integers.time_read_uint64
       5.53±0.2ms       5.67±0.2ms     1.03  io.csv.ReadUint64Integers.time_read_uint64_na_values
       5.62±0.4ms       5.43±0.2ms     0.96  io.csv.ReadUint64Integers.time_read_uint64_neg_values
          114±2ms          112±1ms     0.98  io.csv.ToCSV.time_frame('long')
       16.5±0.2ms       16.3±0.2ms     0.99  io.csv.ToCSV.time_frame('mixed')
         91.3±1ms         94.2±3ms     1.03  io.csv.ToCSV.time_frame('wide')
       7.71±0.4ms       7.79±0.3ms     1.01  io.csv.ToCSVDatetime.time_frame_date_formatting
      7.15±0.06ms      7.26±0.08ms     1.02  io.csv.ToCSVDatetimeBig.time_frame(1000)
         66.7±1ms       66.7±0.6ms     1.00  io.csv.ToCSVDatetimeBig.time_frame(10000)
          671±5ms          668±6ms     1.00  io.csv.ToCSVDatetimeBig.time_frame(100000)
          380±2ms          379±7ms     1.00  io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
          142±1ms          142±2ms     1.00  io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
          711±9ms          715±9ms     1.01  io.csv.ToCSVIndexes.time_head_of_multiindex
          721±5ms          718±6ms     1.00  io.csv.ToCSVIndexes.time_multiindex
          580±8ms          579±6ms     1.00  io.csv.ToCSVIndexes.time_standard_index
          238±5ms          234±4ms     0.98  io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
       21.5±0.1ms       22.1±0.3ms     1.03  io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
       22.5±0.5ms       22.6±0.2ms     1.00  io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
         798±60ms         747±10ms     0.94  io.excel.ReadExcel.time_read_excel('odf')
         196±10ms          180±3ms     0.92  io.excel.ReadExcel.time_read_excel('openpyxl')
         43.5±2ms         41.0±1ms     0.94  io.excel.ReadExcel.time_read_excel('xlrd')
         439±20ms          420±7ms     0.96  io.excel.WriteExcel.time_write_excel('openpyxl')
         233±10ms         230±10ms     0.99  io.excel.WriteExcel.time_write_excel('xlsxwriter')
          220±8ms          227±4ms     1.03  io.excel.WriteExcel.time_write_excel('xlwt')

I added a simple asv benchmark that triggers the relevant check. It is also not affected at all

       before           after         ratio
     [c355145c]       [994a6345]
     <raise-on-parse-int-overflow~7>       <raise-on-parse-int-overflow>
       1.34±0.1ms      1.31±0.04ms     0.98  io.csv.ReadUint8Integers.time_read_uint8

A separate timeit on the comparison (arr1 == arr2).all() shows that it takes ~5µs compared to ~1ms of the total read_csv.
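
That timing claim can be approximated with a stdlib-only stand-in for the ndarray comparison (numbers are machine-dependent; plain lists of the same length stand in for the arrays in the diff):

```python
import timeit

# Element-wise equality over ~12800 values, mirroring (casted == result).all()
a = list(range(256)) * 50
b = list(a)
per_call = timeit.timeit(lambda: a == b, number=1000) / 1000
print(f"{per_call * 1e6:.1f} us per comparison")  # machine-dependent
```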

Contributor Author: Some additional remarks: For running the asvs in io.csv I had to:

Member: @jbrockmendel, should _dtype_can_hold_range be used here to check int64 lossless conversion?

Member: dtype_can_hold_range is specific to range objects. Looks like we have an ndarray here?

Contributor Author: Yes, comparing ndarrays here. I could change it to np.array_equal for readability, but apart from some safety boilerplate the same check is performed there: https://github.com/numpy/numpy/blob/50a74fb65fc752e77a2f9e9e2b7227629c2ba953/numpy/core/numeric.py#L2468

result = casted
else:
raise TypeError(
f"cannot safely cast non-equivalent {result.dtype} to {dtype}"
)

return result, na_count
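
The control flow being discussed above can be summarized as a plain-Python sketch (hypothetical helper: bounds are hard-coded for two small dtypes, and a wrapping cast stands in for ndarray.astype; the real code in parsers.pyx operates on C buffers):

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1
BOUNDS = {"uint8": (0, 255), "int8": (-128, 127)}  # only two dtypes for the sketch

def parse_int_column(tokens, dtype="int64", user_dtype=False):
    vals = [int(t) for t in tokens]
    # Shortcut from the PERF commit: user asked for uint64, skip the int64 attempt
    if user_dtype and dtype == "uint64":
        if not all(0 <= v <= UINT64_MAX for v in vals):
            raise OverflowError("Overflow")
        return vals, "uint64"
    # Usual pattern: try int64, fall back to uint64 on overflow...
    if all(INT64_MIN <= v <= INT64_MAX for v in vals):
        result, out = vals, "int64"
    else:
        # ...unless the user explicitly asked for int64
        if user_dtype and dtype == "int64":
            raise OverflowError("Overflow")
        if not all(0 <= v <= UINT64_MAX for v in vals):
            raise OverflowError("Overflow")
        result, out = vals, "uint64"
    # Checked cast to a smaller user dtype: wrap, then verify equivalence
    if user_dtype and dtype not in ("int64", "uint64"):
        lo, hi = BOUNDS[dtype]
        span = hi - lo + 1
        casted = [((v - lo) % span) + lo for v in result]  # wrapping cast
        if casted != result:
            raise TypeError(f"cannot safely cast non-equivalent int64 to {dtype}")
        result, out = casted, dtype
    return result, out

print(parse_int_column(["255"], dtype="uint8", user_dtype=True))  # ([255], 'uint8')
```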

8 changes: 6 additions & 2 deletions pandas/io/parsers/base_parser.py
@@ -49,6 +49,7 @@
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.astype import astype_nansafe
from pandas.core.dtypes.cast import maybe_cast_to_integer_array
from pandas.core.dtypes.common import (
ensure_object,
is_bool_dtype,
@@ -844,8 +845,11 @@ def _cast_types(self, values: ArrayLike, cast_type: DtypeObj, column) -> ArrayLi
values = values.astype(cast_type, copy=False)
else:
try:
values = astype_nansafe(values, cast_type, copy=True, skipna=True)
except ValueError as err:
if is_integer_dtype(cast_type):
values = maybe_cast_to_integer_array(values, cast_type, copy=True)
else:
values = astype_nansafe(values, cast_type, copy=True, skipna=True)
except (ValueError, OverflowError) as err:
raise ValueError(
f"Unable to convert column {column} to type {cast_type}"
) from err
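
A sketch of what routing integer dtypes through the checked cast buys the python engine: lossy float coercion and out-of-range values now surface as exceptions instead of silently truncating (hypothetical helper; the real maybe_cast_to_integer_array works on ndarrays):

```python
def cast_column_to_int(values, bounds):
    # values: tokens the python engine has already parsed into Python numbers
    # bounds: (lo, hi) of the requested integer dtype, e.g. (0, 255) for uint8
    lo, hi = bounds
    result = []
    for v in values:
        i = int(v)
        if i != v:  # e.g. 0.1: the lossy float coercion that used to pass silently
            raise ValueError("Unable to convert column: lossy float coercion")
        if not lo <= i <= hi:  # overflow for the requested dtype
            raise OverflowError("Unable to convert column: integer overflow")
        result.append(i)
    return result

print(cast_column_to_int([0, 0.0], (0, 255)))  # [0, 0] -- lossless, accepted
```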
99 changes: 98 additions & 1 deletion pandas/tests/io/parser/common/test_ints.py
@@ -12,6 +12,11 @@
Series,
)
import pandas._testing as tm
from pandas.api.types import (
is_extension_array_dtype,
is_unsigned_integer_dtype,
pandas_dtype,
)

# GH#43650: Some expected failures with the pyarrow engine can occasionally
# cause a deadlock instead, so we skip these instead of xfailing
@@ -110,6 +115,98 @@ def test_integer_overflow_bug(all_parsers, sep):
tm.assert_frame_equal(result, expected)


def _iinfo(dtype):
pdtype = pandas_dtype(dtype)
iinfo = np.iinfo(pdtype.type if is_extension_array_dtype(dtype) else pdtype)
return iinfo


@skip_pyarrow
@pytest.mark.parametrize(
"getval",
[
(lambda dtype: _iinfo(dtype).max),
(lambda dtype: _iinfo(dtype).min),
],
)
def test_integer_limits_with_user_dtype(all_parsers, any_int_dtype, getval):
dtype = any_int_dtype
parser = all_parsers
val = getval(dtype)
data = f"A\n{val}"

result = parser.read_csv(StringIO(data), dtype=dtype)
expected_result = DataFrame({"A": [val]}, dtype=dtype)
tm.assert_frame_equal(result, expected_result)


@skip_pyarrow
@pytest.mark.parametrize(
"getval",
[
(lambda dtype: _iinfo(dtype).max + 1),
(lambda dtype: _iinfo(dtype).min - 1),
],
)
def test_integer_overflow_with_user_dtype(all_parsers, any_int_dtype, getval):
# see GH-47167
dtype = any_int_dtype
parser = all_parsers
val = getval(dtype)
data = f"A\n{val}"

expected = pytest.raises( # noqa: PDF010
(OverflowError, TypeError, ValueError),
match="|".join(
[
"Overflow",
"cannot safely cast non-equivalent",
"Integer out of range",
"Unable to convert column",
"The elements provided in the data cannot all be casted to the dtype",
]
),
)

# Specific case has intended behavior only after deprecation from #41734 becomes
# enforced. Until then, only expect a FutureWarning.
if (
(parser.engine == "python")
and (not is_extension_array_dtype(dtype))
and (dtype < np.dtype("int64"))
and not (is_unsigned_integer_dtype(dtype) and (val < 0))
):
expected = tm.assert_produces_warning(
FutureWarning,
match=f"Values are too large to be losslessly cast to {np.dtype(dtype)}.",
check_stacklevel=False,
)

with expected:
parser.read_csv(StringIO(data), dtype=dtype)


@skip_pyarrow
def test_integer_from_float_lossless(all_parsers, any_int_dtype):
dtype = any_int_dtype
parser = all_parsers
data = "A\n0\n0.0"

result = parser.read_csv(StringIO(data), dtype=dtype)
expected_result = DataFrame({"A": [0, 0]}, dtype=dtype)
tm.assert_frame_equal(result, expected_result)


@skip_pyarrow
def test_integer_from_float_lossy(all_parsers, any_int_dtype):
dtype = any_int_dtype
parser = all_parsers
data = "A\n0\n0.1"

with pytest.raises((TypeError, ValueError), match=None):
parser.read_csv(StringIO(data), dtype=dtype)


def test_int64_min_issues(all_parsers):
# see gh-2599
parser = all_parsers
Expand Down Expand Up @@ -170,7 +267,7 @@ def test_int64_overflow(all_parsers, conv):
)
def test_int64_uint64_range(all_parsers, val):
# These numbers fall right inside the int64-uint64
# range, so they should be parsed as string.
# range, so they should be parsed as integer value.
parser = all_parsers
result = parser.read_csv(StringIO(str(val)), header=None)
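
The _iinfo helper used by the tests above reduces to these dtype bounds; a numpy-free sketch for reference (hypothetical helper):

```python
def iinfo_bounds(bits, signed):
    # min/max of an integer dtype, e.g. iinfo_bounds(8, signed=False) for uint8
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(iinfo_bounds(8, signed=False))  # (0, 255)
print(iinfo_bounds(64, signed=True))  # (-9223372036854775808, 9223372036854775807)
```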

2 changes: 1 addition & 1 deletion pandas/tests/io/parser/test_read_fwf.py
@@ -556,7 +556,7 @@ def test_variable_width_unicode():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("dtype", [{}, {"a": "float64", "b": str, "c": "int32"}])
@pytest.mark.parametrize("dtype", [{}, {"a": "float64", "b": str, "c": "float16"}])
def test_dtype(dtype):
data = """ a b c
1 2 3.2
24 changes: 23 additions & 1 deletion pandas/tests/io/parser/test_textreader.py
@@ -13,8 +13,12 @@
import pandas._libs.parsers as parser
from pandas._libs.parsers import TextReader

from pandas import DataFrame
from pandas import (
DataFrame,
array,
)
import pandas._testing as tm
from pandas.api.types import is_extension_array_dtype

from pandas.io.parsers import (
TextFileReader,
@@ -125,6 +129,24 @@ def test_integer_thousands_alt(self):
expected = DataFrame([123456, 12500])
tm.assert_frame_equal(result, expected)

def test_integer_overflow_with_user_dtype(self, any_int_dtype):
dtype = ensure_dtype_objs(any_int_dtype)
is_ext_dtype = is_extension_array_dtype(dtype)
maxint = np.iinfo(dtype.type if is_ext_dtype else dtype).max

reader = TextReader(StringIO(f"{maxint}"), header=None, dtype=dtype)
result = reader.read()
if is_ext_dtype:
expected = array([maxint], dtype=dtype)
tm.assert_extension_array_equal(result[0], expected)
else:
expected = np.array([maxint], dtype=dtype)
tm.assert_numpy_array_equal(result[0], expected)

reader = TextReader(StringIO(f"{maxint + 1}"), header=None, dtype=dtype)
with pytest.raises((OverflowError, TypeError, ValueError), match=None):
reader.read()

def test_skip_bad_lines(self, capsys):
# too many lines, see #2430 for why
data = "a:b:c\nd:e:f\ng:h:i\nj:k:l:m\nl:m:n\no:p:q:r"