Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BUG: Raise on parse int overflow #47167 #47168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Raise on parse int overflow #47167 #47168
Changes from 10 commits
8d0efca
27629f0
bfb0b89
661c853
ccb6f61
a3b458a
994a634
567fd58
43bcb22
61a36a5
927ddad
d992af5
498b93d
9e1fbbc
ef91ab5
270eb90
dd5cd0e
64047b4
d812c32
b1f83b9
8f4c947
a935ac9
0c9f4e8
8353cba
60b3018
3e5f929
ba40923
3d72cf2
3f39a5b
7cf208f
520cae3
88d8650
d96d6b0
3508a9f
2c16f74
485dcfc
5896e01
b276196
a1a6764
c9e8a92
39b5c91
92fab59
f653e96
8b406ab
8d782fb
09e6773
a545602
0439322
b392f32
fca428c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the pattern here is to _try_dtype then if that fails try another one, can you just do that here (rather than all of this if/then logic).
e.g. add a
_try_uint64
is neededThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to stay with the pattern
_try_int64
-> if fail ->_try_uint64
, but shortcut the two cases withuser_dtype
anddtype in ["int64", "uint64"]
, where we can fail after the specific _try_dtype. I added another_try_uint64
instead of thedo_try_uint64
. Is that okay?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I agree with @jreback, this is hard to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree and added a
_try_uint64
in commit 927ddad like @jreback suggested. Were you also referring to the former version withdo_try_uint64
@phofl ? The latest version is https://github.com/SandroCasagrande/pandas/blob/5896e017ca2960e0d535c8c0a0b9db978377bc91/pandas/_libs/parsers.pyx#L1182-L1198. Sorry, if I did not signify correctly that I performed changes and the newest version should be reviewed again. Can you please have a look @jreback or @phofl?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks expensive. Can you run asvs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks. I thought the same thing, when I saw exactly this check being applied for parsing (and more generally setting) with the nullable integer extension dtypes
pandas/pandas/core/arrays/integer.py
Lines 53 to 55 in 3bf2cb1
Just to clarify: The additional check is never performed when parsing with automatically inferred dtype (since TextReader.dtype_cast_order contains only int64 and no other integer dtypes), nor with user specified dtype int64. I just comitted another change that prevents unnecessarily running into this check when parsing with user specified dtype uint64.
The check then only affects parsing with user specified integer dtype of size < 64 bit. As far as I can see, none of the existing asvs covers this.
Just to be sure I ran some existing asvs that seemed most relevant and found no significant change:
I added a simple asv benchmark that triggers the relevant check. It is also not affected at all
A separate timeit on the comparison
(arr1 == arr2).all()
shows that it takes ~5µs compared to ~1ms of the totalread_csv
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some additional remarks: For running the asvs in io.csv I had to:
min_cython_ver
in setup.py)ReadCSVIndexCol.time_read_csv_index_col
(see 61a36a5)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel should
_dtype_can_hold_range
be used here to check int64 lossless conversion?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype_can_hold_range is specific to range objects. looks like we have an ndarray here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, comparing ndarrays here. I could change it to np.array_equal for readability, but apart from some safety boilerplate the same check is performed there: https://github.com/numpy/numpy/blob/50a74fb65fc752e77a2f9e9e2b7227629c2ba953/numpy/core/numeric.py#L2468