-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Don't overflow in DataFrame init #18624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Don't overflow in DataFrame init #18624
Conversation
I assume it'll be fine, but could you do a quick performance check on ints smaller than int64 max to make sure things look OK? |
3a53c7d
to
9d39610
Compare
@@ -262,4 +262,4 @@ Other | |||
- Fixed a bug where creating a Series from an array that contains both tz-naive and tz-aware values will result in a Series whose dtype is tz-aware instead of object (:issue:`16406`) | |||
- Fixed construction of a :class:`Series` from a ``dict`` containing ``NaN`` as key (:issue:`18480`) | |||
- Adding a ``Period`` object to a ``datetime`` or ``Timestamp`` object will now correctly raise a ``TypeError`` (:issue:`17983`) | |||
- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to conversion (prob should move some of the other ones appropriately as well) but other PR for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
pandas/_libs/src/inference.pyx
Outdated
@@ -1263,7 +1263,7 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0, | |||
if not seen.null_: | |||
seen.saw_int(int(val)) | |||
|
|||
if seen.uint_ and seen.sint_: | |||
if (seen.uint_ and seen.sint_) or val > oUINT64_MAX: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update the doc-string as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
9d39610
to
f9c8991
Compare
@TomAugspurger : Didn't seen any perf degradation, which makes sense since the changes are just new |
f9c8991
to
5f5d332
Compare
Codecov Report
@@ Coverage Diff @@
## master #18624 +/- ##
==========================================
- Coverage 91.6% 91.58% -0.02%
==========================================
Files 153 153
Lines 51253 51253
==========================================
- Hits 46950 46941 -9
- Misses 4303 4312 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #18624 +/- ##
==========================================
+ Coverage 91.58% 91.58% +<.01%
==========================================
Files 153 153
Lines 51250 51250
==========================================
+ Hits 46935 46938 +3
+ Misses 4315 4312 -3
Continue to review full report at Codecov.
|
All green. @TomAugspurger @jreback PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also check the case of large negative integers ?
def test_constructor_overflow_uint64(self): | ||
# see gh-18584 | ||
values = np.array([2**64], dtype=object) | ||
result = DataFrame(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test already passes on master, it was when you didn't already have an object array that it failed:
In [6]: pd.DataFrame(np.array([2**64], dtype=object))
Out[6]:
0
0 18446744073709551616
In [7]: pd.DataFrame([2**64])
...
OverflowError: Python int too large to convert to C unsigned long
(but keep also this one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also check the case of large negative integers ?
We do now. 😄
Added new tests as well.
5f5d332
to
30cee48
Compare
30cee48
to
825bdcf
Compare
@@ -196,8 +196,10 @@ def test_constructor_overflow_int64(self): | |||
assert df_crawls['uid'].dtype == np.uint64 | |||
|
|||
@pytest.mark.parametrize("values", [np.array([2**64], dtype=object), | |||
np.array([2**64]), [2**64]]) | |||
def test_constructor_overflow_uint64(self, values): | |||
np.array([2**65]), [2**64 + 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.array([2**65])
should be the same as the np.array([2**64], dtype=object)
case ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be but just testing out different values and initialization of large data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean
In [70]: np.array([2**64], dtype=object)
Out[70]: array([18446744073709551616], dtype=object)
In [71]: np.array([2**64])
Out[71]: array([18446744073709551616], dtype=object)
is twice the same you test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No? They're two different values. Just trying out two different large numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah sorry, didn't see the difference in the power term
structure lgtm. I think @jorisvandenbossche has a comment. |
For integers larger than what uint64 can handle, we gracefully default to the object dtype instead of overflowing. Closes pandas-devgh-18584.
For integers smaller than what int64 can handle, we gracefully default to the object dtype instead of overflowing.
825bdcf
to
9d5abd3
Compare
@jreback @jorisvandenbossche @TomAugspurger : All comments have been addressed, and all is green. PTAL. |
Thanks! |
For integers larger than what uint64 can handle (or smaller than what int64 can handle), we gracefully default to the object dtype instead of overflowing.
Closes #18584.