-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 23 commits
676a4e5
e12bca7
9fc617b
8b463cb
43456a5
faa5c5c
1c90e7e
278c2fb
a8cd752
bbdea4b
d2e26ac
3c868a4
20ac5c6
14ed83b
3d0e76f
1726408
1f8e9b7
83cfc5d
417188a
939ae11
86e9d5e
359086d
50950f5
012fb57
35a5ff1
b1e6632
a1033cb
b78f4cc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ | |
is_timedelta64_dtype, is_dtype_equal, | ||
is_float_dtype, is_complex_dtype, | ||
is_integer_dtype, | ||
is_unsigned_integer_dtype, | ||
is_datetime_or_timedelta_dtype, | ||
is_bool_dtype, is_scalar, | ||
_string_dtypes, | ||
|
@@ -1026,3 +1027,37 @@ def find_common_type(types): | |
return np.object | ||
|
||
return np.find_common_type(types, []) | ||
|
||
|
||
def maybe_cast_to_integer(arr, dtype): | ||
""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This actually can take any dtype and will return a casted version. It happens to check for integer/unsigned integer dtypes. So pls expand a little. |
||
Takes an integer dtype and returns the casted version, raising for an | ||
incompatible dtype. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add a versionadded tag (even though this is private, nice to have) |
||
|
||
Parameters | ||
---------- | ||
arr : ndarray | ||
dtype : np.dtype | ||
|
||
Returns | ||
------- | ||
integer or unsigned integer array | ||
|
||
Raises | ||
------ | ||
OverflowError | ||
* If ``dtype`` is incompatible | ||
ValueError | ||
* If coercion from float to integer loses precision | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the spirit of good documentation, let's add some examples here! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Excellent! Nice examples. |
||
""" | ||
|
||
if is_unsigned_integer_dtype(dtype) and (arr < 0).any(): | ||
raise OverflowError("Trying to coerce negative values to negative " | ||
"integers") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this then needs to cast, no? e.g. (before the return)
|
||
elif is_integer_dtype(dtype) and (is_float_dtype(arr) or | ||
is_object_dtype(arr)): | ||
if not (arr == arr.astype(dtype)).all(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do
a bit more code but then avoid coercion twice. |
||
raise ValueError("Trying to coerce float values to integers") | ||
|
||
return arr.astype(dtype, copy=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think you should cast here. This will coerce non-integer arrays to the passed dtype, which may not be nice. instead branch on each of the if's (the integer and unsigned cases) and do an astype or raise). so this will then explicity only work on integer dtypes that are passed, and not implicity on anything else |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,6 +15,7 @@ | |
|
||
from pandas.core.dtypes.generic import ABCSeries, ABCMultiIndex, ABCPeriodIndex | ||
from pandas.core.dtypes.missing import isnull, array_equivalent | ||
from pandas.core.dtypes.cast import maybe_cast_to_integer | ||
from pandas.core.dtypes.common import ( | ||
_ensure_int64, | ||
_ensure_object, | ||
|
@@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |
if is_integer_dtype(dtype): | ||
inferred = lib.infer_dtype(data) | ||
if inferred == 'integer': | ||
data = maybe_cast_to_integer(data, dtype=dtype) | ||
data = np.array(data, copy=copy, dtype=dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we don't need the |
||
elif inferred in ['floating', 'mixed-integer-float']: | ||
if isnull(data).any(): | ||
raise ValueError('cannot convert float ' | ||
'NaN to integer') | ||
if inferred == 'mixed-integer-float': | ||
maybe_cast_to_integer(data, dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does this need to be assigned? |
||
|
||
# If we are actually all equal to integers, | ||
# then coerce to integer. | ||
|
@@ -248,6 +252,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |
msg = str(e) | ||
if 'cannot convert float' in msg: | ||
raise | ||
if 'Trying to coerce float values to integer' in msg: | ||
raise | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. elif There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just do an |
||
|
||
# maybe coerce to a sub-class | ||
from pandas.core.indexes.period import ( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -35,7 +35,8 @@ | |
from pandas.core.dtypes.cast import ( | ||
maybe_upcast, infer_dtype_from_scalar, | ||
maybe_convert_platform, | ||
maybe_cast_to_datetime, maybe_castable) | ||
maybe_cast_to_datetime, maybe_castable, | ||
maybe_cast_to_integer) | ||
from pandas.core.dtypes.missing import isnull, notnull | ||
|
||
from pandas.core.common import (is_bool_indexer, | ||
|
@@ -2916,9 +2917,13 @@ def _try_cast(arr, take_fast_path): | |
return arr | ||
|
||
try: | ||
if is_float_dtype(dtype) or is_integer_dtype(dtype): | ||
subarr = maybe_cast_to_integer(np.asarray(arr), dtype) | ||
|
||
subarr = maybe_cast_to_datetime(arr, dtype) | ||
if not is_extension_type(subarr): | ||
subarr = np.array(subarr, dtype=dtype, copy=copy) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am going to move this entire section (e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if you want to move part/all of this ok with that as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
except (ValueError, TypeError): | ||
if is_categorical_dtype(dtype): | ||
subarr = Categorical(arr) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -304,6 +304,18 @@ def test_astype(self): | |
i = Float64Index([0, 1.1, np.NAN]) | ||
pytest.raises(ValueError, lambda: i.astype(dtype)) | ||
|
||
# GH 15832 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. put this in a separate test |
||
for t in ['uint8', 'uint16', 'uint32', 'uint64', 'int32', 'int64', | ||
'int16', 'int8']: | ||
with pytest.raises(ValueError): | ||
Index([1, 2, 3.5], dtype=t) | ||
|
||
try: | ||
for t in ['float16', 'float32']: | ||
Index([1, 2, 3.5], dtype=t) | ||
except ValueError: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you don't need to catch the exception |
||
pytest.fail("GH 15832 should not raise for float type") | ||
|
||
def test_equals_numeric(self): | ||
|
||
i = Float64Index([1.0, 2.0]) | ||
|
@@ -678,6 +690,12 @@ def test_constructor_corner(self): | |
with tm.assert_raises_regex(TypeError, 'casting'): | ||
Int64Index(arr_with_floats) | ||
|
||
def test_constructor_overflow_coercion_signed_to_unsigned(self): | ||
# GH 15832 | ||
for t in ['uint8', 'uint16', 'uint32', 'uint64']: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's utilize There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ucals : This one has not yet been fully addressed (see |
||
with pytest.raises(OverflowError): | ||
Index([-1], dtype=t) | ||
|
||
def test_coerce_list(self): | ||
# coerce things | ||
arr = Index([1, 2, 3, 4]) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,26 @@ | ||
# coding=utf-8 | ||
# pylint: disable-msg=E1101,W0612 | ||
|
||
import pytest | ||
|
||
from datetime import datetime, timedelta | ||
|
||
from numpy import nan | ||
import numpy as np | ||
import numpy.ma as ma | ||
import pandas as pd | ||
|
||
from pandas.core.dtypes.common import ( | ||
is_categorical_dtype, | ||
is_datetime64tz_dtype) | ||
import pytest | ||
from numpy import nan | ||
from pandas import (Index, Series, isnull, date_range, | ||
NaT, period_range, MultiIndex, IntervalIndex) | ||
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex | ||
from pandas import compat | ||
from pandas.compat import lrange, range, zip, OrderedDict, long | ||
|
||
import pandas.util.testing as tm | ||
from pandas._libs import lib | ||
from pandas._libs.tslib import iNaT | ||
|
||
from pandas.compat import lrange, range, zip, OrderedDict, long | ||
from pandas import compat | ||
from pandas.core.dtypes.common import ( | ||
is_categorical_dtype, | ||
is_datetime64tz_dtype) | ||
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex | ||
from pandas.util.testing import assert_series_equal | ||
import pandas.util.testing as tm | ||
|
||
from .common import TestData | ||
|
||
|
||
|
@@ -303,9 +299,27 @@ def test_constructor_pass_nan_nat(self): | |
def test_constructor_cast(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize @pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
...
@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self:
... Also, I prefer if we can use pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float) I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ucals : You broke up the tests, but your new tests still use |
||
pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float) | ||
|
||
# GH 15832 | ||
for t in ['uint8', 'uint16', 'uint32', 'uint64']: | ||
with pytest.raises(OverflowError): | ||
Series([-1], dtype=t) | ||
|
||
# GH 15832 | ||
for t in ['uint8', 'uint16', 'uint32', 'uint64', 'int32', 'int64', | ||
'int16', 'int8']: | ||
with pytest.raises(ValueError): | ||
Series([1, 2, 3.5], dtype=t) | ||
|
||
try: | ||
for t in ['float16', 'float32']: | ||
Series([1, 2, 3.5], dtype=t) | ||
except ValueError: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
pytest.fail("GH 15832 should not raise for float type") | ||
|
||
def test_constructor_dtype_nocast(self): | ||
# 1572 | ||
s = Series([1, 2, 3]) | ||
s = Series([1, 2, 3]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. dupe? |
||
|
||
s2 = Series(s, dtype=np.int64) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to
maybe_cast_to_integer_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Perhaps we could just generalize to accept scalar as well as
array
(as consistent with other casting functions in this module) and keep the name as is?