-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -1032,8 +1032,7 @@ Reshaping | |||
- Bug in ``pd.concat()`` in which concatting with an empty dataframe with ``join='inner'`` was being improperly handled (:issue:`15328`) | |||
- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`) | |||
|
|||
Numeric | |||
^^^^^^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls cleanly add things (don't delete)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. None of the changes should be in your PR given what you are focusing on. Don't forget to add ones for your actual patches though!
pandas/core/series.py
Outdated
@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path): | |||
subarr = maybe_cast_to_datetime(arr, dtype) | |||
if not is_extension_type(subarr): | |||
subarr = np.array(subarr, dtype=dtype, copy=copy) | |||
|
|||
# Raises if coersion from unsigned to signed with neg data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use is_unsigned_integer_dtype
and is_float_dtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, coercion
instead of coersion
(you do this later on as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any changes though?
pandas/core/series.py
Outdated
raise OverflowError | ||
|
||
# Raises if coersion from float to int with fraction data | ||
inferred_type, _ = infer_dtype_from_array(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just check the dtype you don't need to infer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/series.py
Outdated
@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path): | |||
subarr = maybe_cast_to_datetime(arr, dtype) | |||
if not is_extension_type(subarr): | |||
subarr = np.array(subarr, dtype=dtype, copy=copy) | |||
|
|||
# Raises if coersion from unsigned to signed with neg data | |||
if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(subarr]<0).any()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -831,3 +831,20 @@ def test_constructor_cast_object(self): | |||
s = Series(date_range('1/1/2000', periods=10), dtype=object) | |||
exp = Series(date_range('1/1/2000', periods=10)) | |||
tm.assert_series_equal(s, exp) | |||
|
|||
def test_overflow_coersion_signed_to_unsigned(self): | |||
Series([-1], dtype='uint8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
pytest.raises(OverflowError):
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more specific with regards to what @jreback said, here's an example:
with pytest.raises(OverflowError):
Series([-1], dtype='uint16')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any changes though?
pandas/core/series.py
Outdated
# Raises if coersion from unsigned to signed with neg data | ||
if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0: | ||
raise OverflowError | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add helpful messages to these raises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Seems like you were able to successfully do that. Also, don't forget to make similar patches to |
This is not necessary. We will squash when merging. If it makes it cleaner for you, then sure, |
@ucals did you push your changes? |
Ops, forgot to push.... Pushing it right away, 1 sec |
pandas/core/series.py
Outdated
if not is_list_like(d2): | ||
d2 = [data] | ||
# Raises if coercion from unsigned to signed with neg data | ||
if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is python any and will be non-performant, use what I gave before
(d2<0).any()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done... but this makes the test break, leading to this error:
TypeError: '<' not supported between instances of 'list' and 'int'
pandas/core/series.py
Outdated
if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2): | ||
raise OverflowError( | ||
"Trying to coerce negative values to unsigned " | ||
"integers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
negative integers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/series.py
Outdated
"integers") | ||
|
||
# Raises if coercion from float to int with fraction data | ||
if any([is_float_dtype(type(x)) for x in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a 1-d array, you don't need any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/series.py
Outdated
@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path): | |||
subarr = maybe_cast_to_datetime(arr, dtype) | |||
if not is_extension_type(subarr): | |||
subarr = np.array(subarr, dtype=dtype, copy=copy) | |||
|
|||
d2 = data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this section should be only if dtype
is not None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path): | |||
subarr = maybe_cast_to_datetime(arr, dtype) | |||
if not is_extension_type(subarr): | |||
subarr = np.array(subarr, dtype=dtype, copy=copy) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to move this entire section (e.g. _try_cast
) to pandas/types/cast
. but I will do that after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want to move part/all of this ok with that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/indexes/test_base.py
Outdated
@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self): | |||
pd.TimedeltaIndex(list(values), dtype=dtype)]: | |||
tm.assert_index_equal(res, idx) | |||
|
|||
def test_constructor_overflow_coersion_signed_to_unsigned(self): | |||
with pytest.raises(OverflowError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/indexes/test_base.py
Outdated
@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self): | |||
pd.TimedeltaIndex(list(values), dtype=dtype)]: | |||
tm.assert_index_equal(res, idx) | |||
|
|||
def test_constructor_overflow_coersion_signed_to_unsigned(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the dtypes in a loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert that this works correct for int (1 dtype is ok)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/indexes/test_base.py
Outdated
with pytest.raises(OverflowError): | ||
Index([-1], dtype='uint64') | ||
|
||
def test_constructor_overflow_coersion_float_to_int(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check all int & unsigned int dtypes (use a loop)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert that this works for float
@@ -831,3 +832,20 @@ def test_constructor_cast_object(self): | |||
s = Series(date_range('1/1/2000', periods=10), dtype=object) | |||
exp = Series(date_range('1/1/2000', periods=10)) | |||
tm.assert_series_equal(s, exp) | |||
|
|||
def test_constructor_overflow_coersion_signed_to_unsigned(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put dtypes in a loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -236,7 +237,7 @@ def test_constructor_corner(self): | |||
tm.assertIsInstance(s, Series) | |||
|
|||
def test_constructor_sanitize(self): | |||
s = Series(np.array([1., 1., 8.]), dtype='i8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert that the original raises
see my comment at the end
This should actually work. We allow casting to a compat dtype when there is no loss of precision.
The counter example of course are: This should raises (as this loses precision).
This is good. (pls just confirm that our error messages are the same for Index/Series).
|
Hey @jreback , sorry for the past days, got overloaded in my work. But now I can resume the work on this PR. I'm having some difficulties with the code, can you help me? Specifically:
Can you help me? Pls let me know (I can submit the code as is now, all tests are ok and your other comments are ok as well) |
@ucals make sure everything is pushed up (looks like you updated comments, but code is older I think). I will take a look tomorrow. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -1043,9 +1043,6 @@ Numeric | |||
- Bug in ``pandas.tools.utils.cartesian_product()`` with large input can cause overflow on windows (:issue:`15265`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
|
|||
Other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ucals these changes need to be reverted (aside from your 1 line addition)
pandas/indexes/base.py
Outdated
@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
pass | |||
# other iterable of some kind | |||
subarr = _asarray_tuplesafe(data, dtype=object) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these don't below here at all, you can put them in numeric
pandas/tests/indexes/test_base.py
Outdated
@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self): | |||
pd.TimedeltaIndex(list(values), dtype=dtype)]: | |||
tm.assert_index_equal(res, idx) | |||
|
|||
def test_constructor_overflow_coercion_signed_to_unsigned(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be tested in test_numeric.py
pandas/types/cast.py
Outdated
@@ -983,3 +984,60 @@ def find_common_type(types): | |||
return np.object | |||
|
|||
return np.find_common_type(types, []) | |||
|
|||
|
|||
def try_cast(arr, take_fast_path, data, dtype=None, copy=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signature should be
def try_cast(arr, dtype=None, copy=False, raise_cast_failure=False, fastpath=True)
pandas/types/cast.py
Outdated
def try_cast(arr, take_fast_path, data, dtype=None, copy=False, | ||
raise_cast_failure=False): | ||
""" try casting the array using dtype. Used in _sanitize_array procedure | ||
at Series initialization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to maybe_cast_to_array
pandas/types/cast.py
Outdated
try: | ||
subarr = maybe_cast_to_datetime(arr, dtype) | ||
if not is_extension_type(subarr): | ||
subarr = np.array(subarr, dtype=dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simply return if its an extension_type
pandas/types/cast.py
Outdated
subarr = np.array(subarr, dtype=dtype, copy=copy) | ||
|
||
if dtype is not None: | ||
d2 = data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need any of this is_list_like stuff, by definition this is already an array
pandas/types/cast.py
Outdated
if copy or not is_integer_dtype(data): | ||
assert_safe_casting(data, subarr) | ||
|
||
if any([is_float_dtype(type(x)) for x in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this is much simpler, it is already an array
pandas/types/cast.py
Outdated
raise OverflowError( | ||
"Trying to coerce float values to integers") | ||
|
||
except (ValueError, TypeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this part of the except you can simply put around the subarr = np.array(subarr, dtype=dtype, copy=copy)
above.
The dtype checks need only occur after this.
pandas/types/cast.py
Outdated
def assert_safe_casting(data, subarr): | ||
""" | ||
Ensure incoming data can be represented as ints. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this routine is all over the place (in Index), will want to simply call this one.
pandas/core/series.py
Outdated
@@ -2827,19 +2810,19 @@ def _try_cast(arr, take_fast_path): | |||
# possibility of nan -> garbage | |||
if is_float_dtype(data.dtype) and is_integer_dtype(dtype): | |||
if not isnull(data).any(): | |||
subarr = _try_cast(data, True) | |||
subarr = try_cast(data, True, data=data, dtype=dtype, copy=copy, raise_cast_failure=raise_cast_failure) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these lines are way too long so the linter will complain
pandas/indexes/base.py
Outdated
@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
pass | |||
# other iterable of some kind | |||
subarr = _asarray_tuplesafe(data, dtype=object) | |||
|
|||
d2 = data | |||
if not is_list_like(d2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need any of this is_list_like
subarr
is already an array.
pandas/indexes/base.py
Outdated
d2 = data | ||
if not is_list_like(d2): | ||
d2 = [data] | ||
# Raises if coercion from unsigned to signed with neg data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these 2 routines can actually live in pandas.types.cast
, esentially all of this boils down to calling.
assert_safe_casting(subarr, dtype)
pandas/types/cast.py
Outdated
|
||
|
||
def assert_safe_casting(data, subarr): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a doc-string. the signature should be
assert_safe_casting(arr, dtype)
Hey @jreback , after trying and changing a lot over the past days, I'm starting over, fresh from master... will do a push today |
can you rebase / update |
Sure, @jreback , I will push current status tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot change tests that don't work. there is still something odd with the logic. as I said there is a case that is allowed that you are raising on.
pandas/core/dtypes/cast.py
Outdated
|
||
def maybe_cast_to_integer(arr, dtype): | ||
""" | ||
Find a common data type among the given dtypes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be more explicit about what this does. It takes an integer dtype and returns the casted version, raising for an incompat dtype.
pandas/core/dtypes/cast.py
Outdated
|
||
Parameters | ||
---------- | ||
arr : array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ndarray
np.dtype
pandas/core/dtypes/cast.py
Outdated
|
||
Returns | ||
------- | ||
integer or unsigned integer array (or raise if the dtype is incompatible) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a Raises section
pandas/core/dtypes/cast.py
Outdated
|
||
""" | ||
|
||
if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are ndarays, so you don't need the np.asarray(arr)
(though you can do it but it would be before any checks (it doesn't copy so its fine).
pandas/core/dtypes/cast.py
Outdated
|
||
if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any(): | ||
raise OverflowError("Trying to coerce negative values to negative " | ||
"integers") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this then needs to cast, no?
e.g. (before the return)
arr = arr.astype(dtype, copy=False)
exp_index = pd.Index([0, 1, 2, 3, 1.1]) | ||
self._assert_setitem_index_conversion(obj, 1.1, exp_index, np.float64) | ||
with pytest.raises(OverflowError): | ||
exp_index = pd.Index([0, 1, 2, 3, 1.1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should not raise, rather the results coerce
@@ -373,8 +375,9 @@ def test_insert_index_int64(self): | |||
self._assert_insert_conversion(obj, 1, exp, np.int64) | |||
|
|||
# int + float -> float | |||
exp = pd.Index([1, 1.1, 2, 3, 4]) | |||
self._assert_insert_conversion(obj, 1.1, exp, np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, this is a legit operation
@@ -622,7 +625,8 @@ def test_where_series_int64(self): | |||
self._where_int64_common(pd.Series) | |||
|
|||
def test_where_index_int64(self): | |||
self._where_int64_common(pd.Index) | |||
with pytest.raises(OverflowError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, remove the tests changes here
pandas/tests/io/test_pytables.py
Outdated
result = expected.sort_index() | ||
tm.assert_series_equal(result, expected) | ||
with pytest.raises(OverflowError): | ||
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? why are you having this raise. this is a legitimate operation. pls revert this.
|
||
def test_constructor_overflow_coercion_float_to_int(self): | ||
# GH 15832 | ||
with pytest.raises(OverflowError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, move with related tests & test all uint & int dtypes.
assert that this works with float32/float64.
Hey @jreback , thanks for the comments. I just implemented all your comments, with exception of "assert that this works with float32/float64". Can you pls show me how? |
If we are going to raise on [3], then assert that [2] does not.
|
Just double-checking: looking in the issue, I understood that |
would be weird if these didn't act the same |
Yes, that's what I thought. That's why we wrote in # GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
with pytest.raises(ValueError):
Series([1, 2, 3.5], dtype=t) This test passes. However, it was written in # check with mixed dtypes
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']])) If I don't change this test to expect a ValueError for the 4 int types, it will never pass, because it's the same case. These 2 cases are the same. How to differentiate those? Thanks |
should also raise for alll ints as well
this is wrong, change it to something like this
|
@jreback , just implemented all changes, and all tests passed :) |
so, any feedback? |
pandas/core/indexes/base.py
Outdated
@@ -248,6 +252,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
msg = str(e) | |||
if 'cannot convert float' in msg: | |||
raise | |||
if 'Trying to coerce float values to integer' in msg: | |||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do an or
statement? That seems more compact.
pandas/tests/indexes/test_numeric.py
Outdated
@@ -304,6 +304,18 @@ def test_astype(self): | |||
i = Float64Index([0, 1.1, np.NAN]) | |||
pytest.raises(ValueError, lambda: i.astype(dtype)) | |||
|
|||
# GH 15832 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this in a separate test
pandas/tests/indexes/test_numeric.py
Outdated
try: | ||
for t in ['float16', 'float32']: | ||
Index([1, 2, 3.5], dtype=t) | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to catch the exception
rather compare against an expected index
try: | ||
for t in ['float16', 'float32']: | ||
Series([1, 2, 3.5], dtype=t) | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
def test_constructor_dtype_nocast(self): | ||
# 1572 | ||
s = Series([1, 2, 3]) | ||
s = Series([1, 2, 3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dupe?
looks pretty good |
pandas/tests/indexes/test_numeric.py
Outdated
@@ -678,6 +690,12 @@ def test_constructor_corner(self): | |||
with tm.assert_raises_regex(TypeError, 'casting'): | |||
Int64Index(arr_with_floats) | |||
|
|||
def test_constructor_overflow_coercion_signed_to_unsigned(self): | |||
# GH 15832 | |||
for t in ['uint8', 'uint16', 'uint32', 'uint64']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's utilize pytest.mark.parametrize
because we can now 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ucals : This one has not yet been fully addressed (see test_numeric.py
).
@@ -303,9 +299,27 @@ def test_constructor_pass_nan_nat(self): | |||
def test_constructor_cast(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize pytest.mark.parametrize
for cleaner tests.
@pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
...
@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self:
...
Also, I prefer if we can use tm.assert_raises_regex
to specify specific error messages instead of just testing for the Exception
type. That's a stronger test IMO. The test above:
pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)
I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ucals : You broke up the tests, but your new tests still use pytest.raises
when tm.assert_raises_regex
would be more useful.
* If ``dtype`` is incompatible | ||
ValueError | ||
* If coercion from float to integer loses precision | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the spirit of good documentation, let's add some examples here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! Nice examples.
pandas/tests/indexes/test_numeric.py
Outdated
Index([1, 2, 3.5], dtype=integers) | ||
|
||
i = Index([1, 2, 3.5], dtype=floats) | ||
assert i.equals(Index([1, 2, 3.5])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tm.assert_index_equal(i, Index([1, 2, 3.5]))
is better for testing purposes.
with tm.assert_raises_regex(OverflowError, msg): | ||
Series([-1], dtype=unsigned_integers) | ||
|
||
@pytest.mark.parametrize("integers", ['uint8', 'uint16', 'uint32', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use some more informative names like int_dtype
(and float_dtype
, etc. for tests where you parametrized on dtype
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to tighten up the guarantees of this function, even though its private. and be a bit more explicit.
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -54,7 +54,7 @@ Backwards incompatible API changes | |||
|
|||
Other API Changes | |||
^^^^^^^^^^^^^^^^^ | |||
|
|||
- Series and Index constructors now raises when data is incompatible with dtype (:issue:`15832`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with a passed dtype=
kwarg.
pandas/core/dtypes/cast.py
Outdated
def maybe_cast_to_integer(arr, dtype): | ||
""" | ||
Takes an integer dtype and returns the casted version, raising for an | ||
incompatible dtype. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag (even though this is private, nice to have)
pandas/core/dtypes/cast.py
Outdated
|
||
|
||
def maybe_cast_to_integer(arr, dtype): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually can take any dtype and will return a casted version. It happens to check for integer/unsigned integer dtypes. So pls expand a little.
pandas/core/dtypes/cast.py
Outdated
"integers") | ||
elif is_integer_dtype(dtype) and (is_float_dtype(arr) or | ||
is_object_dtype(arr)): | ||
if not (arr == arr.astype(dtype)).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do
casted = arr.astype(dtype, copy=False)
if (arr == casted).all():
return casted
raise ValueError(......)
a bit more code but then avoid coercion twice.
pandas/core/indexes/base.py
Outdated
@@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
if is_integer_dtype(dtype): | |||
inferred = lib.infer_dtype(data) | |||
if inferred == 'integer': | |||
data = maybe_cast_to_integer(data, dtype=dtype) | |||
data = np.array(data, copy=copy, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need the np.array
casting line now
pandas/core/dtypes/cast.py
Outdated
@@ -1026,3 +1027,52 @@ def find_common_type(types): | |||
return np.object | |||
|
|||
return np.find_common_type(types, []) | |||
|
|||
|
|||
def maybe_cast_to_integer(arr, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to maybe_cast_to_integer_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : Perhaps we could just generalize to accept scalar as well as array
(as consistent with other casting functions in this module) and keep the name as is?
pandas/core/indexes/base.py
Outdated
data = np.array(data, copy=copy, dtype=dtype) | ||
elif inferred in ['floating', 'mixed-integer-float']: | ||
if isnull(data).any(): | ||
raise ValueError('cannot convert float ' | ||
'NaN to integer') | ||
if inferred == 'mixed-integer-float': | ||
maybe_cast_to_integer(data, dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be assigned?
pandas/core/indexes/base.py
Outdated
@@ -246,7 +250,9 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
|
|||
except (TypeError, ValueError) as e: | |||
msg = str(e) | |||
if 'cannot convert float' in msg: | |||
if 'cannot convert float' in msg or 'Trying to coerce ' \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do
if ('cannot convert float' in msg' or
'.........' in msg):
raise
don't use \
if at all possible
pandas/core/dtypes/cast.py
Outdated
if not (arr == arr.astype(dtype)).all(): | ||
raise ValueError("Trying to coerce float values to integers") | ||
|
||
return arr.astype(dtype, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you should cast here. This will coerce non-integer arrays to the passed dtype, which may not be nice.
instead branch on each of the if's (the integer and unsigned cases) and do an astype or raise).
so this will then explicity only work on integer dtypes that are passed, and not implicity on anything else
can you rebase and update? |
Hey @jreback , sorry for the delay, a lot to do in work until end of Q2. I believe I will be able to look it again by 1st week of July, is that ok? Thanks! |
closing as stale. this idea / fix looks pretty good though. so if you want to update, pls comment. |
git diff upstream/master --name-only -- '*.py' | flake8 --diff
Hey @jreback , need a quick help here. I was able to raise an overflow exception (case II), but I wasn't able to raise it on case III. Can you help? Thanks