Skip to content

API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from

Conversation

ucals
Copy link

@ucals ucals commented Apr 1, 2017

Hey @jreback , need a quick help here. I was able to raise an overflow exception (case II), but I wasn't able to raise it on case III. Can you help? Thanks

@@ -1032,8 +1032,7 @@ Reshaping
- Bug in ``pd.concat()`` in which concatting with an empty dataframe with ``join='inner'`` was being improperly handled (:issue:`15328`)
- Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)

Numeric
^^^^^^^
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls cleanly add things (don't delete)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. None of the changes should be in your PR given what you are focusing on. Don't forget to add ones for your actual patches though!

@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path):
subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

# Raises if coersion from unsigned to signed with neg data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_unsigned_integer_dtype and is_float_dtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, coercion instead of coersion (you do this later on as well)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any changes though?

raise OverflowError

# Raises if coersion from float to int with fraction data
inferred_type, _ = infer_dtype_from_array(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just check the dtype you don't need to infer

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path):
subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

# Raises if coersion from unsigned to signed with neg data
if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(subarr]<0).any()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -831,3 +831,20 @@ def test_constructor_cast_object(self):
s = Series(date_range('1/1/2000', periods=10), dtype=object)
exp = Series(date_range('1/1/2000', periods=10))
tm.assert_series_equal(s, exp)

def test_overflow_coersion_signed_to_unsigned(self):
Series([-1], dtype='uint8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use

pytest.raises(OverflowError):
   ....

Copy link
Member

@gfyoung gfyoung Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific with regards to what @jreback said, here's an example:

with pytest.raises(OverflowError):
     Series([-1], dtype='uint16')

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any changes though?

# Raises if coersion from unsigned to signed with neg data
if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0:
raise OverflowError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add helpful messages to these raises.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@gfyoung
Copy link
Member

gfyoung commented Apr 2, 2017

I was able to raise an overflow exception (case II), but I wasn't able to raise it on case III.

Seems like you were able to successfully do that.

Also, don't forget to make similar patches to Index! And perhaps squash your commits when you have time so that the contributions and changes are a lot cleaner.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Apr 2, 2017
@jreback
Copy link
Contributor

jreback commented Apr 2, 2017

And perhaps squash your commits when you have time so that the contributions and changes are a lot cleaner.

This is not necessary. We will squash when merging. If it makes it cleaner for you, then sure,

@jreback
Copy link
Contributor

jreback commented Apr 2, 2017

@ucals did you push your changes?

@ucals
Copy link
Author

ucals commented Apr 2, 2017

Ops, forgot to push.... Pushing it right away, 1 sec

if not is_list_like(d2):
d2 = [data]
# Raises if coercion from unsigned to signed with neg data
if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is python any and will be non-performant, use what I gave before

(d2<0).any()

Copy link
Author

@ucals ucals Apr 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done... but this makes the test break, leading to this error:

TypeError: '<' not supported between instances of 'list' and 'int'

if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2):
raise OverflowError(
"Trying to coerce negative values to unsigned "
"integers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negative integers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"integers")

# Raises if coercion from float to int with fraction data
if any([is_float_dtype(type(x)) for x in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a 1-d array, you don't need any

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path):
subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

d2 = data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this section should be only if dtype is not None.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path):
subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to move this entire section (e.g. _try_cast) to pandas/types/cast. but I will do that after

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to move part/all of this ok with that as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
pd.TimedeltaIndex(list(values), dtype=dtype)]:
tm.assert_index_equal(res, idx)

def test_constructor_overflow_coersion_signed_to_unsigned(self):
with pytest.raises(OverflowError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
pd.TimedeltaIndex(list(values), dtype=dtype)]:
tm.assert_index_equal(res, idx)

def test_constructor_overflow_coersion_signed_to_unsigned(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the dtypes in a loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that this works correct for int (1 dtype is ok)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

with pytest.raises(OverflowError):
Index([-1], dtype='uint64')

def test_constructor_overflow_coersion_float_to_int(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check all int & unsigned int dtypes (use a loop)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that this works for float

@@ -831,3 +832,20 @@ def test_constructor_cast_object(self):
s = Series(date_range('1/1/2000', periods=10), dtype=object)
exp = Series(date_range('1/1/2000', periods=10))
tm.assert_series_equal(s, exp)

def test_constructor_overflow_coersion_signed_to_unsigned(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put dtypes in a loop

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -236,7 +237,7 @@ def test_constructor_corner(self):
tm.assertIsInstance(s, Series)

def test_constructor_sanitize(self):
s = Series(np.array([1., 1., 8.]), dtype='i8')
Copy link
Contributor

@jreback jreback Apr 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that the original raises

see my comment at the end

@jreback
Copy link
Contributor

jreback commented Apr 3, 2017

This should actually work. We allow casting to a compat dtype when there is no loss of precision.
so if you have a passed int/unsigned int with a float, then you cast to int, if it works you allow it to proceed. We already do this for indexes: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/numeric.py#L146.

In [1]: Series([1., 2., 3.], dtype='int')
Out[1]: 
0    1
1    2
2    3
dtype: int64

The counter example of course are:

This should raises (as this loses precision).

In [5]: Series([1, 1.5],dtype='int')
Out[5]: 
0    1
1    1
dtype: int64

This is good. (pls just confirm that our error messages are the same for Index/Series).

In [6]: Series([1, np.nan],dtype='int')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b52843b42d56> in <module>()
----> 1 Series([1, np.nan],dtype='int')

/Users/jreback/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    242             else:
    243                 data = _sanitize_array(data, index, dtype, copy,
--> 244                                        raise_cast_failure=True)
    245 
    246                 data = SingleBlockManager(data, index, fastpath=True)

/Users/jreback/pandas/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2856         if dtype is not None:
   2857             try:
-> 2858                 subarr = _try_cast(data, False)
   2859             except Exception:
   2860                 if raise_cast_failure:  # pragma: no cover

/Users/jreback/pandas/pandas/core/series.py in _try_cast(arr, take_fast_path)
   2810             subarr = maybe_cast_to_datetime(arr, dtype)
   2811             if not is_extension_type(subarr):
-> 2812                 subarr = np.array(subarr, dtype=dtype, copy=copy)
   2813         except (ValueError, TypeError):
   2814             if is_categorical_dtype(dtype):

ValueError: cannot convert float NaN to integer

@ucals
Copy link
Author

ucals commented Apr 9, 2017

Hey @jreback , sorry for the past days, got overloaded in my work. But now I can resume the work on this PR. I'm having some difficulties with the code, can you help me? Specifically:

  1. the (d2<0).any() makes a test break
  2. if I don't use any in is_float_dtype(type([1, 2, 3.5])), I can't make it work
  3. I tried to implement the _assert_safe_casting for series as you pointed out we already do for indexes, but couldn't manage to make it work.

Can you help me? Pls let me know (I can submit the code as is now, all tests are ok and your other comments are ok as well)

@jreback
Copy link
Contributor

jreback commented Apr 9, 2017

@ucals make sure everything is pushed up (looks like you updated comments, but code is older I think). I will take a look tomorrow.

@@ -1043,9 +1043,6 @@ Numeric
- Bug in ``pandas.tools.utils.cartesian_product()`` with large input can cause overflow on windows (:issue:`15265`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

Other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals these changes need to be reverted (aside from your 1 line addition)

@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
pass
# other iterable of some kind
subarr = _asarray_tuplesafe(data, dtype=object)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these don't below here at all, you can put them in numeric

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
pd.TimedeltaIndex(list(values), dtype=dtype)]:
tm.assert_index_equal(res, idx)

def test_constructor_overflow_coercion_signed_to_unsigned(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be tested in test_numeric.py

@@ -983,3 +984,60 @@ def find_common_type(types):
return np.object

return np.find_common_type(types, [])


def try_cast(arr, take_fast_path, data, dtype=None, copy=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signature should be

def try_cast(arr, dtype=None, copy=False, raise_cast_failure=False, fastpath=True)

def try_cast(arr, take_fast_path, data, dtype=None, copy=False,
raise_cast_failure=False):
""" try casting the array using dtype. Used in _sanitize_array procedure
at Series initialization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to maybe_cast_to_array

try:
subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply return if its an extension_type

subarr = np.array(subarr, dtype=dtype, copy=copy)

if dtype is not None:
d2 = data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of this is_list_like stuff, by definition this is already an array

if copy or not is_integer_dtype(data):
assert_safe_casting(data, subarr)

if any([is_float_dtype(type(x)) for x in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this is much simpler, it is already an array

raise OverflowError(
"Trying to coerce float values to integers")

except (ValueError, TypeError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part of the except you can simply put around the subarr = np.array(subarr, dtype=dtype, copy=copy) above.

The dtype checks need only occur after this.

def assert_safe_casting(data, subarr):
"""
Ensure incoming data can be represented as ints.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this routine is all over the place (in Index), will want to simply call this one.

@@ -2827,19 +2810,19 @@ def _try_cast(arr, take_fast_path):
# possibility of nan -> garbage
if is_float_dtype(data.dtype) and is_integer_dtype(dtype):
if not isnull(data).any():
subarr = _try_cast(data, True)
subarr = try_cast(data, True, data=data, dtype=dtype, copy=copy, raise_cast_failure=raise_cast_failure)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these lines are way too long so the linter will complain

@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
pass
# other iterable of some kind
subarr = _asarray_tuplesafe(data, dtype=object)

d2 = data
if not is_list_like(d2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need any of this is_list_like

subarr is already an array.

d2 = data
if not is_list_like(d2):
d2 = [data]
# Raises if coercion from unsigned to signed with neg data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 routines can actually live in pandas.types.cast, esentially all of this boils down to calling.

assert_safe_casting(subarr, dtype)



def assert_safe_casting(data, subarr):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a doc-string. the signature should be

assert_safe_casting(arr, dtype)

@ucals
Copy link
Author

ucals commented Apr 22, 2017

Hey @jreback , after trying and changing a lot over the past days, I'm starting over, fresh from master... will do a push today

@jreback
Copy link
Contributor

jreback commented May 13, 2017

can you rebase / update

@ucals
Copy link
Author

ucals commented May 15, 2017

Sure, @jreback , I will push current status tomorrow

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot change tests that don't work. there is still something odd with the logic. as I said there is a case that is allowed that you are raising on.


def maybe_cast_to_integer(arr, dtype):
"""
Find a common data type among the given dtypes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be more explicit about what this does. It takes an integer dtype and returns the casted version, raising for an incompat dtype.


Parameters
----------
arr : array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ndarray
np.dtype


Returns
-------
integer or unsigned integer array (or raise if the dtype is incompatible)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a Raises section


"""

if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are ndarays, so you don't need the np.asarray(arr) (though you can do it but it would be before any checks (it doesn't copy so its fine).


if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any():
raise OverflowError("Trying to coerce negative values to negative "
"integers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this then needs to cast, no?

e.g. (before the return)

arr = arr.astype(dtype, copy=False)

exp_index = pd.Index([0, 1, 2, 3, 1.1])
self._assert_setitem_index_conversion(obj, 1.1, exp_index, np.float64)
with pytest.raises(OverflowError):
exp_index = pd.Index([0, 1, 2, 3, 1.1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should not raise, rather the results coerce

@@ -373,8 +375,9 @@ def test_insert_index_int64(self):
self._assert_insert_conversion(obj, 1, exp, np.int64)

# int + float -> float
exp = pd.Index([1, 1.1, 2, 3, 4])
self._assert_insert_conversion(obj, 1.1, exp, np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, this is a legit operation

@@ -622,7 +625,8 @@ def test_where_series_int64(self):
self._where_int64_common(pd.Series)

def test_where_index_int64(self):
self._where_int64_common(pd.Index)
with pytest.raises(OverflowError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, remove the tests changes here

result = expected.sort_index()
tm.assert_series_equal(result, expected)
with pytest.raises(OverflowError):
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? why are you having this raise. this is a legitimate operation. pls revert this.


def test_constructor_overflow_coercion_float_to_int(self):
# GH 15832
with pytest.raises(OverflowError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, move with related tests & test all uint & int dtypes.
assert that this works with float32/float64.

@ucals
Copy link
Author

ucals commented May 20, 2017

Hey @jreback , thanks for the comments. I just implemented all your comments, with exception of "assert that this works with float32/float64". Can you pls show me how?
Also, the code is breaking 1 test in test_pytables.py, and I don't know how to fix yet.

@jreback
Copy link
Contributor

jreback commented May 20, 2017

Hey @jreback , thanks for the comments. I just implemented all your comments, with exception of "assert that this works with float32/float64". Can you pls show me how?
Also, the code is breaking 1 test in test_pytables.py, and I don't know how to fix yet.

If we are going to raise on [3], then assert that [2] does not.

In [2]: s = Series([1, 2.5, 3], dtype='float64')

In [3]: s
Out[3]: 
0    1.0
1    2.5
2    3.0
dtype: float64

In [5]: s = Series([1, 2.5, 3], dtype='int64')

In [6]: s
Out[6]: 
0    1
1    2
2    3
dtype: int64

@ucals
Copy link
Author

ucals commented May 21, 2017

Just double-checking: looking in the issue, I understood that pd.Index([1, 2, 3.5], dtype=int) should raise. What about pd.Series([1, 2, 3.5], dtype=int)? It shouldn't, right?

@jreback
Copy link
Contributor

jreback commented May 21, 2017

Just double-checking: looking in the issue, I understood that pd.Index([1, 2, 3.5], dtype=int) should raise. What about pd.Series([1, 2, 3.5], dtype=int)? It shouldn't, right?

would be weird if these didn't act the same

@ucals
Copy link
Author

ucals commented May 21, 2017

Yes, that's what I thought. That's why we wrote in pandas/tests/series/test_constructors.py:

# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
    with pytest.raises(ValueError):
        Series([1, 2, 3.5], dtype=t)

This test passes. However, it was written in pandas/tests/io/test_pytables.py:

# check with mixed dtypes
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))

If I don't change this test to expect a ValueError for the 4 int types, it will never pass, because it's the same case. These 2 cases are the same. How to differentiate those?

Thanks
Carlos

@jreback
Copy link
Contributor

jreback commented May 21, 2017

for t in ['uint8', 'uint16', 'uint32', 'uint64']:
with pytest.raises(ValueError):
Series([1, 2, 3.5], dtype=t)

should also raise for alll ints as well

check with mixed dtypes

df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))

this is wrong, change it to something like this

df1 = DataFrame(dict([(c, Series(np.random.randn(5).astype(c)))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))

@ucals
Copy link
Author

ucals commented May 22, 2017

@jreback , just implemented all changes, and all tests passed :)

@ucals
Copy link
Author

ucals commented May 27, 2017

so, any feedback?

@@ -248,6 +252,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
msg = str(e)
if 'cannot convert float' in msg:
raise
if 'Trying to coerce float values to integer' in msg:
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do an or statement? That seems more compact.

@@ -304,6 +304,18 @@ def test_astype(self):
i = Float64Index([0, 1.1, np.NAN])
pytest.raises(ValueError, lambda: i.astype(dtype))

# GH 15832
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in a separate test

try:
for t in ['float16', 'float32']:
Index([1, 2, 3.5], dtype=t)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to catch the exception
rather compare against an expected index

try:
for t in ['float16', 'float32']:
Series([1, 2, 3.5], dtype=t)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

def test_constructor_dtype_nocast(self):
# 1572
s = Series([1, 2, 3])
s = Series([1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dupe?

@jreback
Copy link
Contributor

jreback commented May 27, 2017

looks pretty good
just some test comments
pls add a whatsnew entry in 0.21.0
in other api changes section

@@ -678,6 +690,12 @@ def test_constructor_corner(self):
with tm.assert_raises_regex(TypeError, 'casting'):
Int64Index(arr_with_floats)

def test_constructor_overflow_coercion_signed_to_unsigned(self):
# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
Copy link
Member

@gfyoung gfyoung May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's utilize pytest.mark.parametrize because we can now 😄

Copy link
Member

@gfyoung gfyoung May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals : This one has not yet been fully addressed (see test_numeric.py).

@@ -303,9 +299,27 @@ def test_constructor_pass_nan_nat(self):
def test_constructor_cast(self):
Copy link
Member

@gfyoung gfyoung May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize pytest.mark.parametrize for cleaner tests.

@pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
...

@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self:
...

Also, I prefer if we can use tm.assert_raises_regex to specify specific error messages instead of just testing for the Exception type. That's a stronger test IMO. The test above:

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)

I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals : You broke up the tests, but your new tests still use pytest.raises when tm.assert_raises_regex would be more useful.

* If ``dtype`` is incompatible
ValueError
* If coercion from float to integer loses precision

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spirit of good documentation, let's add some examples here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Nice examples.

@ucals
Copy link
Author

ucals commented May 29, 2017

Done, thanks @jreback and @gfyoung !

Index([1, 2, 3.5], dtype=integers)

i = Index([1, 2, 3.5], dtype=floats)
assert i.equals(Index([1, 2, 3.5]))
Copy link
Member

@gfyoung gfyoung May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tm.assert_index_equal(i, Index([1, 2, 3.5])) is better for testing purposes.

with tm.assert_raises_regex(OverflowError, msg):
Series([-1], dtype=unsigned_integers)

@pytest.mark.parametrize("integers", ['uint8', 'uint16', 'uint32',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use some more informative names like int_dtype (and float_dtype, etc. for tests where you parametrized on dtype).

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to tighten up the guarantees of this function, even though its private. and be a bit more explicit.

@@ -54,7 +54,7 @@ Backwards incompatible API changes

Other API Changes
^^^^^^^^^^^^^^^^^

- Series and Index constructors now raises when data is incompatible with dtype (:issue:`15832`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with a passed dtype= kwarg.

def maybe_cast_to_integer(arr, dtype):
"""
Takes an integer dtype and returns the casted version, raising for an
incompatible dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag (even though this is private, nice to have)



def maybe_cast_to_integer(arr, dtype):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually can take any dtype and will return a casted version. It happens to check for integer/unsigned integer dtypes. So pls expand a little.

"integers")
elif is_integer_dtype(dtype) and (is_float_dtype(arr) or
is_object_dtype(arr)):
if not (arr == arr.astype(dtype)).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do

casted = arr.astype(dtype, copy=False)
if (arr == casted).all():
    return casted
raise ValueError(......)

a bit more code but then avoid coercion twice.

@@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
if is_integer_dtype(dtype):
inferred = lib.infer_dtype(data)
if inferred == 'integer':
data = maybe_cast_to_integer(data, dtype=dtype)
data = np.array(data, copy=copy, dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the np.array casting line now

@@ -1026,3 +1027,52 @@ def find_common_type(types):
return np.object

return np.find_common_type(types, [])


def maybe_cast_to_integer(arr, dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to maybe_cast_to_integer_array

Copy link
Member

@gfyoung gfyoung May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Perhaps we could just generalize to accept scalar as well as array (as consistent with other casting functions in this module) and keep the name as is?

data = np.array(data, copy=copy, dtype=dtype)
elif inferred in ['floating', 'mixed-integer-float']:
if isnull(data).any():
raise ValueError('cannot convert float '
'NaN to integer')
if inferred == 'mixed-integer-float':
maybe_cast_to_integer(data, dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be assigned?

@@ -246,7 +250,9 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,

except (TypeError, ValueError) as e:
msg = str(e)
if 'cannot convert float' in msg:
if 'cannot convert float' in msg or 'Trying to coerce ' \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do

if ('cannot convert float' in msg' or
        '.........' in msg): 
    raise

don't use \ if at all possible

if not (arr == arr.astype(dtype)).all():
raise ValueError("Trying to coerce float values to integers")

return arr.astype(dtype, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should cast here. This will coerce non-integer arrays to the passed dtype, which may not be nice.

instead branch on each of the if's (the integer and unsigned cases) and do an astype or raise).

so this will then explicity only work on integer dtypes that are passed, and not implicity on anything else

@jreback
Copy link
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@ucals
Copy link
Author

ucals commented Jun 20, 2017

Hey @jreback , sorry for the delay, a lot to do in work until end of Q2. I believe I will be able to look it again by 1st week of July, is that ok? Thanks!

@jreback
Copy link
Contributor

jreback commented Aug 17, 2017

closing as stale. this idea / fix looks pretty good though. so if you want to update, pls comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API/BUG: Handling Dtype Coercions in Series/Index
4 participants