API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859

ucals · 2017-04-01T15:18:32Z

closes API/BUG: Handling Dtype Coercions in Series/Index #15832
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

Hey @jreback , need a quick help here. I was able to raise an overflow exception (case II), but I wasn't able to raise it on case III. Can you help? Thanks

jreback · 2017-04-01T15:38:20Z

doc/source/whatsnew/v0.20.0.txt

@@ -1032,8 +1032,7 @@ Reshaping
 - Bug in ``pd.concat()`` in which concatting with an empty dataframe with ``join='inner'`` was being improperly handled (:issue:`15328`)
 - Bug with ``sort=True`` in ``DataFrame.join`` and ``pd.merge`` when joining on indexes (:issue:`15582`)

-Numeric
-^^^^^^^


pls cleanly add things (don't delete)

Agreed. None of the changes should be in your PR given what you are focusing on. Don't forget to add ones for your actual patches though!

jreback · 2017-04-01T15:39:23Z

pandas/core/series.py

@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path):
            subarr = maybe_cast_to_datetime(arr, dtype)
            if not is_extension_type(subarr):
                subarr = np.array(subarr, dtype=dtype, copy=copy)
+
+                # Raises if coersion from unsigned to signed with neg data


use is_unsigned_integer_dtype and is_float_dtype.

Also, coercion instead of coersion (you do this later on as well)

I don't see any changes though?

jreback · 2017-04-01T15:39:37Z

pandas/core/series.py

+                    raise OverflowError
+
+                # Raises if coersion from float to int with fraction data
+                inferred_type, _ = infer_dtype_from_array(data)


just check the dtype you don't need to infer

jreback · 2017-04-01T15:40:22Z

pandas/core/series.py

@@ -2809,6 +2810,16 @@ def _try_cast(arr, take_fast_path):
            subarr = maybe_cast_to_datetime(arr, dtype)
            if not is_extension_type(subarr):
                subarr = np.array(subarr, dtype=dtype, copy=copy)
+
+                # Raises if coersion from unsigned to signed with neg data
+                if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0:


(subarr < 0).any()

jreback · 2017-04-01T15:40:54Z

pandas/tests/series/test_constructors.py

@@ -831,3 +831,20 @@ def test_constructor_cast_object(self):
        s = Series(date_range('1/1/2000', periods=10), dtype=object)
        exp = Series(date_range('1/1/2000', periods=10))
        tm.assert_series_equal(s, exp)
+
+    def test_overflow_coersion_signed_to_unsigned(self):
+        Series([-1], dtype='uint8')


use

with pytest.raises(OverflowError): ...

To be more specific with regards to what @jreback said, here's an example:

with pytest.raises(OverflowError): Series([-1], dtype='uint16')

I don't see any changes though?

jreback · 2017-04-01T15:41:33Z

pandas/core/series.py

+                # Raises if coersion from unsigned to signed with neg data
+                if dtype == np.dtype('uint') and len(subarr[subarr < 0]) > 0:
+                    raise OverflowError
+


add helpful messages to these raises.
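As an illustration of what a more helpful message could look like, here is a minimal sketch using only NumPy. The helper name `check_unsigned` and the exact wording of the message are assumptions made for this example; they are not taken from the PR.

```python
import numpy as np


def check_unsigned(values, dtype):
    """Raise a descriptive error if negatives would be cast to unsigned.

    Hypothetical helper for illustration; not part of pandas.
    """
    dtype = np.dtype(dtype)
    arr = np.asarray(values)
    if np.issubdtype(dtype, np.unsignedinteger) and (arr < 0).any():
        # Name the target dtype and the offending minimum in the message.
        raise OverflowError(
            "Trying to coerce negative values to unsigned integers "
            "(dtype={}, min value={})".format(dtype, arr.min())
        )
    return arr
```

Including the target dtype and the smallest value makes it easier to tell from a traceback which input triggered the failure.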

gfyoung · 2017-04-02T06:04:05Z

I was able to raise an overflow exception (case II), but I wasn't able to raise it on case III.

Seems like you were able to successfully do that.

Also, don't forget to make similar patches to Index! And perhaps squash your commits when you have time so that the contributions and changes are a lot cleaner.

jreback · 2017-04-02T14:23:09Z

And perhaps squash your commits when you have time so that the contributions and changes are a lot cleaner.

This is not necessary. We will squash when merging. If it makes it cleaner for you, then sure,

jreback · 2017-04-02T22:27:39Z

@ucals did you push your changes?

ucals · 2017-04-02T23:31:46Z

Oops, forgot to push. Pushing it right away, 1 sec

jreback · 2017-04-03T12:46:35Z

pandas/core/series.py

+                if not is_list_like(d2):
+                    d2 = [data]
+                # Raises if coercion from unsigned to signed with neg data
+                if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2):


this is python any and will be non-performant, use what I gave before

(d2<0).any()

done... but this makes the test break, leading to this error:

TypeError: '<' not supported between instances of 'list' and 'int'
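The TypeError above is what Python 3 raises when a plain list is compared with an int, so the vectorized form needs an ndarray first. A small sketch of the difference, assuming the data may arrive as a list or a scalar (variable names are illustrative only):

```python
import numpy as np

data = [1, -2, 3]

# A plain list does not support elementwise comparison in Python 3.
try:
    data < 0
    list_comparison_failed = False
except TypeError:
    list_comparison_failed = True

# Converting first gives a boolean mask, and .any() runs in compiled code
# rather than looping in Python like the builtin any().
arr = np.asarray(data)
has_negative = bool((arr < 0).any())

# Scalars work too: np.asarray wraps them in a 0-d array.
scalar_has_negative = bool((np.asarray(-1) < 0).any())
```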

jreback · 2017-04-03T12:46:58Z

pandas/core/series.py

+                if is_unsigned_integer_dtype(dtype) and any(x < 0 for x in d2):
+                    raise OverflowError(
+                        "Trying to coerce negative values to unsigned "
+                        "integers")


negative integers

jreback · 2017-04-03T12:47:14Z

pandas/core/series.py

+                        "integers")
+
+                # Raises if coercion from float to int with fraction data
+                if any([is_float_dtype(type(x)) for x in


this is a 1-d array, you don't need any

jreback · 2017-04-03T12:48:01Z

pandas/core/series.py

@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path):
            subarr = maybe_cast_to_datetime(arr, dtype)
            if not is_extension_type(subarr):
                subarr = np.array(subarr, dtype=dtype, copy=copy)
+
+                d2 = data


this section should be only if dtype is not None.

jreback · 2017-04-03T12:50:10Z

pandas/core/series.py

@@ -2809,6 +2810,22 @@ def _try_cast(arr, take_fast_path):
            subarr = maybe_cast_to_datetime(arr, dtype)
            if not is_extension_type(subarr):
                subarr = np.array(subarr, dtype=dtype, copy=copy)
+


I am going to move this entire section (e.g. _try_cast) to pandas/types/cast. but I will do that after

if you want to move part/all of this ok with that as well.

jreback · 2017-04-03T12:50:45Z

pandas/tests/indexes/test_base.py

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
                        pd.TimedeltaIndex(list(values), dtype=dtype)]:
                tm.assert_index_equal(res, idx)

+    def test_constructor_overflow_coersion_signed_to_unsigned(self):
+        with pytest.raises(OverflowError):


add the issue number as a comment

jreback · 2017-04-03T12:52:10Z

pandas/tests/indexes/test_base.py

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
                        pd.TimedeltaIndex(list(values), dtype=dtype)]:
                tm.assert_index_equal(res, idx)

+    def test_constructor_overflow_coersion_signed_to_unsigned(self):


put the dtypes in a loop

assert that this works correct for int (1 dtype is ok)

jreback · 2017-04-03T12:52:57Z

pandas/tests/indexes/test_base.py

+        with pytest.raises(OverflowError):
+            Index([-1], dtype='uint64')
+
+    def test_constructor_overflow_coersion_float_to_int(self):


check all int & unsigned int dtypes (use a loop)

assert that this works for float

jreback · 2017-04-03T12:54:09Z

pandas/tests/series/test_constructors.py

@@ -831,3 +832,20 @@ def test_constructor_cast_object(self):
        s = Series(date_range('1/1/2000', periods=10), dtype=object)
        exp = Series(date_range('1/1/2000', periods=10))
        tm.assert_series_equal(s, exp)
+
+    def test_constructor_overflow_coersion_signed_to_unsigned(self):


put dtypes in a loop

jreback · 2017-04-03T12:54:32Z

pandas/tests/series/test_constructors.py

@@ -236,7 +237,7 @@ def test_constructor_corner(self):
        tm.assertIsInstance(s, Series)

    def test_constructor_sanitize(self):
-        s = Series(np.array([1., 1., 8.]), dtype='i8')


~~assert that the original raises~~

see my comment at the end

jreback · 2017-04-03T12:57:48Z

This should actually work. We allow casting to a compat dtype when there is no loss of precision.
So if an int/unsigned int dtype is passed with float data, you cast to int and, if it works, allow it to proceed. We already do this for indexes: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/numeric.py#L146.

In [1]: Series([1., 2., 3.], dtype='int')
Out[1]: 
0    1
1    2
2    3
dtype: int64

The counterexamples, of course, are:

This should raise (as this loses precision).

In [5]: Series([1, 1.5],dtype='int')
Out[5]: 
0    1
1    1
dtype: int64

This is good. (pls just confirm that our error messages are the same for Index/Series).

In [6]: Series([1, np.nan],dtype='int')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b52843b42d56> in <module>()
----> 1 Series([1, np.nan],dtype='int')

/Users/jreback/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    242             else:
    243                 data = _sanitize_array(data, index, dtype, copy,
--> 244                                        raise_cast_failure=True)
    245 
    246                 data = SingleBlockManager(data, index, fastpath=True)

/Users/jreback/pandas/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2856         if dtype is not None:
   2857             try:
-> 2858                 subarr = _try_cast(data, False)
   2859             except Exception:
   2860                 if raise_cast_failure:  # pragma: no cover

/Users/jreback/pandas/pandas/core/series.py in _try_cast(arr, take_fast_path)
   2810             subarr = maybe_cast_to_datetime(arr, dtype)
   2811             if not is_extension_type(subarr):
-> 2812                 subarr = np.array(subarr, dtype=dtype, copy=copy)
   2813         except (ValueError, TypeError):
   2814             if is_categorical_dtype(dtype):

ValueError: cannot convert float NaN to integer
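The three cases above (lossless floats are allowed, fractional values should raise, NaN already raises) can be sketched with NumPy alone. The function name `cast_float_to_int_if_lossless` is hypothetical, and this is not the pandas implementation, only one way to express the rule:

```python
import numpy as np


def cast_float_to_int_if_lossless(values, dtype="int64"):
    """Cast float data to an integer dtype only when nothing is lost.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    arr = np.asarray(values, dtype="float64")
    if np.isnan(arr).any():
        raise ValueError("cannot convert float NaN to integer")
    casted = arr.astype(dtype)
    # If the round trip changes any element, the cast dropped a fraction.
    if not (casted == arr).all():
        raise ValueError("Trying to coerce float values to integers")
    return casted
```

Under these assumptions, `[1., 2., 3.]` casts cleanly, while `[1, 1.5]` and `[1, np.nan]` both raise `ValueError`.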

ucals · 2017-04-09T13:18:52Z

Hey @jreback , sorry for the past days, got overloaded in my work. But now I can resume the work on this PR. I'm having some difficulties with the code, can you help me? Specifically:

the (d2<0).any() makes a test break
if I don't use any in is_float_dtype(type([1, 2, 3.5])), I can't make it work
I tried to implement the _assert_safe_casting for series as you pointed out we already do for indexes, but couldn't manage to make it work.

Can you help me? Pls let me know (I can submit the code as is now, all tests are ok and your other comments are ok as well)

jreback · 2017-04-09T13:44:51Z

@ucals make sure everything is pushed up (looks like you updated comments, but code is older I think). I will take a look tomorrow.

jreback · 2017-04-09T14:41:09Z

doc/source/whatsnew/v0.20.0.txt

@@ -1043,9 +1043,6 @@ Numeric
 - Bug in ``pandas.tools.utils.cartesian_product()`` with large input can cause overflow on windows (:issue:`15265`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

-Other


@ucals these changes need to be reverted (aside from your 1 line addition)

jreback · 2017-04-09T14:42:42Z

pandas/indexes/base.py

@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
                        pass
            # other iterable of some kind
            subarr = _asarray_tuplesafe(data, dtype=object)
+


these don't belong here at all, you can put them in numeric

jreback · 2017-04-09T14:43:00Z

pandas/tests/indexes/test_base.py

@@ -378,6 +379,23 @@ def test_constructor_dtypes_timedelta(self):
                        pd.TimedeltaIndex(list(values), dtype=dtype)]:
                tm.assert_index_equal(res, idx)

+    def test_constructor_overflow_coercion_signed_to_unsigned(self):


these should be tested in test_numeric.py

jreback · 2017-04-09T14:45:02Z

pandas/types/cast.py

@@ -983,3 +984,60 @@ def find_common_type(types):
            return np.object

    return np.find_common_type(types, [])
+
+
+def try_cast(arr, take_fast_path, data, dtype=None, copy=False,


signature should be

def try_cast(arr, dtype=None, copy=False, raise_cast_failure=False, fastpath=True)

jreback · 2017-04-09T14:46:57Z

pandas/types/cast.py

+def try_cast(arr, take_fast_path, data, dtype=None, copy=False,
+             raise_cast_failure=False):
+    """ try casting the array using dtype. Used in _sanitize_array procedure
+    at Series initialization


rename to maybe_cast_to_array

jreback · 2017-04-09T14:48:27Z

pandas/types/cast.py

+    try:
+        subarr = maybe_cast_to_datetime(arr, dtype)
+        if not is_extension_type(subarr):
+            subarr = np.array(subarr, dtype=dtype, copy=copy)


you can simply return if it's an extension_type

jreback · 2017-04-09T14:49:03Z

pandas/types/cast.py

+            subarr = np.array(subarr, dtype=dtype, copy=copy)
+
+            if dtype is not None:
+                d2 = data


you don't need any of this is_list_like stuff, by definition this is already an array

jreback · 2017-04-09T14:49:30Z

pandas/types/cast.py

+                if copy or not is_integer_dtype(data):
+                    assert_safe_casting(data, subarr)
+
+                if any([is_float_dtype(type(x)) for x in


again this is much simpler, it is already an array

jreback · 2017-04-09T14:50:37Z

pandas/types/cast.py

+                    raise OverflowError(
+                        "Trying to coerce float values to integers")
+
+    except (ValueError, TypeError):


this part of the except you can simply put around the subarr = np.array(subarr, dtype=dtype, copy=copy) above.

The dtype checks need only occur after this.

jreback · 2017-04-09T14:51:17Z

pandas/types/cast.py

+def assert_safe_casting(data, subarr):
+    """
+    Ensure incoming data can be represented as ints.
+    """


this routine is all over the place (in Index), will want to simply call this one.

jreback · 2017-04-09T14:51:51Z

pandas/core/series.py

@@ -2827,19 +2810,19 @@ def _try_cast(arr, take_fast_path):
            # possibility of nan -> garbage
            if is_float_dtype(data.dtype) and is_integer_dtype(dtype):
                if not isnull(data).any():
-                    subarr = _try_cast(data, True)
+                    subarr = try_cast(data, True, data=data, dtype=dtype, copy=copy, raise_cast_failure=raise_cast_failure)


these lines are way too long so the linter will complain

jreback · 2017-04-09T14:52:24Z

pandas/indexes/base.py

@@ -326,6 +326,22 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
                        pass
            # other iterable of some kind
            subarr = _asarray_tuplesafe(data, dtype=object)
+
+            d2 = data
+            if not is_list_like(d2):


you don't need any of this is_list_like

subarr is already an array.

jreback · 2017-04-09T14:53:48Z

pandas/indexes/base.py

+            d2 = data
+            if not is_list_like(d2):
+                d2 = [data]
+            # Raises if coercion from unsigned to signed with neg data


these 2 routines can actually live in pandas.types.cast, essentially all of this boils down to calling:

assert_safe_casting(subarr, dtype)

jreback · 2017-04-09T14:54:22Z

pandas/types/cast.py

+
+
+def assert_safe_casting(data, subarr):
+    """


add a doc-string. the signature should be

assert_safe_casting(arr, dtype)

ucals · 2017-04-22T19:04:08Z

Hey @jreback , after trying and changing a lot over the past days, I'm starting over, fresh from master... will do a push today

jreback · 2017-05-13T21:40:10Z

can you rebase / update

ucals · 2017-05-15T02:32:55Z

Sure, @jreback , I will push current status tomorrow

jreback

you cannot change tests that don't work. there is still something odd with the logic. as I said there is a case that is allowed that you are raising on.

jreback · 2017-05-20T15:03:26Z

pandas/core/dtypes/cast.py

+
+def maybe_cast_to_integer(arr, dtype):
+    """
+    Find a common data type among the given dtypes.


be more explicit about what this does. It takes an integer dtype and returns the casted version, raising for an incompat dtype.

jreback · 2017-05-20T15:03:39Z

pandas/core/dtypes/cast.py

+
+    Parameters
+    ----------
+    arr : array


ndarray
np.dtype

jreback · 2017-05-20T15:03:52Z

pandas/core/dtypes/cast.py

+
+    Returns
+    -------
+    integer or unsigned integer array (or raise if the dtype is incompatible)


add a Raises section

jreback · 2017-05-20T15:04:23Z

pandas/core/dtypes/cast.py

+
+    """
+
+    if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any():


these are ndarrays, so you don't need the np.asarray(arr) (though you can do it, but it would be before any checks; it doesn't copy so it's fine).
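The "doesn't copy" point can be checked directly. A short sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, -2, 3])

# For an existing ndarray, np.asarray hands back the very same object,
# so calling it once before the checks costs nothing.
same_object = np.asarray(arr) is arr

# For a list it has to allocate a new array.
from_list = np.asarray([1, -2, 3])
```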

jreback · 2017-05-20T15:05:07Z

pandas/core/dtypes/cast.py

+
+    if is_unsigned_integer_dtype(dtype) and (np.asarray(arr) < 0).any():
+        raise OverflowError("Trying to coerce negative values to negative "
+                            "integers")


this then needs to cast, no?

e.g. (before the return)

arr = arr.astype(dtype, copy=False)

jreback · 2017-05-20T15:09:15Z

pandas/tests/indexing/test_coercion.py

-        exp_index = pd.Index([0, 1, 2, 3, 1.1])
-        self._assert_setitem_index_conversion(obj, 1.1, exp_index, np.float64)
+        with pytest.raises(OverflowError):
+            exp_index = pd.Index([0, 1, 2, 3, 1.1])


these should not raise, rather the results coerce

jreback · 2017-05-20T15:09:25Z

pandas/tests/indexing/test_coercion.py

@@ -373,8 +375,9 @@ def test_insert_index_int64(self):
        self._assert_insert_conversion(obj, 1, exp, np.int64)

        # int + float -> float
-        exp = pd.Index([1, 1.1, 2, 3, 4])
-        self._assert_insert_conversion(obj, 1.1, exp, np.float64)


same, this is a legit operation

jreback · 2017-05-20T15:09:41Z

pandas/tests/indexing/test_coercion.py

@@ -622,7 +625,8 @@ def test_where_series_int64(self):
        self._where_int64_common(pd.Series)

    def test_where_index_int64(self):
-        self._where_int64_common(pd.Index)
+        with pytest.raises(OverflowError):


same, remove the tests changes here

jreback · 2017-05-20T15:10:03Z

pandas/tests/io/test_pytables.py

-            result = expected.sort_index()
-            tm.assert_series_equal(result, expected)
+            with pytest.raises(OverflowError):
+                df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))


huh? why are you having this raise. this is a legitimate operation. pls revert this.

jreback · 2017-05-20T15:10:44Z

pandas/tests/series/test_constructors.py

+
+    def test_constructor_overflow_coercion_float_to_int(self):
+        # GH 15832
+        with pytest.raises(OverflowError):


same as above, move with related tests & test all uint & int dtypes.
assert that this works with float32/float64.

ucals · 2017-05-20T20:13:00Z

Hey @jreback , thanks for the comments. I just implemented all your comments, with exception of "assert that this works with float32/float64". Can you pls show me how?
Also, the code is breaking 1 test in test_pytables.py, and I don't know how to fix yet.

jreback · 2017-05-20T20:41:17Z

Hey @jreback , thanks for the comments. I just implemented all your comments, with exception of "assert that this works with float32/float64". Can you pls show me how?
Also, the code is breaking 1 test in test_pytables.py, and I don't know how to fix yet.

If we are going to raise on [3], then assert that [2] does not.

In [2]: s = Series([1, 2.5, 3], dtype='float64')

In [3]: s
Out[3]: 
0    1.0
1    2.5
2    3.0
dtype: float64

In [5]: s = Series([1, 2.5, 3], dtype='int64')

In [6]: s
Out[6]: 
0    1
1    2
2    3
dtype: int64

ucals · 2017-05-21T15:27:21Z

Just double-checking: looking in the issue, I understood that pd.Index([1, 2, 3.5], dtype=int) should raise. What about pd.Series([1, 2, 3.5], dtype=int)? It shouldn't, right?

jreback · 2017-05-21T15:51:52Z

Just double-checking: looking in the issue, I understood that pd.Index([1, 2, 3.5], dtype=int) should raise. What about pd.Series([1, 2, 3.5], dtype=int)? It shouldn't, right?

would be weird if these didn't act the same

ucals · 2017-05-21T19:38:39Z

Yes, that's what I thought. That's why we wrote in pandas/tests/series/test_constructors.py:

# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
    with pytest.raises(ValueError):
        Series([1, 2, 3.5], dtype=t)

This test passes. However, it was written in pandas/tests/io/test_pytables.py:

# check with mixed dtypes
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))

If I don't change this test to expect a ValueError for the 4 int types, it will never pass, because it's the same case. These 2 cases are the same. How to differentiate those?

Thanks
Carlos

jreback · 2017-05-21T19:51:15Z

for t in ['uint8', 'uint16', 'uint32', 'uint64']:
    with pytest.raises(ValueError):
        Series([1, 2, 3.5], dtype=t)

should also raise for all ints as well

# check with mixed dtypes

df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))

this is wrong, change it to something like this

df1 = DataFrame(dict([(c, Series(np.random.randn(5).astype(c)))
    for c in ['float32', 'float64', 'int32', 'int64', 'int16', 'int8']]))
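The distinction is between asking the constructor to coerce fractional floats (which would now raise) and truncating explicitly with `.astype` before constructing. A NumPy-only sketch of the explicit route, using a fixed seed that is an assumption of this example rather than part of the original test:

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
values = rng.randn(5)           # float64 samples with fractional parts

columns = {}
for name in ["float32", "float64", "int32", "int64", "int16", "int8"]:
    # The truncation toward zero happens here, explicitly, so nothing is
    # left for a constructor to coerce silently afterwards.
    columns[name] = values.astype(name)
```

Each array already has the requested dtype, so passing it on without a `dtype=` argument involves no further conversion.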

ucals · 2017-05-22T02:14:56Z

@jreback , just implemented all changes, and all tests passed :)

ucals · 2017-05-27T02:59:56Z

so, any feedback?

jreback · 2017-05-27T20:23:04Z

pandas/core/indexes/base.py

@@ -248,6 +252,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
                    msg = str(e)
                    if 'cannot convert float' in msg:
                        raise
+                    if 'Trying to coerce float values to integer' in msg:
+                        raise


Why not just do an or statement? That seems more compact.

jreback · 2017-05-27T20:23:51Z

pandas/tests/indexes/test_numeric.py

@@ -304,6 +304,18 @@ def test_astype(self):
            i = Float64Index([0, 1.1, np.NAN])
            pytest.raises(ValueError, lambda: i.astype(dtype))

+        # GH 15832


put this in a separate test

jreback · 2017-05-27T20:24:54Z

pandas/tests/indexes/test_numeric.py

+        try:
+            for t in ['float16', 'float32']:
+                Index([1, 2, 3.5], dtype=t)
+        except ValueError:


you don't need to catch the exception
rather compare against an expected index

jreback · 2017-05-27T20:25:27Z

pandas/tests/series/test_constructors.py

+        try:
+            for t in ['float16', 'float32']:
+                Series([1, 2, 3.5], dtype=t)
+        except ValueError:


jreback · 2017-05-27T20:25:35Z

pandas/tests/series/test_constructors.py

    def test_constructor_dtype_nocast(self):
        # 1572
        s = Series([1, 2, 3])
+        s = Series([1, 2, 3])


jreback · 2017-05-27T20:26:54Z

looks pretty good
just some test comments
pls add a whatsnew entry in 0.21.0
in other api changes section

gfyoung · 2017-05-28T00:01:20Z

pandas/tests/indexes/test_numeric.py

@@ -678,6 +690,12 @@ def test_constructor_corner(self):
        with tm.assert_raises_regex(TypeError, 'casting'):
            Int64Index(arr_with_floats)

+    def test_constructor_overflow_coercion_signed_to_unsigned(self):
+        # GH 15832
+        for t in ['uint8', 'uint16', 'uint32', 'uint64']:


Let's utilize pytest.mark.parametrize because we can now 😄

@ucals : This one has not yet been fully addressed (see test_numeric.py).

gfyoung · 2017-05-28T00:04:57Z

pandas/tests/series/test_constructors.py

@@ -303,9 +299,27 @@ def test_constructor_pass_nan_nat(self):
    def test_constructor_cast(self):


This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize pytest.mark.parametrize for cleaner tests.

@pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
    ...

@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self):
    ...

Also, I prefer if we can use tm.assert_raises_regex to specify specific error messages instead of just testing for the Exception type. That's a stronger test IMO. The test above:

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)

I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own.

@ucals : You broke up the tests, but your new tests still use pytest.raises when tm.assert_raises_regex would be more useful.
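A sketch of the requested test shape. Note the substitutions: `pytest.raises(..., match=...)` is used here in place of `tm.assert_raises_regex` to check the message, and `coerce_unsigned` is a hypothetical stand-in for the constructor path, so this shows the pattern rather than the actual pandas test.

```python
import numpy as np
import pytest


def coerce_unsigned(values, dtype):
    # Hypothetical stand-in for the constructor path under test.
    arr = np.asarray(values)
    if (arr < 0).any():
        raise OverflowError(
            "Trying to coerce negative values to unsigned integers")
    return arr.astype(dtype)


@pytest.mark.parametrize("uint_dtype",
                         ["uint8", "uint16", "uint32", "uint64"])
def test_unsigned_dtype_overflow(uint_dtype):
    # GH 15832
    msg = "Trying to coerce negative values"
    # Matching on the message makes the test fail if a different
    # OverflowError is raised for an unrelated reason.
    with pytest.raises(OverflowError, match=msg):
        coerce_unsigned([-1], uint_dtype)
```

Parametrizing on the dtype gives one reported test case per dtype instead of a loop that stops at the first failure.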

gfyoung · 2017-05-28T00:06:08Z

pandas/core/dtypes/cast.py

+        * If ``dtype`` is incompatible
+    ValueError
+        * If coercion from float to integer loses precision
+


In the spirit of good documentation, let's add some examples here!

Excellent! Nice examples.

ucals · 2017-05-29T07:17:15Z

Done, thanks @jreback and @gfyoung !

gfyoung · 2017-05-29T08:17:12Z

pandas/tests/indexes/test_numeric.py

+            Index([1, 2, 3.5], dtype=integers)
+
+        i = Index([1, 2, 3.5], dtype=floats)
+        assert i.equals(Index([1, 2, 3.5]))


tm.assert_index_equal(i, Index([1, 2, 3.5])) is better for testing purposes.
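One reason the assertion helper is the stronger check: `Index.equals` compares elements and order but not dtype. A small sketch using the public `pandas.testing` module (the thread's `tm` alias refers to pandas' internal testing module, so the import here is a substitution):

```python
import pandas as pd
from pandas.testing import assert_index_equal

left = pd.Index([1, 2, 3], dtype="int64")
right = pd.Index([1, 2, 3], dtype="uint64")

# equals() compares the elements and their order, not the dtype.
loose_result = left.equals(right)

# assert_index_equal also checks the dtype and reports what differs.
try:
    assert_index_equal(left, right)
    strict_passed = True
except AssertionError:
    strict_passed = False
```

For a test about dtype coercion, a check that ignores the dtype would pass even if the coercion produced the wrong type.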

gfyoung · 2017-05-29T08:18:18Z

pandas/tests/series/test_constructors.py

+        with tm.assert_raises_regex(OverflowError, msg):
+            Series([-1], dtype=unsigned_integers)
+
+    @pytest.mark.parametrize("integers", ['uint8', 'uint16', 'uint32',


Let's use some more informative names like int_dtype (and float_dtype, etc. for tests where you parametrized on dtype).

jreback

need to tighten up the guarantees of this function, even though it's private, and be a bit more explicit.

jreback · 2017-05-29T16:06:56Z

doc/source/whatsnew/v0.21.0.txt

@@ -54,7 +54,7 @@ Backwards incompatible API changes

 Other API Changes
 ^^^^^^^^^^^^^^^^^
-
+- Series and Index constructors now raises when data is incompatible with dtype (:issue:`15832`)


with a passed dtype= kwarg.

jreback · 2017-05-29T16:07:16Z

pandas/core/dtypes/cast.py

+def maybe_cast_to_integer(arr, dtype):
+    """
+    Takes an integer dtype and returns the casted version, raising for an
+    incompatible dtype.


add a versionadded tag (even though this is private, nice to have)

jreback · 2017-05-29T16:09:21Z

pandas/core/dtypes/cast.py

+
+
+def maybe_cast_to_integer(arr, dtype):
+    """


This actually can take any dtype and will return a casted version. It happens to check for integer/unsigned integer dtypes. So pls expand a little.

jreback · 2017-05-29T16:10:00Z

pandas/core/dtypes/cast.py

+                            "integers")
+    elif is_integer_dtype(dtype) and (is_float_dtype(arr) or
+                                      is_object_dtype(arr)):
+        if not (arr == arr.astype(dtype)).all():


do

casted = arr.astype(dtype, copy=False)
if (arr == casted).all():
    return casted
raise ValueError(......)

a bit more code but then avoid coercion twice.

jreback · 2017-05-29T16:12:17Z

pandas/core/indexes/base.py

@@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
                    if is_integer_dtype(dtype):
                        inferred = lib.infer_dtype(data)
                        if inferred == 'integer':
+                            data = maybe_cast_to_integer(data, dtype=dtype)
                            data = np.array(data, copy=copy, dtype=dtype)


we don't need the np.array casting line now

jreback · 2017-05-29T16:12:31Z

pandas/core/dtypes/cast.py

@@ -1026,3 +1027,52 @@ def find_common_type(types):
            return np.object

    return np.find_common_type(types, [])
+
+
+def maybe_cast_to_integer(arr, dtype):


rename to maybe_cast_to_integer_array

@jreback : Perhaps we could just generalize to accept scalar as well as array (as consistent with other casting functions in this module) and keep the name as is?

jreback · 2017-05-29T16:12:57Z

pandas/core/indexes/base.py

                            data = np.array(data, copy=copy, dtype=dtype)
                        elif inferred in ['floating', 'mixed-integer-float']:
                            if isnull(data).any():
                                raise ValueError('cannot convert float '
                                                 'NaN to integer')
+                            if inferred == 'mixed-integer-float':
+                                maybe_cast_to_integer(data, dtype)


does this need to be assigned?

jreback · 2017-05-29T16:13:41Z

pandas/core/indexes/base.py

@@ -246,7 +250,9 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,

                except (TypeError, ValueError) as e:
                    msg = str(e)
-                    if 'cannot convert float' in msg:
+                    if 'cannot convert float' in msg or 'Trying to coerce ' \


do

if ('cannot convert float' in msg or
        '.........' in msg):
    raise

don't use \ if at all possible
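A runnable version of the parenthesized form, wrapped in a hypothetical helper (`should_reraise` is a name chosen for this example) and using the two message fragments quoted in the diff:

```python
def should_reraise(msg):
    """Return True for the cast failures that must reach the caller."""
    # Parentheses let the condition span lines without a trailing backslash.
    return ('cannot convert float' in msg or
            'Trying to coerce float values to integer' in msg)
```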

jreback · 2017-05-29T16:15:19Z

pandas/core/dtypes/cast.py

+        if not (arr == arr.astype(dtype)).all():
+            raise ValueError("Trying to coerce float values to integers")
+
+    return arr.astype(dtype, copy=False)


I don't think you should cast here. This will coerce non-integer arrays to the passed dtype, which may not be nice.

instead branch on each of the if's (the integer and unsigned cases) and do an astype or raise.

so this will then explicitly only work on integer dtypes that are passed, and not implicitly on anything else
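One possible reading of this suggestion, sketched with NumPy only. The function name `cast_to_integer_array_sketch` is hypothetical, numeric input is assumed, and returning non-integer targets unchanged is an interpretation of "not implicitly on anything else" rather than something the thread spells out:

```python
import numpy as np


def cast_to_integer_array_sketch(arr, dtype):
    """Cast ``arr`` to an integer ``dtype`` or raise; ignore other dtypes.

    Hypothetical sketch of the branching described above.
    """
    arr = np.asarray(arr)
    dtype = np.dtype(dtype)

    if not np.issubdtype(dtype, np.integer):
        # Not an integer target: do not coerce implicitly.
        return arr

    if np.issubdtype(dtype, np.unsignedinteger) and (arr < 0).any():
        raise OverflowError(
            "Trying to coerce negative values to unsigned integers")

    if np.issubdtype(arr.dtype, np.floating):
        if np.isnan(arr).any():
            raise ValueError("cannot convert float NaN to integer")
        # Cast once and reuse the result for both the check and the return.
        casted = arr.astype(dtype)
        if not (casted == arr).all():
            raise ValueError("Trying to coerce float values to integers")
        return casted

    return arr.astype(dtype, copy=False)
```

With this shape, `[1.0, 2.0]` to `int64` succeeds, `[1, 2.5]` to `int64` raises `ValueError`, `[-1]` to `uint8` raises `OverflowError`, and a `float64` target leaves the data untouched.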

jreback · 2017-06-10T19:08:37Z

can you rebase and update?

ucals · 2017-06-20T17:19:52Z

Hey @jreback , sorry for the delay, a lot to do in work until end of Q2. I believe I will be able to look it again by 1st week of July, is that ok? Thanks!

jreback · 2017-08-17T10:34:25Z

closing as stale. this idea / fix looks pretty good though. so if you want to update, pls comment.

Carlos Souza added 7 commits March 20, 2017 19:32

Test

676a4e5

Sync fork

e12bca7

Merge remote-tracking branch 'upstream/master'

9fc617b

Merge remote-tracking branch 'upstream/master'

8b463cb

Merge remote-tracking branch 'upstream/master'

43456a5

Merge remote-tracking branch 'upstream/master'

faa5c5c

Merge remote-tracking branch 'upstream/master'

1c90e7e

jreback requested changes Apr 1, 2017

jreback added the Bug and Dtype Conversions labels Apr 2, 2017

jreback requested changes Apr 3, 2017

jreback mentioned this pull request Apr 3, 2017

BUG: read_csv returns object dtype for dates in empty frame #15524

Open

jreback reviewed Apr 9, 2017

jreback requested changes Apr 9, 2017

Merge remote-tracking branch 'upstream/master'

278c2fb

Carlos Souza added 3 commits May 16, 2017 10:50

Merge remote-tracking branch 'upstream/master'

a8cd752

Adding failing tests

bbdea4b

Code passing new tests from issue GH 15832 (but breaking 2 other test…

d2e26ac

…s - help needed)

jreback requested changes May 20, 2017

Adding all @jreback comments

359086d

Adding tests for all ints and adjust pytables test

50950f5

jreback requested changes May 27, 2017

gfyoung reviewed May 28, 2017

Carlos Souza added 2 commits May 29, 2017 03:12

Implementing final adjustments from @jreback and @gfyoung

012fb57

Merge branch 'master' into bug-fix-15832

35a5ff1

Carlos Souza added 2 commits May 29, 2017 04:11

Adding final comments from @gfyoung

b1e6632

Merge branch 'bug-fix-15832' of github.com:ucals/pandas into bug-fix-…

a1033cb

…15832

gfyoung reviewed May 29, 2017

jreback requested changes May 29, 2017

Adding final comments from @jreback

b78f4cc

jreback closed this Aug 17, 2017

gfyoung mentioned this pull request Jun 13, 2018

API/BUG: Raise when int-dtype coercions fail #21456

Merged