Skip to content

API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
676a4e5
Test
Mar 20, 2017
e12bca7
Sync fork
Mar 24, 2017
9fc617b
Merge remote-tracking branch 'upstream/master'
Mar 26, 2017
8b463cb
Merge remote-tracking branch 'upstream/master'
Mar 26, 2017
43456a5
Merge remote-tracking branch 'upstream/master'
Mar 28, 2017
faa5c5c
Merge remote-tracking branch 'upstream/master'
Mar 29, 2017
1c90e7e
Merge remote-tracking branch 'upstream/master'
Apr 1, 2017
278c2fb
Merge remote-tracking branch 'upstream/master'
Apr 22, 2017
a8cd752
Merge remote-tracking branch 'upstream/master'
May 16, 2017
bbdea4b
Adding failing tests
Apr 22, 2017
d2e26ac
Code passing new tests from issue GH 15832 (but breaking 2 other test…
May 16, 2017
3c868a4
Fixing 1 of 2 new issues: now only raises if trying to coerce and dty…
May 17, 2017
20ac5c6
Fixing 2nd issue: now it raises only if the coercion from float to in…
May 17, 2017
14ed83b
Fixing broken tests
May 19, 2017
3d0e76f
Adding maybe_cast_to_integer
May 19, 2017
1726408
Fixing pytables test
May 20, 2017
1f8e9b7
Fixing small issue on base index
May 20, 2017
83cfc5d
Fixing last issue on base index
May 20, 2017
417188a
Fixing another issue on base index
May 20, 2017
939ae11
Adjusting coercion tests in indexing
May 20, 2017
86e9d5e
Fixing linting problems
May 20, 2017
359086d
Adding all @jreback comments
May 20, 2017
50950f5
Adding tests for all ints and adjust pytables test
May 22, 2017
012fb57
Implementing final adjustments from @jreback and @gfyoung
May 29, 2017
35a5ff1
Merge branch 'master' into bug-fix-15832
May 29, 2017
b1e6632
Adding final comments from @gfyoung
May 29, 2017
a1033cb
Merge branch 'bug-fix-15832' of github.com:ucals/pandas into bug-fix-…
May 29, 2017
b78f4cc
Adding final comments from @jreback
May 30, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ Backwards incompatible API changes

Other API Changes
^^^^^^^^^^^^^^^^^

- Series and Index constructors now raises when data is incompatible with a passed dtype= kwarg (:issue:`15832`)
- Moved definition of ``MergeError`` to the ``pandas.errors`` module.


Expand Down
54 changes: 54 additions & 0 deletions pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
is_timedelta64_dtype, is_dtype_equal,
is_float_dtype, is_complex_dtype,
is_integer_dtype,
is_unsigned_integer_dtype,
is_datetime_or_timedelta_dtype,
is_bool_dtype, is_scalar,
_string_dtypes,
Expand Down Expand Up @@ -1026,3 +1027,56 @@ def find_common_type(types):
return np.object

return np.find_common_type(types, [])


def maybe_cast_to_integer_array(arr, dtype, copy=False):
"""
Takes any dtype and returns the casted version, raising for when data is
incompatible with integer/unsigned integer dtypes.

.. versionadded:: 0.21.0

Parameters
----------
arr : ndarray
dtype : np.dtype
copy: boolean, default False

Returns
-------
integer or unsigned integer array

Raises
------
OverflowError
* If ``dtype`` is incompatible
ValueError
* If coercion from float to integer loses precision

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spirit of good documentation, let's add some examples here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Nice examples.

Examples
--------
If you try to coerce negative values to unsigned integers, it raises:

>>> Series([-1], dtype='uint64')
Traceback (most recent call last):
...
OverflowError: Trying to coerce negative values to unsigned integers

Also, if you try to coerce float values to integers, it raises:
>>> Series([1, 2, 3.5], dtype='int64')
Traceback (most recent call last):
...
ValueError: Trying to coerce float values to integers

"""
casted = arr.astype(dtype, copy=copy)
if np.array(arr == casted).all():
return casted

if is_unsigned_integer_dtype(dtype) and (arr < 0).any():
raise OverflowError("Trying to coerce negative values to unsigned "
"integers")

if is_integer_dtype(dtype) and (is_float_dtype(arr) or
is_object_dtype(arr)):
raise ValueError("Trying to coerce float values to integers")
9 changes: 7 additions & 2 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@

from pandas.core.dtypes.generic import ABCSeries, ABCMultiIndex, ABCPeriodIndex
from pandas.core.dtypes.missing import isnull, array_equivalent
from pandas.core.dtypes.cast import maybe_cast_to_integer_array
from pandas.core.dtypes.common import (
_ensure_int64,
_ensure_object,
Expand Down Expand Up @@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
if is_integer_dtype(dtype):
inferred = lib.infer_dtype(data)
if inferred == 'integer':
data = np.array(data, copy=copy, dtype=dtype)
data = maybe_cast_to_integer_array(data, dtype,
copy=copy)
elif inferred in ['floating', 'mixed-integer-float']:
if isnull(data).any():
raise ValueError('cannot convert float '
'NaN to integer')
if inferred == 'mixed-integer-float':
maybe_cast_to_integer_array(data, dtype)

# If we are actually all equal to integers,
# then coerce to integer.
Expand Down Expand Up @@ -246,7 +250,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,

except (TypeError, ValueError) as e:
msg = str(e)
if 'cannot convert float' in msg:
if ('cannot convert float' in msg or
'Trying to coerce float values to integer') in msg:
raise

# maybe coerce to a sub-class
Expand Down
7 changes: 6 additions & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,8 @@
from pandas.core.dtypes.cast import (
maybe_upcast, infer_dtype_from_scalar,
maybe_convert_platform,
maybe_cast_to_datetime, maybe_castable)
maybe_cast_to_datetime, maybe_castable,
maybe_cast_to_integer_array)
from pandas.core.dtypes.missing import isnull, notnull

from pandas.core.common import (is_bool_indexer,
Expand Down Expand Up @@ -2941,9 +2942,13 @@ def _try_cast(arr, take_fast_path):
return arr

try:
if is_float_dtype(dtype) or is_integer_dtype(dtype):
subarr = maybe_cast_to_integer_array(np.asarray(arr), dtype)

subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to move this entire section (e.g. _try_cast) to pandas/types/cast. but I will do that after

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to move part/all of this ok with that as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

except (ValueError, TypeError):
if is_categorical_dtype(dtype):
subarr = Categorical(arr)
Expand Down
21 changes: 21 additions & 0 deletions pandas/tests/indexes/test_numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,20 @@ def test_astype(self):
i = Float64Index([0, 1.1, np.NAN])
pytest.raises(ValueError, lambda: i.astype(dtype))

@pytest.mark.parametrize("int_dtype", ['uint8', 'uint16', 'uint32',
'uint64', 'int32', 'int64', 'int16',
'int8'])
@pytest.mark.parametrize("float_dtype", ['float16', 'float32'])
def test_type_coercion(self, int_dtype, float_dtype):

# GH 15832
msg = 'Trying to coerce float values to integers'
with tm.assert_raises_regex(ValueError, msg):
Index([1, 2, 3.5], dtype=int_dtype)

i = Index([1, 2, 3.5], dtype=float_dtype)
tm.assert_index_equal(i, Index([1, 2, 3.5]))

def test_equals_numeric(self):

i = Float64Index([1.0, 2.0])
Expand Down Expand Up @@ -678,6 +692,13 @@ def test_constructor_corner(self):
with tm.assert_raises_regex(TypeError, 'casting'):
Int64Index(arr_with_floats)

@pytest.mark.parametrize("uints", ['uint8', 'uint16', 'uint32', 'uint64'])
def test_constructor_overflow_coercion_signed_to_unsigned(self, uints):
# GH 15832
msg = 'Trying to coerce negative values to unsigned integers'
with tm.assert_raises_regex(OverflowError, msg):
Index([-1], dtype=uints)

def test_coerce_list(self):
# coerce things
arr = Index([1, 2, 3, 4])
Expand Down
5 changes: 3 additions & 2 deletions pandas/tests/io/test_pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -2080,7 +2080,7 @@ def test_table_values_dtypes_roundtrip(self):
assert df1.dtypes[0] == 'float32'

# check with mixed dtypes
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
df1 = DataFrame(dict([(c, Series(np.random.randn(5).astype(c)))
for c in ['float32', 'float64', 'int32',
'int64', 'int16', 'int8']]))
df1['string'] = 'foo'
Expand All @@ -2094,7 +2094,8 @@ def test_table_values_dtypes_roundtrip(self):
result = store.select('df_mixed_dtypes1').get_dtype_counts()
expected = Series({'float32': 2, 'float64': 1, 'int32': 1,
'bool': 1, 'int16': 1, 'int8': 1,
'int64': 1, 'object': 1, 'datetime64[ns]': 2})
'int64': 1, 'object': 1,
'datetime64[ns]': 2})
result = result.sort_index()
result = expected.sort_index()
tm.assert_series_equal(result, expected)
Expand Down
49 changes: 34 additions & 15 deletions pandas/tests/series/test_constructors.py
Original file line number Diff line number Diff line change
@@ -1,30 +1,26 @@
# coding=utf-8
# pylint: disable-msg=E1101,W0612

import pytest

from datetime import datetime, timedelta

from numpy import nan
import numpy as np
import numpy.ma as ma
import pandas as pd

from pandas.core.dtypes.common import (
is_categorical_dtype,
is_datetime64tz_dtype)
import pytest
from numpy import nan
from pandas import (Index, Series, isnull, date_range,
NaT, period_range, MultiIndex, IntervalIndex)
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex
from pandas import compat
from pandas.compat import lrange, range, zip, OrderedDict, long

import pandas.util.testing as tm
from pandas._libs import lib
from pandas._libs.tslib import iNaT

from pandas.compat import lrange, range, zip, OrderedDict, long
from pandas import compat
from pandas.core.dtypes.common import (
is_categorical_dtype,
is_datetime64tz_dtype)
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex
from pandas.util.testing import assert_series_equal
import pandas.util.testing as tm

from .common import TestData


Expand Down Expand Up @@ -301,12 +297,35 @@ def test_constructor_pass_nan_nat(self):
tm.assert_series_equal(Series(np.array([np.nan, pd.NaT])), exp)

def test_constructor_cast(self):
Copy link
Member

@gfyoung gfyoung May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize pytest.mark.parametrize for cleaner tests.

@pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
...

@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self:
...

Also, I prefer if we can use tm.assert_raises_regex to specify specific error messages instead of just testing for the Exception type. That's a stronger test IMO. The test above:

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)

I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals : You broke up the tests, but your new tests still use pytest.raises when tm.assert_raises_regex would be more useful.

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)
msg = "could not convert string to float"
with tm.assert_raises_regex(ValueError, msg):
Series(['a', 'b', 'c'], dtype=float)

@pytest.mark.parametrize("unsigned_integers", ['uint8', 'uint16', 'uint32',
'uint64'])
def test_constructor_unsigned_dtype_overflow(self, unsigned_integers):
# GH 15832
msg = 'Trying to coerce negative values to unsigned integers'
with tm.assert_raises_regex(OverflowError, msg):
Series([-1], dtype=unsigned_integers)

@pytest.mark.parametrize("integers", ['uint8', 'uint16', 'uint32',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use some more informative names like int_dtype (and float_dtype, etc. for tests where you parametrized on dtype).

'uint64', 'int32', 'int64', 'int16',
'int8'])
@pytest.mark.parametrize("floats", ['float16', 'float32'])
def test_constructor_coerce_float_fail(self, integers, floats):
# GH 15832
msg = 'Trying to coerce float values to integers'
with tm.assert_raises_regex(ValueError, msg):
Series([1, 2, 3.5], dtype=integers)

s = Series([1, 2, 3.5], dtype=floats)
expected = Series([1, 2, 3.5]).astype(floats)
assert_series_equal(s, expected)

def test_constructor_dtype_nocast(self):
# 1572
s = Series([1, 2, 3])

s2 = Series(s, dtype=np.int64)

s2[1] = 5
Expand Down