Skip to content

API/BUG: Handling Dtype Coercions in Series/Index (GH 15832) #15859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from
Closed
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
676a4e5
Test
Mar 20, 2017
e12bca7
Sync fork
Mar 24, 2017
9fc617b
Merge remote-tracking branch 'upstream/master'
Mar 26, 2017
8b463cb
Merge remote-tracking branch 'upstream/master'
Mar 26, 2017
43456a5
Merge remote-tracking branch 'upstream/master'
Mar 28, 2017
faa5c5c
Merge remote-tracking branch 'upstream/master'
Mar 29, 2017
1c90e7e
Merge remote-tracking branch 'upstream/master'
Apr 1, 2017
278c2fb
Merge remote-tracking branch 'upstream/master'
Apr 22, 2017
a8cd752
Merge remote-tracking branch 'upstream/master'
May 16, 2017
bbdea4b
Adding failing tests
Apr 22, 2017
d2e26ac
Code passing new tests from issue GH 15832 (but breaking 2 other test…
May 16, 2017
3c868a4
Fixing 1 of 2 new issues: now only raises if trying to coerce and dty…
May 17, 2017
20ac5c6
Fixing 2nd issue: now it raises only if the coercion from float to in…
May 17, 2017
14ed83b
Fixing broken tests
May 19, 2017
3d0e76f
Adding maybe_cast_to_integer
May 19, 2017
1726408
Fixing pytables test
May 20, 2017
1f8e9b7
Fixing small issue on base index
May 20, 2017
83cfc5d
Fixing last issue on base index
May 20, 2017
417188a
Fixing another issue on base index
May 20, 2017
939ae11
Adjusting coercion tests in indexing
May 20, 2017
86e9d5e
Fixing linting problems
May 20, 2017
359086d
Adding all @jreback comments
May 20, 2017
50950f5
Adding tests for all ints and adjust pytables test
May 22, 2017
012fb57
Implementing final adjustments from @jreback and @gfyoung
May 29, 2017
35a5ff1
Merge branch 'master' into bug-fix-15832
May 29, 2017
b1e6632
Adding final comments from @gfyoung
May 29, 2017
a1033cb
Merge branch 'bug-fix-15832' of github.com:ucals/pandas into bug-fix-…
May 29, 2017
b78f4cc
Adding final comments from @jreback
May 30, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
is_timedelta64_dtype, is_dtype_equal,
is_float_dtype, is_complex_dtype,
is_integer_dtype,
is_unsigned_integer_dtype,
is_datetime_or_timedelta_dtype,
is_bool_dtype, is_scalar,
_string_dtypes,
Expand Down Expand Up @@ -1026,3 +1027,37 @@ def find_common_type(types):
return np.object

return np.find_common_type(types, [])


def maybe_cast_to_integer(arr, dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to maybe_cast_to_integer_array

Copy link
Member

@gfyoung gfyoung May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : Perhaps we could just generalize to accept scalar as well as array (as consistent with other casting functions in this module) and keep the name as is?

"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually can take any dtype and will return a casted version. It happens to check for integer/unsigned integer dtypes. So pls expand a little.

Takes an integer dtype and returns the casted version, raising for an
incompatible dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag (even though this is private, nice to have)


Parameters
----------
arr : ndarray
dtype : np.dtype

Returns
-------
integer or unsigned integer array

Raises
------
OverflowError
* If ``dtype`` is incompatible
ValueError
* If coercion from float to integer loses precision

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spirit of good documentation, let's add some examples here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Nice examples.

"""

if is_unsigned_integer_dtype(dtype) and (arr < 0).any():
raise OverflowError("Trying to coerce negative values to negative "
"integers")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this then needs to cast, no?

e.g. (before the return)

arr = arr.astype(dtype, copy=False)

elif is_integer_dtype(dtype) and (is_float_dtype(arr) or
is_object_dtype(arr)):
if not (arr == arr.astype(dtype)).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do

casted = arr.astype(dtype, copy=False)
if (arr == casted).all():
    return casted
raise ValueError(......)

a bit more code but then avoid coercion twice.

raise ValueError("Trying to coerce float values to integers")

return arr.astype(dtype, copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should cast here. This will coerce non-integer arrays to the passed dtype, which may not be nice.

instead branch on each of the if's (the integer and unsigned cases) and do an astype or raise).

so this will then explicity only work on integer dtypes that are passed, and not implicity on anything else

6 changes: 6 additions & 0 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@

from pandas.core.dtypes.generic import ABCSeries, ABCMultiIndex, ABCPeriodIndex
from pandas.core.dtypes.missing import isnull, array_equivalent
from pandas.core.dtypes.cast import maybe_cast_to_integer
from pandas.core.dtypes.common import (
_ensure_int64,
_ensure_object,
Expand Down Expand Up @@ -212,11 +213,14 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
if is_integer_dtype(dtype):
inferred = lib.infer_dtype(data)
if inferred == 'integer':
data = maybe_cast_to_integer(data, dtype=dtype)
data = np.array(data, copy=copy, dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the np.array casting line now

elif inferred in ['floating', 'mixed-integer-float']:
if isnull(data).any():
raise ValueError('cannot convert float '
'NaN to integer')
if inferred == 'mixed-integer-float':
maybe_cast_to_integer(data, dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be assigned?


# If we are actually all equal to integers,
# then coerce to integer.
Expand Down Expand Up @@ -248,6 +252,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None,
msg = str(e)
if 'cannot convert float' in msg:
raise
if 'Trying to coerce float values to integer' in msg:
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just do an or statement? That seems more compact.


# maybe coerce to a sub-class
from pandas.core.indexes.period import (
Expand Down
7 changes: 6 additions & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,8 @@
from pandas.core.dtypes.cast import (
maybe_upcast, infer_dtype_from_scalar,
maybe_convert_platform,
maybe_cast_to_datetime, maybe_castable)
maybe_cast_to_datetime, maybe_castable,
maybe_cast_to_integer)
from pandas.core.dtypes.missing import isnull, notnull

from pandas.core.common import (is_bool_indexer,
Expand Down Expand Up @@ -2916,9 +2917,13 @@ def _try_cast(arr, take_fast_path):
return arr

try:
if is_float_dtype(dtype) or is_integer_dtype(dtype):
subarr = maybe_cast_to_integer(np.asarray(arr), dtype)

subarr = maybe_cast_to_datetime(arr, dtype)
if not is_extension_type(subarr):
subarr = np.array(subarr, dtype=dtype, copy=copy)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to move this entire section (e.g. _try_cast) to pandas/types/cast. but I will do that after

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to move part/all of this ok with that as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

except (ValueError, TypeError):
if is_categorical_dtype(dtype):
subarr = Categorical(arr)
Expand Down
18 changes: 18 additions & 0 deletions pandas/tests/indexes/test_numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -304,6 +304,18 @@ def test_astype(self):
i = Float64Index([0, 1.1, np.NAN])
pytest.raises(ValueError, lambda: i.astype(dtype))

# GH 15832
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in a separate test

for t in ['uint8', 'uint16', 'uint32', 'uint64', 'int32', 'int64',
'int16', 'int8']:
with pytest.raises(ValueError):
Index([1, 2, 3.5], dtype=t)

try:
for t in ['float16', 'float32']:
Index([1, 2, 3.5], dtype=t)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to catch the exception
rather compare against an expected index

pytest.fail("GH 15832 should not raise for float type")

def test_equals_numeric(self):

i = Float64Index([1.0, 2.0])
Expand Down Expand Up @@ -678,6 +690,12 @@ def test_constructor_corner(self):
with tm.assert_raises_regex(TypeError, 'casting'):
Int64Index(arr_with_floats)

def test_constructor_overflow_coercion_signed_to_unsigned(self):
# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
Copy link
Member

@gfyoung gfyoung May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's utilize pytest.mark.parametrize because we can now 😄

Copy link
Member

@gfyoung gfyoung May 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals : This one has not yet been fully addressed (see test_numeric.py).

with pytest.raises(OverflowError):
Index([-1], dtype=t)

def test_coerce_list(self):
# coerce things
arr = Index([1, 2, 3, 4])
Expand Down
5 changes: 3 additions & 2 deletions pandas/tests/io/test_pytables.py
Original file line number Diff line number Diff line change
Expand Up @@ -2080,7 +2080,7 @@ def test_table_values_dtypes_roundtrip(self):
assert df1.dtypes[0] == 'float32'

# check with mixed dtypes
df1 = DataFrame(dict([(c, Series(np.random.randn(5), dtype=c))
df1 = DataFrame(dict([(c, Series(np.random.randn(5).astype(c)))
for c in ['float32', 'float64', 'int32',
'int64', 'int16', 'int8']]))
df1['string'] = 'foo'
Expand All @@ -2094,7 +2094,8 @@ def test_table_values_dtypes_roundtrip(self):
result = store.select('df_mixed_dtypes1').get_dtype_counts()
expected = Series({'float32': 2, 'float64': 1, 'int32': 1,
'bool': 1, 'int16': 1, 'int8': 1,
'int64': 1, 'object': 1, 'datetime64[ns]': 2})
'int64': 1, 'object': 1,
'datetime64[ns]': 2})
result = result.sort_index()
result = expected.sort_index()
tm.assert_series_equal(result, expected)
Expand Down
40 changes: 27 additions & 13 deletions pandas/tests/series/test_constructors.py
Original file line number Diff line number Diff line change
@@ -1,30 +1,26 @@
# coding=utf-8
# pylint: disable-msg=E1101,W0612

import pytest

from datetime import datetime, timedelta

from numpy import nan
import numpy as np
import numpy.ma as ma
import pandas as pd

from pandas.core.dtypes.common import (
is_categorical_dtype,
is_datetime64tz_dtype)
import pytest
from numpy import nan
from pandas import (Index, Series, isnull, date_range,
NaT, period_range, MultiIndex, IntervalIndex)
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex
from pandas import compat
from pandas.compat import lrange, range, zip, OrderedDict, long

import pandas.util.testing as tm
from pandas._libs import lib
from pandas._libs.tslib import iNaT

from pandas.compat import lrange, range, zip, OrderedDict, long
from pandas import compat
from pandas.core.dtypes.common import (
is_categorical_dtype,
is_datetime64tz_dtype)
from pandas.core.indexes.datetimes import Timestamp, DatetimeIndex
from pandas.util.testing import assert_series_equal
import pandas.util.testing as tm

from .common import TestData


Expand Down Expand Up @@ -303,9 +299,27 @@ def test_constructor_pass_nan_nat(self):
def test_constructor_cast(self):
Copy link
Member

@gfyoung gfyoung May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test name isn't particularly informative. Let's break this test up so that we can check for specific errors and also utilize pytest.mark.parametrize for cleaner tests.

@pytest.mark.parametrize(...)
def test_constructor_unsigned_dtype_overflow(self):
...

@pytest.mark.parametrize(...)
def test_constructor_coerce_float_fail(self:
...

Also, I prefer if we can use tm.assert_raises_regex to specify specific error messages instead of just testing for the Exception type. That's a stronger test IMO. The test above:

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)

I can begin to see / understand why it fails, but what is the exact reason where it is breaking? If it is separate from the ones you added, let's make that a test of its own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ucals : You broke up the tests, but your new tests still use pytest.raises when tm.assert_raises_regex would be more useful.

pytest.raises(ValueError, Series, ['a', 'b', 'c'], dtype=float)

# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64']:
with pytest.raises(OverflowError):
Series([-1], dtype=t)

# GH 15832
for t in ['uint8', 'uint16', 'uint32', 'uint64', 'int32', 'int64',
'int16', 'int8']:
with pytest.raises(ValueError):
Series([1, 2, 3.5], dtype=t)

try:
for t in ['float16', 'float32']:
Series([1, 2, 3.5], dtype=t)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

pytest.fail("GH 15832 should not raise for float type")

def test_constructor_dtype_nocast(self):
# 1572
s = Series([1, 2, 3])
s = Series([1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dupe?


s2 = Series(s, dtype=np.int64)

Expand Down