BUG: ensuring that np.asarray() simple handles data as objects and doesn't… #22161

Merged
merged 6 commits into from
Aug 10, 2018
Changes from 5 commits
3 changes: 1 addition & 2 deletions doc/source/whatsnew/v0.24.0.txt
@@ -685,6 +685,5 @@ Other
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` (:issue:`15204`)
-
-
- :meth:`pandas.core.algorithms.isin` avoids spurious casting for lists (:issue:`22160`)
Contributor

Is this user-visible?

Contributor Author

@jreback Only when the user calls pandas.core.algorithms.isin directly; in that case the wrong behavior from #22160 is fixed. There is, however, no difference if isin is used via Series or Index: the values are already in a np.array, so the bug isn't triggered.

Contributor

OK, this is an internal routine, so removing this whatsnew note is fine.

-
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
@@ -134,7 +134,7 @@ def _ensure_data(values, dtype=None):
         return values, dtype, 'int64'

     # we have failed, return object
-    values = np.asarray(values)
+    values = np.asarray(values, dtype=np.object)
Contributor

So we should probably use pandas.core.dtypes.cast.construct_1d_ndarray_preserving_na, which is even better here. Also, please run the performance suite for things like factorize, value_counts, and isin; this is a very performance-sensitive section.

Contributor Author

@realead realead Aug 3, 2018

@jreback Actually, pandas.core.dtypes.cast.construct_1d_ndarray_preserving_na would not work for two reasons:

  1. For [42, 's'] it returns array(['42', 's'], dtype='<U11') rather than the desired array([42, 's'], dtype=object); I'm not sure whether this is the intended behavior of the function, though.
  2. For [np.nan] it returns array([nan], dtype=float64), so result[0] is np.nan evaluates to False, but we would like to preserve the identity of the object.
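
The two points above can be reproduced with plain NumPy, without going through pandas at all. The following is only a minimal sketch of the underlying casting behavior, not the pandas code path itself; the variable names are illustrative, and the builtin object is used in place of the np.object alias from the diff because newer NumPy releases no longer provide that alias.

```python
import numpy as np

mixed = [42, "s"]

# Without an explicit dtype, NumPy infers a common dtype for all
# elements and ends up with a unicode string dtype, so 42 becomes '42'.
inferred = np.asarray(mixed)

# With dtype=object, each element is stored as the original Python object.
as_object = np.asarray(mixed, dtype=object)

print(inferred.dtype.kind)      # 'U' (a unicode string dtype)
print(type(as_object[0]))       # <class 'int'>

# The same applies to nan: a float64 array hands back a new scalar
# on indexing, while an object array keeps the very same object.
nan_inferred = np.asarray([np.nan])
nan_object = np.asarray([np.nan], dtype=object)

print(nan_inferred[0] is np.nan)  # False
print(nan_object[0] is np.nan)    # True
```

The exact width of the inferred string dtype (for example '<U11' versus '<U21') depends on the platform and NumPy version, which is why the sketch only inspects dtype.kind.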

     return ensure_object(values), 'object', 'object'


62 changes: 62 additions & 0 deletions pandas/tests/test_algos.py
@@ -615,6 +615,68 @@ def test_categorical_from_codes(self):
         result = algos.isin(Sd, St)
         tm.assert_numpy_array_equal(expected, result)

+    def test_same_nan_is_in(self):
+        # GH 22160
+        # nan is special, because "a is b" doesn't imply "a == b"
+        # at least, isin() should follow python's "np.nan in [nan] == True"
+        # casting to -> np.float64 -> another float-object somewhere on
+        # the way could jeopardize this behavior
+        comps = [np.nan]  # could be cast to float64
+        values = [np.nan]
+        expected = np.array([True])
+        result = algos.isin(comps, values)
+        tm.assert_numpy_array_equal(expected, result)
+
+    def test_same_object_is_in(self):
+        # GH 22160
+        # there could be special treatment for nans
+        # the user however could define a custom class
+        # with similar behavior, then we at least should
+        # fall back to usual python's behavior: "a in [a] == True"
+        class LikeNan(object):
+            def __eq__(self, other):
+                return False
+
+            def __hash__(self):
+                return 0
+
+        a, b = LikeNan(), LikeNan()
+        # same object -> True
+        tm.assert_numpy_array_equal(algos.isin([a], [a]), np.array([True]))
+        # different objects -> False
+        tm.assert_numpy_array_equal(algos.isin([a], [b]), np.array([False]))
+
+    def test_different_nans(self):
+        # GH 22160
+        # all nans are handled as equivalent
+
+        comps = [float('nan')]
+        values = [float('nan')]
+        assert comps[0] is not values[0]  # different nan-objects
+
+        # as list of python-objects:
+        result = algos.isin(comps, values)
+        tm.assert_numpy_array_equal(np.array([True]), result)
+
+        # as object-array:
+        result = algos.isin(np.asarray(comps, dtype=np.object),
+                            np.asarray(values, dtype=np.object))
+        tm.assert_numpy_array_equal(np.array([True]), result)
+
+        # as float64-array:
+        result = algos.isin(np.asarray(comps, dtype=np.float64),
+                            np.asarray(values, dtype=np.float64))
+        tm.assert_numpy_array_equal(np.array([True]), result)
+
+    def test_no_cast(self):
+        # GH 22160
+        # ensure 42 is not cast to a string
+        comps = ['ss', 42]
+        values = ['42']
+        expected = np.array([False, False])
+        result = algos.isin(comps, values)
+        tm.assert_numpy_array_equal(expected, result)
+
     @pytest.mark.parametrize("empty", [[], Series(), np.array([])])
     def test_empty(self, empty):
         # see gh-16991
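
The Python behavior that test_same_nan_is_in and test_same_object_is_in refer to ("a in [a] == True") can be sketched with the standard library alone. This is only an illustration of the language semantics the tests rely on, not of how pandas implements isin; the result variable names are made up for the example, and the LikeNan class here simply mirrors the one in the test.

```python
nan = float("nan")

# nan never compares equal, not even to itself ...
nan_equals_itself = (nan == nan)

# ... but list containment checks identity ("is") before equality ("=="),
# so the very same object is always found.
same_nan_found = nan in [nan]

# A different nan object is neither identical nor equal.
other_nan_found = float("nan") in [nan]


class LikeNan:
    """Mirrors the test class: never equal to anything, constant hash."""

    def __eq__(self, other):
        return False

    def __hash__(self):
        return 0


a, b = LikeNan(), LikeNan()
same_object_found = a in [a]    # identity match
other_object_found = a in [b]   # no identity match, and == is False

print(nan_equals_itself, same_nan_found, other_nan_found)  # False True False
print(same_object_found, other_object_found)               # True False
```

Note that test_different_nans deliberately goes beyond this plain Python behavior: it expects isin to treat distinct nan objects as equivalent.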