Skip to content

Gh 36562 typeerror comparison not supported between float and str #37096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Nov 4, 2020
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
1320ff1
Fixed #36562
ssche Oct 13, 2020
a1b9385
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 13, 2020
1f76d21
Fixed flake issue
ssche Oct 13, 2020
7076841
fixed linting errors
ssche Oct 13, 2020
225675d
Addressed review comments
ssche Oct 14, 2020
2677166
Make `sort_tuples` last resort sorting strategy to allow faster sorte…
ssche Oct 16, 2020
6f476bc
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 16, 2020
3688238
manually apply pandas black changes (from CI)
ssche Oct 16, 2020
aba429c
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 16, 2020
8ae9279
Modified changelog message as per review comments
ssche Oct 21, 2020
dd5a38d
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 21, 2020
7b7d6f8
Removed pd.xyz
ssche Oct 22, 2020
4226662
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 22, 2020
8cbfa01
forgot `pd`
ssche Oct 22, 2020
6d71000
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 26, 2020
92e1e33
Changed sorting algorithm by using expanded array
ssche Oct 31, 2020
3ada9ce
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 31, 2020
95670f1
Add ticket ref to test case
ssche Oct 31, 2020
d9dcd22
sort import
ssche Oct 31, 2020
66c30c7
Address review comments
ssche Oct 31, 2020
db66528
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Oct 31, 2020
16ae4f4
Merge remote-tracking branch 'upstream/master' into gh-36562-typeerro…
ssche Nov 3, 2020
9d03dc3
Fixed: committed file was in merge state
ssche Nov 3, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,7 @@ MultiIndex

- Bug in :meth:`DataFrame.xs` when used with :class:`IndexSlice` raises ``TypeError`` with message ``"Expected label or tuple of labels"`` (:issue:`35301`)
- Bug in :meth:`DataFrame.reset_index` with ``NaT`` values in index raises ``ValueError`` with message ``"cannot convert float NaN to integer"`` (:issue:`36541`)
- Bug in :meth:`DataFrame.combine_first` when used with :class:`MultiIndex` containing string and ``NaN`` values raises ``TypeError`` with message ``"'<' not supported between instances of 'float' and 'str'"`` (:issue:`36562`)
-

I/O
Expand Down
48 changes: 46 additions & 2 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
"""
from __future__ import annotations

import functools
import operator
from textwrap import dedent
from typing import TYPE_CHECKING, Dict, Optional, Tuple, Union, cast
Expand Down Expand Up @@ -2055,20 +2056,63 @@ def sort_mixed(values):
strs = np.sort(values[str_pos])
return np.concatenate([nums, np.asarray(strs, dtype=object)])

def sort_tuples(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you simply try to remove the nan's from values and add them to the end is fine and just fix up sort_mixed, we do this in a number of places e.g. something like

In [204]: arr = np.array(['foo', 3, np.nan], dtype=object)                                                                                                                                                             

In [205]: mask = pd.notna(arr)                                                                                                                                                                                         

In [206]: arr.take(np.arange(len(arr))[mask])                                                                                                                                                                          
Out[206]: array(['foo', 3], dtype=object)

Copy link
Contributor Author

@ssche ssche Oct 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started off with trying to fix sort_mixed and removed nans as you suggested and would have loved to use it instead. however the structure of the array doesn't allow for that (we are talking about a 1-d array of tuples, i.e. dtype object, not an N-d array which would allow for all those vector operations to be applicable).

here's the data of values that gets passed in to sort_tuples() when running the test I added:

In[2]: 
values
Out[2]: 
array([('b', 1), ('b', 2), ('c', 3), ('a', 4), ('b', 5), (nan, 6),
       ('a', 1), ('c', 1), ('d', 1)], dtype=object)
values[0]
Out[3]: ('b', 1)
values.shape
Out[4]: (9,)

I could convert value to a N-d array (and use sort_mixed), but that solution would come with its own overhead costs...

In[5]: 
np.asarray(list(values))
Out[5]: 
array([['b', '1'],
       ['b', '2'],
       ['c', '3'],
       ['a', '4'],
       ['b', '5'],
       ['nan', '6'],
       ['a', '1'],
       ['c', '1'],
       ['d', '1']], dtype='<U3')

your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any feedback on the above, @jreback?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i care about code complexity here, this is adding a lot, try this

# sorts tuples with mixed values. can handle nan vs string comparisons.
def cmp_func(index_x, index_y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could do cmp_func(left, right)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's change the name and please add typing for cmp_func and sort_tuples. (sort_mixed if you can as well :->)

x = values[index_x]
y = values[index_y]
# shortcut loop in case both tuples are the same
Copy link
Contributor

@jreback jreback Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to do any of this

see safe_sort

Copy link
Contributor Author

@ssche ssche Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you saying? what isn't needed?

this PR is changing safe_sort to accommodate mixed-type tuples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and how does this not do it now!

iirc this is adding a lot of non performing code for the purpose of fixing the error message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and how does this not do it now!

now a TypeError will be raised in df.combine_first when the df contains a MultiIndex with nan and string values (sort_mixed, which is used now, fails in that case).

iirc this is adding a lot of non performing code for the purpose of fixing the error message?

in the event of all other (more performant) sorters failing, this (unarguably slower) sorter will be used. this should not compromise any other use case's performance, but at least makes the code in the ticket description succeed (slower, but at least not failing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I moved the slow sort_tuples method so it really isn't called when any of the other (faster) sorters can work (2677166#diff-c8f3ad29eaf121537b999e88e9117f3e3702d0b818a67516da25093fe2890ce8R2114). please have another look and provide feedback.

if x == y:
return 0
# lexicographic sorting
for i in range(max(len(x), len(y))):
# check if the tuples have different lengths (shorter tuples
# first)
if i >= len(x):
return -1
if i >= len(y):
return +1
x_is_na = isna(x[i])
y_is_na = isna(y[i])
# values are the same -> resolve tie with next element
if (x_is_na and y_is_na) or (x[i] == y[i]):
continue
# check for nan values (sort nan to the end)
if x_is_na and not y_is_na:
return +1
if not x_is_na and y_is_na:
return -1
# normal greater/less than comparison
if x[i] < y[i]:
return -1
return +1
# both values are the same (should already have been caught)
return 0

ixs = np.arange(len(values))
ixs = sorted(ixs, key=functools.cmp_to_key(cmp_func))
return values[ixs]

sorter = None

if (
not is_extension_array_dtype(values)
and lib.infer_dtype(values, skipna=False) == "mixed-integer"
):
# unorderable in py3 if mixed str/int
ordered = sort_mixed(values)
else:
try:
sorter = values.argsort()
ordered = values.take(sorter)
except TypeError:
# try this anyway
ordered = sort_mixed(values)
try:
ordered = sort_mixed(values)
except TypeError:
if values.size and isinstance(values[0], tuple):
ordered = sort_tuples(values)
else:
raise

# codes:

Expand Down
27 changes: 27 additions & 0 deletions pandas/tests/indexing/multiindex/test_multiindex.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,3 +91,30 @@ def test_multiindex_get_loc_list_raises(self):
msg = "unhashable type"
with pytest.raises(TypeError, match=msg):
idx.get_loc([])


def test_combine_first_with_nan_multiindex():
mi1 = pd.MultiIndex.from_arrays(
[["b", "b", "c", "a", "b", np.nan], [1, 2, 3, 4, 5, 6]], names=["a", "b"]
)
df = pd.DataFrame({"c": [1, 1, 1, 1, 1, 1]}, index=mi1)
mi2 = pd.MultiIndex.from_arrays(
[["a", "b", "c", "a", "b", "d"], [1, 1, 1, 1, 1, 1]], names=["a", "b"]
)
s = pd.Series([1, 2, 3, 4, 5, 6], index=mi2)
res = df.combine_first(pd.DataFrame({"d": s}))
mi_expected = pd.MultiIndex.from_arrays(
[
["a", "a", "a", "b", "b", "b", "b", "c", "c", "d", np.nan],
[1, 1, 4, 1, 1, 2, 5, 1, 3, 1, 6],
],
names=["a", "b"],
)
expected = pd.DataFrame(
{
"c": [np.nan, np.nan, 1, 1, 1, 1, 1, np.nan, 1, np.nan, 1],
"d": [1.0, 4.0, np.nan, 2.0, 5.0, np.nan, np.nan, 3.0, np.nan, 6.0, np.nan],
},
index=mi_expected,
)
tm.assert_frame_equal(res, expected)
7 changes: 7 additions & 0 deletions pandas/tests/test_sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -453,3 +453,10 @@ def test_extension_array_codes(self, verify, na_sentinel):
expected_codes = np.array([0, 2, na_sentinel, 1], dtype=np.intp)
tm.assert_extension_array_equal(result, expected_values)
tm.assert_numpy_array_equal(codes, expected_codes)


def test_mixed_str_nan():
values = np.array(["b", np.nan, "a", "b"], dtype=object)
result = safe_sort(values)
expected = np.array([np.nan, "a", "b", "b"], dtype=object)
tm.assert_numpy_array_equal(result, expected)