BUG: pd.cut with duplicate index Series lowest included #42425

Closed
Changes from 4 commits
Commits (21)
47e3db1
BUG: pd.cut with duplicate index Series lowest inclued
debnathshoham Jul 7, 2021
a9b6e5f
REF: disambiguate get_loc_level k variable (#42378)
jbrockmendel Jul 7, 2021
0a9cfcf
CI: update vm image version for Azure (#42419)
fangchenli Jul 7, 2021
62147ea
added tests
debnathshoham Jul 8, 2021
f8bcd67
updated whatsnew
debnathshoham Jul 8, 2021
da4ea96
corrected # GH code to 42185
debnathshoham Jul 8, 2021
2500a23
DEPR: treating dt64 as UTC in Timestamp constructor (#42288)
jbrockmendel Jul 8, 2021
82eb380
PERF/REGR: revert #41785 (#42338)
jbrockmendel Jul 8, 2021
ec0fdb7
Update doc/source/whatsnew/v1.3.1.rst
debnathshoham Jul 8, 2021
35b338e
BUG: .loc failing to drop first level (#42435)
jbrockmendel Jul 8, 2021
8d64fe9
BUG: truncate has incorrect behavior when index has only one unique v…
neelmraman Jul 8, 2021
487aafb
CLN: clean doc validation script (#42436)
fangchenli Jul 8, 2021
afeb35e
Fix Formatting Issue (#42438)
9t8 Jul 8, 2021
29e6dc0
DOC: Add more instructions for updating the whatsnew (#42427)
rhshadrach Jul 8, 2021
4653b6a
DOC fix the incorrect doc style in 1.2.1 (#42386)
debnathshoham Jul 8, 2021
8240473
BUG: pd.cut with duplicate index Series lowest inclued
debnathshoham Jul 7, 2021
52f53fd
added tests
debnathshoham Jul 8, 2021
9cb8dbb
updated whatsnew
debnathshoham Jul 8, 2021
a74c2f6
corrected # GH code to 42185
debnathshoham Jul 8, 2021
57b5ddb
updated test func name and code
debnathshoham Jul 8, 2021
6498e50
made suggested changes
debnathshoham Jul 8, 2021
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.3.1.rst
@@ -24,7 +24,7 @@ Fixed regressions

Bug fixes
~~~~~~~~~
-
- Fixed bugs in :func:`pandas.cut` on :class:`Series` with duplicate indices (:issue:`42185`) and non-exact :class:`pandas.CategoricalIndex` (:issue:`42425`)
-

.. ---------------------------------------------------------------------------
5 changes: 4 additions & 1 deletion pandas/core/reshape/tile.py
@@ -421,7 +421,10 @@ def _bins_to_cuts(
    ids = ensure_platform_int(bins.searchsorted(x, side=side))

    if include_lowest:
        ids[x == bins[0]] = 1
        if isinstance(x, ABCSeries):
Review comment (Contributor):

try using `ids[np.asarray(x) == bins[0]] = 1` instead of the if/else

            ids[x.values == bins[0]] = 1
        else:
            ids[x == bins[0]] = 1

    na_mask = isna(x) | (ids == len(bins)) | (ids == 0)
    has_nas = na_mask.any()
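
The reviewer's suggestion can be illustrated outside pandas. The following is a minimal sketch, not the pandas implementation: `mark_lowest` is a hypothetical helper, and `side="left"` assumes the default right-closed bins. Converting `x` with `np.asarray` gives a purely positional boolean mask whether `x` is a Series (even one with duplicate index labels) or a plain ndarray, so no `isinstance` branch is needed.

```python
import numpy as np
import pandas as pd


def mark_lowest(x, bins):
    """Hypothetical helper: bin ids, with values equal to bins[0] moved into bin 1."""
    values = np.asarray(x)  # drops any index, so the comparison below is positional
    # side="left" corresponds to right-closed intervals (an assumption of this sketch)
    ids = np.searchsorted(bins, values, side="left")
    # Values equal to the lowest edge get id 0 from searchsorted; put them in bin 1.
    ids[values == bins[0]] = 1
    return ids


bins = np.array([0, 2, 4])
s = pd.Series([0, 1, 2, 3, 0], index=[0, 1, 2, 3, 0])  # label 0 appears twice

ids_from_series = mark_lowest(s, bins)
ids_from_array = mark_lowest(s.to_numpy(), bins)
print(ids_from_series)  # [1 1 1 2 1]
```

Both calls go through the same code path, which is the point of the suggestion: the index never takes part in the masking step.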
45 changes: 45 additions & 0 deletions pandas/tests/reshape/test_cut.py
@@ -691,3 +691,48 @@ def test_cut_no_warnings():
    labels = [f"{i} - {i + 9}" for i in range(0, 100, 10)]
    with tm.assert_produces_warning(False):
        df["group"] = cut(df.value, range(0, 105, 10), right=False, labels=labels)


def test_cut_with_duplicated_index_lowest_included():
    # GH 42185
    expected = Series(
        [Interval(-0.001, 2, closed="right")] * 3
        + [Interval(2, 4, closed="right"), Interval(-0.001, 2, closed="right")],
        index=[0, 1, 2, 3, 0],
        dtype="category",
    ).cat.as_ordered()

    s = Series([0, 1, 2, 3, 0], index=[0, 1, 2, 3, 0])
    result = cut(s, bins=[0, 2, 4], include_lowest=True)
    tm.assert_series_equal(result, expected)
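
For readers who want to try the scenario from this test through the public API, here is a small sketch. It assumes a pandas version in which GH 42185 is fixed, and it compares category codes rather than interval edges to stay independent of how the lowest edge is displayed.

```python
import pandas as pd

# Same data as the test above: label 0 appears twice in the index.
s = pd.Series([0, 1, 2, 3, 0], index=[0, 1, 2, 3, 0])

with_lowest = pd.cut(s, bins=[0, 2, 4], include_lowest=True)
without_lowest = pd.cut(s, bins=[0, 2, 4])

# Category codes: 0 is the first bin, 1 the second, -1 marks a missing value.
codes_with = with_lowest.cat.codes.tolist()
codes_without = without_lowest.cat.codes.tolist()

print(codes_with)     # [0, 0, 0, 1, 0]
print(codes_without)  # [-1, 0, 0, 1, -1]
```

Without `include_lowest=True`, the two zeros fall outside the right-closed first bin and come back as missing; with it, they land in the first bin and the duplicated index is preserved.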


def df_from_series_with_nonexact_categoricalindices_frompdcut():
Review comment (Contributor):

This won't execute because it's not named correctly; give it a `test_`-prefixed name similar to the one above.

    # GH 42424

    ser = Series(range(0, 100))
    ser1 = cut(ser, 10).value_counts().head(5)
    ser2 = cut(ser, 10).value_counts().tail(5)
    result = DataFrame({"1": ser1, "2": ser2})

    index = pd.CategoricalIndex(
        [
            Interval(-0.099, 9.9, closed="right"),
            Interval(9.9, 19.8, closed="right"),
            Interval(19.8, 29.7, closed="right"),
            Interval(29.7, 39.6, closed="right"),
            Interval(39.6, 49.5, closed="right"),
            Interval(49.5, 59.4, closed="right"),
            Interval(59.4, 69.3, closed="right"),
            Interval(69.3, 79.2, closed="right"),
            Interval(79.2, 89.1, closed="right"),
            Interval(89.1, 99, closed="right"),
        ],
        ordered=True,
    )

    expected = DataFrame(
        {"1": [10] * 5 + [np.nan] * 5, "2": [np.nan] * 5 + [10] * 5}, index=index
    )

    tm.assert_frame_equal(expected, result)
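
The naming remark in the review refers to pytest's default discovery rule, which only picks up functions whose names start with `test`. The sketch below imitates that rule with a hypothetical `default_collected` helper (it is not part of pytest) to show why a function named `df_from_series_...` would be silently skipped.

```python
def test_collected():
    assert True


def df_from_series_not_collected():
    assert True


def default_collected(namespace, prefix="test"):
    """Hypothetical stand-in for pytest's default function-name rule."""
    return sorted(
        name
        for name, obj in namespace.items()
        if callable(obj) and name.startswith(prefix)
    )


names = default_collected(
    {f.__name__: f for f in (test_collected, df_from_series_not_collected)}
)
print(names)  # ['test_collected']
```

Renaming the second test in the diff with a `test_` prefix, in the style of `test_cut_with_duplicated_index_lowest_included`, would make it run.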