-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: df.loc setitem-with-expansion with duplicate index #40096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @phofl |
pandas/tests/indexing/test_loc.py
Outdated
|
||
exp_index = index.insert(len(index), key) | ||
if isinstance(index, MultiIndex): | ||
exp_index = index.insert(len(index), key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not needed?
LGTM otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, thanks
df.loc["d", "A"] = 10 | ||
with pytest.raises(TypeError, match=msg): | ||
df.loc["d", "C"] = 10 | ||
# Setting-with-expansion with a new key "d" that is not among caegories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be worth splitting this test (e.g. errors to a new tests) and even these cases
expected = DataFrame(exp_data, index=exp_index, columns=[0]) | ||
|
||
# Add new row, but no new columns | ||
df = orig.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same might be worth splitting this up a bit
pandas/core/indexing.py
Outdated
@@ -1641,7 +1641,12 @@ def _setitem_with_indexer(self, indexer, value, name="iloc"): | |||
# so the object is the same | |||
index = self.obj._get_axis(i) | |||
labels = index.insert(len(index), key) | |||
self.obj._mgr = self.obj.reindex(labels, axis=i)._mgr | |||
taker = list(range(len(index))) + [-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add comments on what is happening here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob doesn't make much difference but can you just use np.arange here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [2]: %timeit list(np.arange(10))
1.6 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [3]: %timeit list(range(10))
238 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- sure that's small sample, is it always really small like this?
- needs to have ensure_int_platform generally (you can directly construct the np array like that)
- we always use np.arange for this so changing patterns is not helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure that's small sample, is it always really small like this?
In [2]: N = 10**6
In [3]: %timeit list(np.arange(N))
79.6 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit list(range(N))
32.9 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, better idea, never mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
N = 10
In [5]: %timeit arr = np.asarray(list(range(N)) + [1], dtype=np.intp)
2.52 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [6]: %timeit arr = np.arange(N + 1); arr[-1] = -1
533 ns ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
N = 10**6
In [8]: %timeit arr = np.arange(N + 1); arr[-1] = -1
397 µs ± 24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit arr = np.asarray(list(range(N)) + [1], dtype=np.intp)
98.6 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I think we should deprecate indexing expansion, pretty sure we have an issue for this. |
updted per comments + greenish |
also CategoricalIndex.insert