Skip to content

BUG: df.loc setitem-with-expansion with duplicate index #40096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 2, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

also CategoricalIndex.insert

@jbrockmendel
Copy link
Member Author

cc @phofl


exp_index = index.insert(len(index), key)
if isinstance(index, MultiIndex):
exp_index = index.insert(len(index), key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not needed?

LGTM otherwise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, thanks

df.loc["d", "A"] = 10
with pytest.raises(TypeError, match=msg):
df.loc["d", "C"] = 10
# Setting-with-expansion with a new key "d" that is not among caegories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth splitting this test (e.g. errors to a new tests) and even these cases

expected = DataFrame(exp_data, index=exp_index, columns=[0])

# Add new row, but no new columns
df = orig.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same might be worth splitting this up a bit

@@ -1641,7 +1641,12 @@ def _setitem_with_indexer(self, indexer, value, name="iloc"):
# so the object is the same
index = self.obj._get_axis(i)
labels = index.insert(len(index), key)
self.obj._mgr = self.obj.reindex(labels, axis=i)._mgr
taker = list(range(len(index))) + [-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add comments on what is happening here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob doesn't make much difference but can you just use np.arange here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [2]: %timeit list(np.arange(10))
1.6 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %timeit list(range(10))
238 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • sure that's small sample, is it always really small like this?
  • needs to have ensure_int_platform generally (you can directly construct the np array like that)
  • we always use np.arange for this so changing patterns is not helpful

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure that's small sample, is it always really small like this?

In [2]: N = 10**6

In [3]: %timeit list(np.arange(N))
79.6 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit list(range(N))
32.9 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, better idea, never mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N  = 10

In [5]: %timeit arr = np.asarray(list(range(N)) + [1], dtype=np.intp)
2.52 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit arr = np.arange(N + 1); arr[-1] = -1
533 ns ± 11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

N = 10**6

In [8]: %timeit arr = np.arange(N + 1); arr[-1] = -1
397 µs ± 24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit arr = np.asarray(list(range(N)) + [1], dtype=np.intp)
98.6 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 27, 2021
@jreback
Copy link
Contributor

jreback commented Feb 27, 2021

I think we should deprecate indexing expansion, pretty sure we have an issue for this.

@jbrockmendel
Copy link
Member Author

updted per comments + greenish

@jreback jreback added this to the 1.3 milestone Mar 2, 2021
@jreback jreback merged commit 13a97c2 into pandas-dev:master Mar 2, 2021
@jbrockmendel jbrockmendel deleted the bug-ci-insert branch March 2, 2021 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants