Skip to content

WIP/REF: BlockManager.setitem_blockwise #39302

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jbrockmendel
Copy link
Member

Posting largely for discussion with @phofl.

Motivations:

  1. Try to end up with fewer code paths for setitem so that we can ensure consistent inplace/view/copy behavior (xref API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896)
  2. Simplify (ATM this does not simplify things)
  3. Perf: iterating over blocks instead of columns may save some copies (though doing the Index.intersection calls may go the other way) (havent done any profiling yet)

Non-trivial overlap with #39044.
Some of this would be simplified by #39163.
No logical overlap with #39266, but will require rebase.

@pep8speaks
Copy link

Hello @jbrockmendel! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 619:89: E501 line too long (91 > 88 characters)

Line 3222:89: E501 line too long (104 > 88 characters)

@phofl
Copy link
Member

phofl commented Jan 20, 2021

Is there a specific place where I can be of help?

@jbrockmendel
Copy link
Member Author

Mostly just want to keep you in the loop on what i have in mind since you're working on similar parts of the code.

@jbrockmendel
Copy link
Member Author

@phofl this fails the test added in #39280. Thoughts on an appropriate fix?

(a branch that only has the edits to DataFrame._setitem_array has the same failure)

@phofl
Copy link
Member

phofl commented Jan 23, 2021

Not per se a fix, but I had similar difficulties in #39341

We have to account for duplicate column names in these checks. Maybe using get_indexer_for or something like that if there is a faster alternative. Simply comparing the lenght does not work with duplicates

@phofl
Copy link
Member

phofl commented Jan 24, 2021

You could use

if self.columns.is_unique:
    len_key = len(key)
else:
    len_key = len(self.columns.get_indexer_non_unique(key))
if len(value) != len_key:
    raise ValueError("Columns must be same length as key")

as a helper functions for both len(value) != len(key) checks. Test will keep failing because the column is now cast to integer, but this is good since this is consistent with other setitem cases. If you like I could implement this in #39341. I am dispatching to the else block there right now, since this does the trick currently.

@jbrockmendel jbrockmendel added the Indexing Related to indexing on series/frames, not to indexes themselves label Jan 26, 2021
@jbrockmendel jbrockmendel added the Mothballed Temporarily-closed PR the author plans to return to label Jan 28, 2021
@jbrockmendel
Copy link
Member Author

mothballing

@jbrockmendel jbrockmendel deleted the ref-setitem_inplace branch November 8, 2021 16:36
@jbrockmendel jbrockmendel removed the Mothballed Temporarily-closed PR the author plans to return to label Nov 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants