Skip to content

REF: move reshaping of array for setitem from DataFrame into BlockManager internals #39722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 10, 2021

Conversation

jorisvandenbossche
Copy link
Member

I think ideally the DataFrame does not need to be aware of how the underlying manager stores the data (as 2D, transposed or not), so moving the logic of ensuring 2D/transposed values from the DataFrame set_item-related method into BlockManager.iset.

This will help for the ArrayManager, so we don't have to re-reshape there.

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation labels Feb 10, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 10, 2021
@jreback jreback merged commit 83479e1 into pandas-dev:master Feb 10, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-iset branch February 10, 2021 14:44
@@ -350,9 +350,9 @@ def test_fillna_fill_other(self, data_missing):


class TestReshaping(BaseNumPyTests, base.BaseReshapingTests):
@skip_nested
@pytest.mark.skip(reason="Incorrect expected.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand how testing with the PandasArrays works, but, within the merge implementation, we set a column in the resulting dataframe (the column with the key values) using an Index.
And doing an extract_array on an Int64Index gives a PandasDtype("int64") column when running those tests. And because in this PR I changed it such that the values being set are not converted to a 1D array before ending up in BlockManager.iset, the extension dtype is now preserved. And so the expected result from the base test is wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this turns out to be a PITA in a bunch of places. could just monkeypatch extract_array to extract PandasArray too

@@ -3889,7 +3885,6 @@ def insert(self, loc, column, value, allow_duplicates: bool = False) -> None:
"'self.flags.allows_duplicate_labels' is False."
)
value = self._sanitize_column(value)
value = _maybe_atleast_2d(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this removed all the usages of maybe_atleast_2d, so can remove the function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, will remove that in one of my next PRs

@@ -1013,6 +1013,9 @@ def value_getitem(placement):
return value

else:
if value.ndim == 2:
value = value.T

if value.ndim == self.ndim - 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can become an elif

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, will also include in a next PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it needs to be an if, because this case needs to end up in the final else to define the value_getitem function

@@ -1135,6 +1138,9 @@ def insert(self, loc: int, item: Hashable, value, allow_duplicates: bool = False
# insert to the axis; this could possibly raise a TypeError
new_axis = self.items.insert(loc, item)

if value.ndim == 2:
value = value.T

if value.ndim == self.ndim - 1 and not is_extension_array_dtype(value.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elif

@jreback
Copy link
Contributor

jreback commented Feb 10, 2021

sorry for so quick @jbrockmendel

will wait in future

@jorisvandenbossche
Copy link
Member Author

Yeah, I explicitly marked Brock for review ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants