-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: move reshaping of array for setitem from DataFrame into BlockManager internals #39722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -350,9 +350,9 @@ def test_fillna_fill_other(self, data_missing): | |||
|
|||
|
|||
class TestReshaping(BaseNumPyTests, base.BaseReshapingTests): | |||
@skip_nested | |||
@pytest.mark.skip(reason="Incorrect expected.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand how testing with the PandasArrays works, but, within the merge
implementation, we set a column in the resulting dataframe (the column with the key values) using an Index.
And doing an extract_array
on an Int64Index gives a PandasDtype("int64") column when running those tests. And because in this PR I changed it such that the values being set are not converted to a 1D array before ending up in BlockManager.iset, the extension dtype is now preserved. And so the expected result from the base test is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this turns out to be a PITA in a bunch of places. could just monkeypatch extract_array to extract PandasArray too
@@ -3889,7 +3885,6 @@ def insert(self, loc, column, value, allow_duplicates: bool = False) -> None: | |||
"'self.flags.allows_duplicate_labels' is False." | |||
) | |||
value = self._sanitize_column(value) | |||
value = _maybe_atleast_2d(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this removed all the usages of maybe_atleast_2d, so can remove the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, will remove that in one of my next PRs
@@ -1013,6 +1013,9 @@ def value_getitem(placement): | |||
return value | |||
|
|||
else: | |||
if value.ndim == 2: | |||
value = value.T | |||
|
|||
if value.ndim == self.ndim - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can become an elif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, will also include in a next PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it needs to be an if
, because this case needs to end up in the final else
to define the value_getitem
function
@@ -1135,6 +1138,9 @@ def insert(self, loc: int, item: Hashable, value, allow_duplicates: bool = False | |||
# insert to the axis; this could possibly raise a TypeError | |||
new_axis = self.items.insert(loc, item) | |||
|
|||
if value.ndim == 2: | |||
value = value.T | |||
|
|||
if value.ndim == self.ndim - 1 and not is_extension_array_dtype(value.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif
sorry for so quick @jbrockmendel will wait in future |
Yeah, I explicitly marked Brock for review ;) |
I think ideally the DataFrame does not need to be aware of how the underlying manager stores the data (as 2D, transposed or not), so moving the logic of ensuring 2D/transposed values from the DataFrame set_item-related method into
BlockManager.iset
.This will help for the ArrayManager, so we don't have to re-reshape there.