Preserve Extension type on cross section #22785


Merged
10 commits merged on Sep 26, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -546,6 +546,7 @@ Other API Changes
- :class:`pandas.io.formats.style.Styler` supports a ``number-format`` property when using :meth:`~pandas.io.formats.style.Styler.to_excel` (:issue:`22015`)
- :meth:`DataFrame.corr` and :meth:`Series.corr` now raise a ``ValueError`` along with a helpful error message instead of a ``KeyError`` when supplied with an invalid method (:issue:`22298`)
- :meth:`shift` will now always return a copy, instead of the previous behaviour of returning self when shifting by 0 (:issue:`22397`)
- Slicing a single row of a DataFrame with multiple ExtensionArrays of the same type now preserves the dtype, rather than coercing to object (:issue:`22784`)
Contributor:
double backticks on DataFrame

shouldn't this be in the EA section?

Contributor (Author):
Fixed.
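
For reference, the behaviour the whatsnew entry above describes, sketched with the nullable integer extension array that the new tests below also use (illustrative only, assuming pandas 0.24):

    import pandas as pd
    from pandas.core.arrays import integer_array

    df = pd.DataFrame({'A': integer_array([1, 2]),
                       'B': integer_array([3, 4])})

    # a single-row cross-section now keeps the shared extension dtype
    print(df.loc[0].dtype)   # Int64 (previously coerced to object)
    print(df.iloc[0].dtype)  # Int64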


.. _whatsnew_0240.deprecations:

38 changes: 30 additions & 8 deletions pandas/core/internals/managers.py
@@ -791,6 +791,11 @@ def _interleave(self):
"""
        dtype = _interleaved_dtype(self.blocks)

+        if is_extension_array_dtype(dtype):
+            # TODO: https://github.com/pandas-dev/pandas/issues/22791
+            # Give EAs some input on what happens here. Sparse needs this.
+            dtype = 'object'

        result = np.empty(self.shape, dtype=dtype)

        if result.shape[0] == 0:
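
Some context on the fallback above: numpy cannot allocate an ndarray with a pandas extension dtype, and the interleaved result is 2-D while extension arrays are only 1-D, so object is the workable target here. A minimal standalone illustration (CategoricalDtype is just a convenient extension dtype to demonstrate with; the exact error message varies by numpy version):

    import numpy as np
    import pandas as pd

    cat_dtype = pd.CategoricalDtype(categories=[1, 2])

    try:
        # numpy has no notion of pandas extension dtypes
        np.empty((2, 2), dtype=cat_dtype)
    except TypeError as err:
        print("np.empty rejected the extension dtype:", err)

    # the fallback _interleave takes instead
    result = np.empty((2, 2), dtype='object')
    print(result.dtype)  # object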
@@ -906,14 +911,25 @@ def fast_xs(self, loc):

        # unique
        dtype = _interleaved_dtype(self.blocks)

        n = len(items)
-        result = np.empty(n, dtype=dtype)
+        if is_extension_array_dtype(dtype):
+            # we'll eventually construct an ExtensionArray.
+            result = np.empty(n, dtype=object)
Contributor (Author):
Do people find this confusing? I can either

  1. duplicate the for loop, using list.append for EAs and inserting into result for the others
  2. use lists everywhere
  3. use this

I chose this implementation because I assume it's slightly faster for wide dataframes with a numpy type, compared to building a list and then calling np.asarray(result) at the end.
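
For concreteness, the two allocation strategies weighed in the options above, as a standalone sketch (illustrative only, not the PR code itself):

    import numpy as np

    values = [10, 20, 30]

    # option 3 (used in this PR): preallocate an object buffer and fill it in place
    result = np.empty(len(values), dtype=object)
    for i, val in enumerate(values):
        result[i] = val

    # option 2: append to a list and convert once at the end
    collected = []
    for val in values:
        collected.append(val)
    result_from_list = np.asarray(collected, dtype=object)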

Member:
This implementation looks good to me

+        else:
+            result = np.empty(n, dtype=dtype)

        for blk in self.blocks:
            # Such assignment may incorrectly coerce NaT to None
            # result[blk.mgr_locs] = blk._slice((slice(None), loc))
            for i, rl in enumerate(blk.mgr_locs):
                result[rl] = blk._try_coerce_result(blk.iget((i, loc)))

+        if is_extension_array_dtype(dtype):
+            result = dtype.construct_array_type()._from_sequence(
Contributor:
is this guaranteed to be 1d at this point?

Member:
result is created a few lines above with np.empty(n, dtype=object), so I assume yes

+                result, dtype=dtype
+            )

        return result

    def consolidate(self):
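
To make the conversion in fast_xs above concrete: the values are collected into a 1-D object buffer first, then the extension array is rebuilt from that buffer. A standalone sketch using the 0.24-era extension API (Int64Dtype is just an example, and _from_sequence is an internal constructor, so treat this as illustration rather than public API):

    import numpy as np
    import pandas as pd

    dtype = pd.Int64Dtype()

    # object buffer, filled one scalar at a time (as fast_xs does per block)
    buf = np.empty(3, dtype=object)
    for i, val in enumerate([1, 2, 3]):
        buf[i] = val

    # rebuild the extension array from the 1-D object buffer
    arr = dtype.construct_array_type()._from_sequence(buf, dtype=dtype)
    print(arr.dtype)  # Int64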
@@ -1855,16 +1871,22 @@ def _shape_compat(x):


def _interleaved_dtype(blocks):
-    if not len(blocks):
-        return None
+    # type: (List[Block]) -> Optional[Union[np.dtype, ExtensionDtype]]
+    """Find the common dtype for `blocks`.

-    dtype = find_common_type([b.dtype for b in blocks])
+    Parameters
+    ----------
+    blocks : List[Block]

-    # only numpy compat
-    if isinstance(dtype, (PandasExtensionDtype, ExtensionDtype)):
-        dtype = np.object
+    Returns
+    -------
+    dtype : Optional[Union[np.dtype, ExtensionDtype]]
+        None is returned when `blocks` is empty.
+    """
+    if not len(blocks):
+        return None

-    return dtype
+    return find_common_type([b.dtype for b in blocks])


def _consolidate(blocks):
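
With the numpy-only coercion removed, _interleaved_dtype now reports whatever find_common_type decides, which can be an ExtensionDtype when every block shares it. An illustrative sketch (import path as of this era of the pandas internals):

    import numpy as np
    import pandas as pd
    from pandas.core.dtypes.cast import find_common_type

    cat = pd.CategoricalDtype(categories=[1, 2])

    # identical extension dtypes are preserved...
    print(find_common_type([cat, cat]))                # category

    # ...while mixing with a numpy dtype still falls back to object
    print(find_common_type([cat, np.dtype('int64')]))  # object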
8 changes: 8 additions & 0 deletions pandas/tests/frame/test_dtypes.py
@@ -839,6 +839,14 @@ def test_constructor_list_str_na(self, string_dtype):
    def test_is_homogeneous(self, data, expected):
        assert data._is_homogeneous is expected

+    def test_asarray_homogenous(self):
+        df = pd.DataFrame({"A": pd.Categorical([1, 2]),
+                           "B": pd.Categorical([1, 2])})
+        result = np.asarray(df)
+        # may change from object in the future
+        expected = np.array([[1, 1], [2, 2]], dtype='object')
+        tm.assert_numpy_array_equal(result, expected)


class TestDataFrameDatetimeWithTZ(TestData):

28 changes: 28 additions & 0 deletions pandas/tests/indexing/test_indexing.py
@@ -1079,3 +1079,31 @@ def test_validate_indices_high():
def test_validate_indices_empty():
    with tm.assert_raises_regex(IndexError, "indices are out"):
        validate_indices(np.array([0, 1]), 0)


+def test_extension_array_cross_section():
+    # A cross-section of a homogeneous EA should be an EA
+    df = pd.DataFrame({
+        "A": pd.core.arrays.integer_array([1, 2]),
+        "B": pd.core.arrays.integer_array([3, 4])
+    }, index=['a', 'b'])
+    expected = pd.Series(pd.core.arrays.integer_array([1, 3]),
+                         index=['A', 'B'], name='a')
+    result = df.loc['a']
+    tm.assert_series_equal(result, expected)

+    result = df.iloc[0]
+    tm.assert_series_equal(result, expected)


+def test_extension_array_cross_section_converts():
+    df = pd.DataFrame({
+        "A": pd.core.arrays.integer_array([1, 2]),
+        "B": np.array([1, 2]),
+    }, index=['a', 'b'])
+    result = df.loc['a']
+    expected = pd.Series([1, 1], dtype=object, index=['A', 'B'], name='a')
+    tm.assert_series_equal(result, expected)

+    result = df.iloc[0]
+    tm.assert_series_equal(result, expected)