Dataframe loc #575

randolf-scholz · 2023-03-14T11:32:11Z

Closes Simplify DataFrame.loc.__getitem__ overload #574
Tests added: test_loc_tuple_slice_list

twoertwein · 2023-03-14T16:05:04Z

pandas-stubs/core/frame.pyi

@@ -160,8 +160,8 @@ class _LocIndexerFrame(_LocIndexer):
        | Callable[[DataFrame], IndexType | MaskType | list[HashableT]]
        | list[HashableT]
        | tuple[
-            IndexType | MaskType | list[HashableT] | Hashable,
-            list[HashableT] | slice | Series[bool] | Callable,
+            Iterable[HashableT] | slice | Hashable,


I don't think you need HashableT for Iterable as it is covariant: Iterable[Hashable]
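To illustrate the variance point, here is a minimal sketch using only the standard library. The function names `count_labels` and `count_label_list` are hypothetical and are not part of pandas-stubs; the sketch only shows why a parameter typed `Iterable[Hashable]` accepts a `list[str]` without a type variable, whereas `list[Hashable]` would not.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable


def count_labels(labels: Iterable[Hashable]) -> int:
    """Iterable is covariant, so Iterable[str] is accepted where Iterable[Hashable] is expected."""
    return sum(1 for _ in labels)


def count_label_list(labels: list[Hashable]) -> int:
    """list is invariant, so a type checker rejects list[str] for this parameter."""
    return len(labels)


str_labels: list[str] = ["col1", "col2"]

# Accepted by a type checker: list[str] is an Iterable[str], hence an Iterable[Hashable].
n_labels = count_labels(str_labels)

# A type checker would flag the next call, because list[str] is not a list[Hashable]:
# count_label_list(str_labels)
```

At runtime both calls would work; the difference only shows up under a static type checker.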

Dr-Irv · 2023-03-14T15:34:39Z

pandas-stubs/core/frame.pyi

@@ -160,8 +160,8 @@ class _LocIndexerFrame(_LocIndexer):
        | Callable[[DataFrame], IndexType | MaskType | list[HashableT]]
        | list[HashableT]
        | tuple[
-            IndexType | MaskType | list[HashableT] | Hashable,
-            list[HashableT] | slice | Series[bool] | Callable,
+            Iterable[HashableT] | slice | Hashable,


Iterable[HashableT] is too wide, as it will match a plain string, which, if supplied, would return a Series.
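A small standard-library sketch of why a plain string is a problem here. The helper `looks_like_label_iterable` is hypothetical and uses a runtime `isinstance` check only as an analogy for what a static checker decides: since `str` is itself an iterable of `str`, it satisfies `Iterable[Hashable]` just as a list of labels does.

```python
from collections.abc import Hashable, Iterable


def looks_like_label_iterable(obj: object) -> bool:
    """Return True if obj is iterable and every element is hashable."""
    return isinstance(obj, Iterable) and all(isinstance(item, Hashable) for item in obj)


single_row_label = "row"           # intended as ONE label
several_row_labels = ["r", "o", "w"]  # intended as THREE labels

# Both satisfy the "iterable of hashables" description, so an annotation of
# Iterable[Hashable] cannot tell a single string label from a list of labels.
string_matches = looks_like_label_iterable(single_row_label)
list_matches = looks_like_label_iterable(several_row_labels)
```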

So please put back IndexType | MaskType | list[HashableT] and replace Hashable with _IndexSliceTuple | Callable

Including slice is fine.

If you make the suggested change, that will cause the test test_frame.py:test_loc_slice() to fail, but I now realize that the expression used there is ambiguous:

>>> df1 = pd.DataFrame(
...     {"x": [1, 2, 3, 4]},
...     index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["num", "let"]),
... )
>>> df1.loc[1, :]
     x
let
a    1
b    2
>>> df2 = pd.DataFrame({"x": [1,2,3,4]}, index=[10, 20, 30, 40])
>>> df2.loc[10, :]
x    1
Name: 10, dtype: int64

So an integer as the first argument could return either a DataFrame or a Series, depending on whether the underlying index is a MultiIndex or a regular Index.

The solution is then to add another overload in _LocIndexerFrame.__getitem__():

@overload
def __getitem__(self, idx: tuple[ScalarT, slice]) -> Series | DataFrame: ...
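As a rough illustration of how such an overload behaves, here is a toy sketch that does not use pandas at all. The classes `ToyLoc`, `Frame`, and `Row` are hypothetical stand-ins (for the indexer, `DataFrame`, and `Series` respectively), and the `multi_level` flag is an assumption used to mimic the MultiIndex case; none of this is the actual pandas-stubs code.

```python
from __future__ import annotations

from typing import Union, overload


class Frame:
    """Stand-in for a two-dimensional result."""


class Row:
    """Stand-in for a one-dimensional result."""


class ToyLoc:
    def __init__(self, multi_level: bool) -> None:
        # Whether the (imaginary) row index has more than one level.
        self._multi_level = multi_level

    # Scalar first element: the static type cannot know the index structure,
    # so the declared return type has to be a union.
    @overload
    def __getitem__(self, idx: tuple[int, slice]) -> Union[Row, Frame]: ...

    # List first element: always two-dimensional.
    @overload
    def __getitem__(self, idx: tuple[list[int], slice]) -> Frame: ...

    def __getitem__(self, idx):
        first, _ = idx
        if isinstance(first, list):
            return Frame()
        # A scalar selects a sub-frame on a multi-level index, a single row otherwise.
        return Frame() if self._multi_level else Row()


multi = ToyLoc(multi_level=True)[1, :]
flat = ToyLoc(multi_level=False)[1, :]
listed = ToyLoc(multi_level=False)[[1, 2], :]
```

The union return type pushes the narrowing onto the caller, which is the trade-off for an annotation that stays correct in both cases.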

Then modify the test in test_index_slice() to check that the type is Union[pd.Series, pd.DataFrame], and add another test corresponding to df2 above.
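A runtime-only sketch of the two cases such tests would need to cover, assuming pandas is installed. It reuses the `df1` and `df2` frames from the example above; the variable names `multi_result` and `flat_result` are made up for illustration, and the static side of the check (asserting the inferred type is the union) is not shown.

```python
import pandas as pd

# MultiIndex rows: a scalar first element selects a sub-frame.
df1 = pd.DataFrame(
    {"x": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["num", "let"]),
)
multi_result = df1.loc[1, :]

# Regular Index rows: the same expression shape selects a single row.
df2 = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[10, 20, 30, 40])
flat_result = df2.loc[10, :]

print(type(multi_result).__name__, type(flat_result).__name__)
```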

Matching str is fine here, because the second component of the tuple ensures multiple columns are selected.

Both df.loc["row", ["col1", "col2", "col3"]] and df.loc[["r", "o", "w"], ["col1", "col2", "col3"]] return DataFrame which is the only thing this overload ensures.

No, you are incorrect. The first example could create a Series:

>>> import pandas as pd
>>> df = pd.DataFrame({"x":[1,2,3], "y":[4,5,6]}, index=["a", "b", "c"])
>>> df
   x  y
a  1  4
b  2  5
c  3  6
>>> df.loc["a", ["x", "y"]]
x    1
y    4
Name: a, dtype: int64
>>> type(df.loc["a", ["x", "y"]])
<class 'pandas.core.series.Series'>

randolf-scholz · 2023-03-14T17:13:29Z

Closing since DataFrame.loc["str", columns] automatically coerces to Series.

slice is already implicitly included by IndexType.

Dr-Irv · 2023-03-14T17:14:53Z

@randolf-scholz should we close #574 as well?

Dr-Irv · 2023-03-14T17:17:40Z

Closing since DataFrame.loc["str", columns] automatically coerces to Series.

Actually, it could be a DataFrame too:

>>> df = pd.DataFrame({"x":[1,2,3,4]},
...     index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]))
>>> df
     x
a c  1
  d  2
b c  3
  d  4
>>> df.loc["a", ["x"]]
   x
c  1
d  2

That's why I suggested making other changes to fix this.

Would you mind creating another issue? (Or I can do it)

randolf-scholz added 4 commits March 14, 2023 11:32

fix pandas-dev#573

1e86aae

Added unit-test test_loc_tuple.

dc24486

abandoned pandas-dev#573, but reworked as pandas-dev#574

73fc26c

removed surplus empty space

ea40fde

twoertwein reviewed Mar 14, 2023

Dr-Irv requested changes Mar 14, 2023

Dr-Irv mentioned this pull request Mar 14, 2023

Simplify DataFrame.loc.__getitem__ overload #574

Closed

randolf-scholz closed this Mar 14, 2023

randolf-scholz deleted the dataframe_loc branch March 14, 2023 17:14

Dr-Irv mentioned this pull request Mar 14, 2023

Fix .loc to handle ambiguity if a single Scalar is first element of a tuple #576

Closed

randolf-scholz restored the dataframe_loc branch July 25, 2023 13:53
