Skip to content

DEPR: DataFrame.lookup #35224

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Sep 17, 2020
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
5bbddce
DEPR - 18262 - deprecate lookup
erfannariman Jul 11, 2020
02b50ac
DEPR - 18262 - changes black
erfannariman Jul 11, 2020
ca8bf05
Add test for deprecation lookup
erfannariman Jul 11, 2020
dea45b9
Added deprecation to whatsnew
erfannariman Jul 11, 2020
5f116ad
Add test for deprecation lookup
erfannariman Jul 11, 2020
0c04e90
FIX - 18262 - add warnings to other tests
erfannariman Jul 11, 2020
758468d
Merge remote-tracking branch 'upstream/master' into 18262-depr-lookup
erfannariman Jul 11, 2020
7ac3b32
DOC - 18262 - added example to lookup values
erfannariman Jul 11, 2020
2ab80cf
DOC - 18262 - point to example in depr
erfannariman Jul 11, 2020
94a6c0f
FIX - 18262 - deprecation warning before summary
erfannariman Jul 11, 2020
a339131
FIX - 18262 - whitespaces after comma
erfannariman Jul 11, 2020
269e4cc
FIX - 18262 - removed double line break
erfannariman Jul 11, 2020
0c40d69
Fix variable in ipython
erfannariman Jul 11, 2020
1ca23bc
18262 - removed linebreak
erfannariman Jul 12, 2020
9681a3d
18262 - Fix merge conflict
erfannariman Jul 15, 2020
6342ad2
18262 - replaced depr message
erfannariman Jul 15, 2020
3dfe19d
18262 - line break too long line
erfannariman Jul 15, 2020
187d47b
18262 - set header size
erfannariman Jul 15, 2020
db63df7
[FIX] - 18262 - Merge conflict
erfannariman Jul 28, 2020
227fad5
[FIX] - 18262 - removed extra dash header
erfannariman Jul 28, 2020
ce775ce
[FIX] - 18262 - reference to section in docs
erfannariman Jul 28, 2020
2ee7d09
[FIX] - 18262 - grammar / typos in docstring
erfannariman Jul 28, 2020
4c78311
Merge branch 'master' into 18262-depr-lookup
erfannariman Jul 28, 2020
f447185
Merge branch 'master' into 18262-depr-lookup
erfannariman Sep 13, 2020
dc2d367
moved depr version to 1.2
erfannariman Sep 13, 2020
293bd7a
test with linking to user guide
erfannariman Sep 13, 2020
cbca163
Remove line break
erfannariman Sep 13, 2020
90fa6a9
Merge branch 'master' into 18262-depr-lookup
erfannariman Sep 13, 2020
3eefd8e
Merge branch 'master' into 18262-depr-lookup
erfannariman Sep 14, 2020
4c3c163
Revert whatsnew v1.1.0
erfannariman Sep 14, 2020
b5a34e3
Added depr message in whatsnew v1.2.0
erfannariman Sep 14, 2020
ba4fb8a
replace query with loc
erfannariman Sep 14, 2020
6b91db6
add melt and loc to depr msg
erfannariman Sep 14, 2020
ff7724f
add dot
erfannariman Sep 14, 2020
104e3cb
added colon hyperlink
erfannariman Sep 15, 2020
25e78dd
updates
erfannariman Sep 16, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 12 additions & 5 deletions doc/source/user_guide/indexing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1475,17 +1475,24 @@ default value.
s.get('a') # equivalent to s['a']
s.get('x', default=-1)

The :meth:`~pandas.DataFrame.lookup` method
.. _indexing.lookup

Looking up values by index/column labels
-------------------------------------------

Sometimes you want to extract a set of values given a sequence of row labels
and column labels, and the ``lookup`` method allows for this and returns a
NumPy array. For instance:
and column labels, this can be achieved by ``DataFrame.melt`` combined by filtering the corresponding
rows with ``DataFrame.query``. For instance:

.. ipython:: python

dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])
dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])
df = pd.DataFrame({'col': ["A", "A", "B", "B"],
'A': [80, 23, np.nan, 22],
'B': [80, 55, 76, 67]})
df
melt = df.melt('col')
df['lookup'] = melt.query('col == variable')['value'].to_numpy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than query, use .loc here as its more idiomatic (and plus its a single statement). Do we actually need to convert to numpy? (e.g. to avoid alignment)

Copy link
Member Author

@erfannariman erfannariman Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I used to_numpy is indeed for the index alignment. With .loc and without to_numpy, I would need reset_index. So that would look like:

melt = df.melt('col')
melt = melt.loc[melt['col'] == melt['variable'], 'value']
df['lookup'] = melt.reset_index(drop=True)

print(df)
  col     A   B  lookup
0   A  80.0  80    80.0
1   A  23.0  55    23.0
2   B   NaN  76    76.0
3   B  22.0  67    67.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one do you prefer? I can understand that this is more idiomatic, althought it's a bit more code.

Copy link
Member Author

@erfannariman erfannariman Sep 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw now I'm typing it, I realize this method would fail if the index was not 0, .., n-1

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a horrible choice as an alternative for lookup. Essentially, melt = df.melt('col') duplicates the data (not to mention the extra variable column) while one can resolve to numpy indexing.

I'm not sure what's the implementation of lookup, but it's a real request emerging from what I could see on SO. To me pivot and pivot_table are more redundant than lookup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quanghm my response is to the tone of the argument

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not averse to the function of lookup but it's not going to be a top level api

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not averse to the function of lookup but it's not going to be a top level api

Yet another reason why I don't go and make a pull request. I don't understand what's other than top level api that I can implement lookup functionality, and still don't understand performant.

@quanghm my response is to the tone of the argument

Can you please be more specific? The way I see it, @erfannariman was basically saying that unless I made a pull request, I cannot criticize the suggested alternative as horrible even when I spent time to analyze, test and show that it is, in fact, horrible. OK, if horrible is an offensive word, my apology, English is not my first language. Let me re-phrase it is very bad.

I do feel that my effort to contribute here is not valued. That was the best I can afford to better Pandas. Not everyone can afford to make a pull request, modify the code, run and pass all the unit tests.

Thanks for all the work maintaining and developing Pandas anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I learned from the other discussion, you are not the one that has a say on bringing back lookup, and those that do don't seem to be willing to review the issue.

Opening a ticket with everything you wrote here has more chance to get reviewed by the core devs than to write all this on a closed PR. That way the other devs can see it as well, since I am quite sure they are not seeing this whole discussion.

If you don't feel comfortable open an issue, please let me know and I wil open one, since Jeff mentioned he's open for a (discussion) for re-implementing lookup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I already opened the ticket since it's better to have the discussion there: see #40140 @quanghm @jreback

df

.. _indexing.class:

Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -802,7 +802,7 @@ Deprecations
``check_less_precise`` parameter. (:issue:`13357`).
- :func:`DataFrame.melt` accepting a value_name that already exists is deprecated, and will be removed in a future version (:issue:`34731`)
- the ``center`` keyword in the :meth:`DataFrame.expanding` function is deprecated and will be removed in a future version (:issue:`20647`)

- :meth:`DataFrame.lookup` is deprecated and will be removed in a future version, use :meth:`DataFrame.melt` and :meth:`DataFrame.loc` instead (:issue:`18682`)


.. ---------------------------------------------------------------------------
Expand Down
11 changes: 11 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3809,6 +3809,11 @@ def lookup(self, row_labels, col_labels) -> np.ndarray:
Given equal-length arrays of row and column labels, return an
array of the values corresponding to each (row, col) pair.

.. deprecated:: 1.1.0
DataFrame.lookup is deprecated,
use DataFrame.melt and DataFrame.loc instead.
See "Indexing and selecting data" in the user guide for an example.

Parameters
----------
row_labels : sequence
Expand All @@ -3821,6 +3826,12 @@ def lookup(self, row_labels, col_labels) -> np.ndarray:
numpy.ndarray
The found values.
"""
msg = (
"The 'lookup' method is deprecated and will be"
"removed in a future version."
)
warnings.warn(msg, FutureWarning, stacklevel=2)

n = len(row_labels)
if n != len(col_labels):
raise ValueError("Row labels must have same size as column labels")
Expand Down
36 changes: 27 additions & 9 deletions pandas/tests/frame/indexing/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1340,7 +1340,8 @@ def test_lookup_float(self, float_frame):
df = float_frame
rows = list(df.index) * len(df.columns)
cols = list(df.columns) * len(df.index)
result = df.lookup(rows, cols)
with tm.assert_produces_warning(FutureWarning):
result = df.lookup(rows, cols)

expected = np.array([df.loc[r, c] for r, c in zip(rows, cols)])
tm.assert_numpy_array_equal(result, expected)
Expand All @@ -1349,7 +1350,8 @@ def test_lookup_mixed(self, float_string_frame):
df = float_string_frame
rows = list(df.index) * len(df.columns)
cols = list(df.columns) * len(df.index)
result = df.lookup(rows, cols)
with tm.assert_produces_warning(FutureWarning):
result = df.lookup(rows, cols)

expected = np.array(
[df.loc[r, c] for r, c in zip(rows, cols)], dtype=np.object_
Expand All @@ -1365,7 +1367,8 @@ def test_lookup_bool(self):
"mask_c": [False, True, False, True],
}
)
df["mask"] = df.lookup(df.index, "mask_" + df["label"])
with tm.assert_produces_warning(FutureWarning):
df["mask"] = df.lookup(df.index, "mask_" + df["label"])

exp_mask = np.array(
[df.loc[r, c] for r, c in zip(df.index, "mask_" + df["label"])]
Expand All @@ -1376,13 +1379,16 @@ def test_lookup_bool(self):

def test_lookup_raises(self, float_frame):
with pytest.raises(KeyError, match="'One or more row labels was not found'"):
float_frame.lookup(["xyz"], ["A"])
with tm.assert_produces_warning(FutureWarning):
float_frame.lookup(["xyz"], ["A"])

with pytest.raises(KeyError, match="'One or more column labels was not found'"):
float_frame.lookup([float_frame.index[0]], ["xyz"])
with tm.assert_produces_warning(FutureWarning):
float_frame.lookup([float_frame.index[0]], ["xyz"])

with pytest.raises(ValueError, match="same size"):
float_frame.lookup(["a", "b", "c"], ["a"])
with tm.assert_produces_warning(FutureWarning):
float_frame.lookup(["a", "b", "c"], ["a"])

def test_lookup_requires_unique_axes(self):
# GH#33041 raise with a helpful error message
Expand All @@ -1393,14 +1399,17 @@ def test_lookup_requires_unique_axes(self):

# homogeneous-dtype case
with pytest.raises(ValueError, match="requires unique index and columns"):
df.lookup(rows, cols)
with tm.assert_produces_warning(FutureWarning):
df.lookup(rows, cols)
with pytest.raises(ValueError, match="requires unique index and columns"):
df.T.lookup(cols, rows)
with tm.assert_produces_warning(FutureWarning):
df.T.lookup(cols, rows)

# heterogeneous dtype
df["B"] = 0
with pytest.raises(ValueError, match="requires unique index and columns"):
df.lookup(rows, cols)
with tm.assert_produces_warning(FutureWarning):
df.lookup(rows, cols)

def test_set_value(self, float_frame):
for idx in float_frame.index:
Expand Down Expand Up @@ -2232,3 +2241,12 @@ def test_object_casting_indexing_wraps_datetimelike():
assert blk.dtype == "m8[ns]" # we got the right block
val = blk.iget((0, 0))
assert isinstance(val, pd.Timedelta)


def test_lookup_deprecated():
# GH18262
df = pd.DataFrame(
{"col": ["A", "A", "B", "B"], "A": [80, 23, np.nan, 22], "B": [80, 55, 76, 67]}
)
with tm.assert_produces_warning(FutureWarning):
df.lookup(df.index, df["col"])