Skip to content

Pandas get_dummies validate columns input #28463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 22, 2019
Merged

Pandas get_dummies validate columns input #28463

merged 15 commits into from
Oct 22, 2019

Conversation

R1j1t
Copy link
Contributor

@R1j1t R1j1t commented Sep 16, 2019

The Test code is inspired from a similar tests in
RaiseError: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/arrays/categorical/test_operators.py#L97
Valid list_like: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/reshape/test_pivot.py#L706

This is my first pull request here and so I have taken all the necessary precautions, also I read the contributing guidelines. Not much sure of black pandas .

Problem:
If columns is not passed a list-like value a substantial decrease in performance was observed.

Proposed Solution:
As mentioned in the issue by @WillAyd that function should raise and hence based on similar situation in the codebase, I have edited the file to raise TypeError.

@jbrockmendel
Copy link
Member

What used to happen without this change? Would a less-useful error message be raised or would you get an incorrect result? For the whatsnew, add a bullet point in the appropriate subsection in doc/source/whatsnew/v1.0.0.rst

Removed method as thier is only 1 type of call for the function `get_dummies`
Asserting list-like values for `columns` parameter in `get_dummies`
@R1j1t
Copy link
Contributor Author

R1j1t commented Sep 17, 2019

Doc at doc/source/whatsnew/v1.0.0.rst was updated also, I have updated the PR with problem and proposed solution.

pd.Index(["baz", "zoo"]),
],
)
def test_get_dummies_with_list_like_values(self, values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this testing new behavior? Are we not already testing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the file reshape.py there was no check on dtype passed to one of the parameters columns and so on the issue it was mentioned add this check. Hence, after adding the test I added these tests specifically to columns parameter. And regarding doubt whether this test was already present, I checked all tests again and there was no test to check the inputs to parameter columns and so I referred to a similar test as mentioned in PR and added it. Please let me know if you spotted a similar test for get_dummies which I might have missed. I hope this clears it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like

result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype)
tests with both columns and dtype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line you are have pointed to does not check for string inputs to the columns parameters also it does not test for other list-like objects passed to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this test seems superfluous; this PR should just check for invalid values in columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this test. It would be great help if could you elaborate/point me to a similar test for my reference? Because as far i understand if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.

This test is redundant with

result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype)
.

We want to keep your test test_get_dummies_with_string_values, which is testing the code you've added in this PR.

@@ -863,6 +863,8 @@ def get_dummies(
# determine columns being encoded
if columns is None:
data_to_encode = data.select_dtypes(include=dtypes_to_encode)
elif not is_list_like(columns):
raise TypeError("Input must be a list-like of list-likes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message should state which parameter is incorrect (columns?)

And is it supposed to be a list-like of list-likes? This block is just checking that it's a sequence, but not making any assertion about what each element is.

Copy link
Contributor Author

@R1j1t R1j1t Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message should state which parameter is incorrect

I have updated the Error message based on your comment but let me point out this code which was the reason I did not update the error message.

making any assertion about what each element is.

I think this is out of scope because for all the functions definitions (involving list-like object as parameters ) I referred to in pandas, none of them had this check. @jbrockmendel Is this something which should have been checked?

pd.Index(["baz", "zoo"]),
],
)
def test_get_dummies_with_list_like_values(self, values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this test seems superfluous; this PR should just check for invalid values in columns.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 18, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like CI is failing. A whitespace issue in the whatsnew.

pd.Index(["baz", "zoo"]),
],
)
def test_get_dummies_with_list_like_values(self, values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.

This test is redundant with

result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype)
.

We want to keep your test test_get_dummies_with_string_values, which is testing the code you've added in this PR.

based on the comments, i have removed the list-like objects check
@R1j1t
Copy link
Contributor Author

R1j1t commented Sep 20, 2019

As per the instructions, I have removed the test_get_dummies_with_list_like_values

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as well - I just cleaned up some of the whatsnew whitespace ping on green

@R1j1t
Copy link
Contributor Author

R1j1t commented Sep 26, 2019

Any update on the PR?

@jbrockmendel
Copy link
Member

needs one more rebase, otherwise looks good to go cc @jreback

@WillAyd
Copy link
Member

WillAyd commented Oct 11, 2019

Just fixed up the merge conflict. This should be good to go when CI is green

@@ -608,6 +608,23 @@ def test_get_dummies_all_sparse(self):
)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("values", ["baz", "zoo"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these two values testing meaningfully distinct cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of adding 2 different values just for the check but they are testing the same case. I will remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one now!

@jbrockmendel
Copy link
Member

looks like trailing whitespace is choking the linter

@R1j1t
Copy link
Contributor Author

R1j1t commented Oct 12, 2019

I removed trailing white spaces. Now it should pass

@WillAyd WillAyd added this to the 1.0 milestone Oct 22, 2019
@WillAyd WillAyd merged commit fe97cad into pandas-dev:master Oct 22, 2019
@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

Thanks @R1j1t !

@R1j1t
Copy link
Contributor Author

R1j1t commented Oct 22, 2019

Thank everyone for the reviews. It was something new for me (adding test and updating docs) but I am sure this is my first of the many PR. I will start small but I love pandas and hence contributing in my free time.

HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this pull request Oct 29, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pandas get_dummies validate "columns" input
5 participants