-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Pandas get_dummies validate columns
input
#28463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Added validation for the argument passed to columns
The code is inspired from a similar tests in `RaiseError`: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/arrays/categorical/test_operators.py#L97 Valid `list_like`: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/reshape/test_pivot.py#L706
What used to happen without this change? Would a less-useful error message be raised or would you get an incorrect result? For the whatsnew, add a bullet point in the appropriate subsection in doc/source/whatsnew/v1.0.0.rst |
Removed method as thier is only 1 type of call for the function `get_dummies`
Asserting list-like values for `columns` parameter in `get_dummies`
Doc at doc/source/whatsnew/v1.0.0.rst was updated also, I have updated the PR with problem and proposed solution. |
pandas/tests/reshape/test_reshape.py
Outdated
pd.Index(["baz", "zoo"]), | ||
], | ||
) | ||
def test_get_dummies_with_list_like_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this testing new behavior? Are we not already testing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the file reshape.py
there was no check on dtype passed to one of the parameters columns
and so on the issue it was mentioned add this check. Hence, after adding the test I added these tests specifically to columns
parameter. And regarding doubt whether this test was already present, I checked all tests again and there was no test to check the inputs to parameter columns
and so I referred to a similar test as mentioned in PR and added it. Please let me know if you spotted a similar test for get_dummies
which I might have missed. I hope this clears it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like
result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line you are have pointed to does not check for string inputs to the columns
parameters also it does not test for other list-like objects passed to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this test seems superfluous; this PR should just check for invalid values in columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this test. It would be great help if could you elaborate/point me to a similar test for my reference? Because as far i understand if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.
This test is redundant with
result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype) |
We want to keep your test test_get_dummies_with_string_values
, which is testing the code you've added in this PR.
pandas/core/reshape/reshape.py
Outdated
@@ -863,6 +863,8 @@ def get_dummies( | |||
# determine columns being encoded | |||
if columns is None: | |||
data_to_encode = data.select_dtypes(include=dtypes_to_encode) | |||
elif not is_list_like(columns): | |||
raise TypeError("Input must be a list-like of list-likes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message should state which parameter is incorrect (columns?)
And is it supposed to be a list-like of list-likes? This block is just checking that it's a sequence, but not making any assertion about what each element is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message should state which parameter is incorrect
I have updated the Error message based on your comment but let me point out this code which was the reason I did not update the error message.
making any assertion about what each element is.
I think this is out of scope because for all the functions definitions (involving list-like object as parameters ) I referred to in pandas, none of them had this check. @jbrockmendel Is this something which should have been checked?
pandas/tests/reshape/test_reshape.py
Outdated
pd.Index(["baz", "zoo"]), | ||
], | ||
) | ||
def test_get_dummies_with_list_like_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this test seems superfluous; this PR should just check for invalid values in columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like CI is failing. A whitespace issue in the whatsnew.
pandas/tests/reshape/test_reshape.py
Outdated
pd.Index(["baz", "zoo"]), | ||
], | ||
) | ||
def test_get_dummies_with_list_like_values(self, values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the value is not list-like the error message will be raised because of this PR and if invalid value (i.e. value not in data columns) is passed to columns then error will be raised because value will not be present.
This test is redundant with
result = get_dummies(s_df, columns=s_df.columns, sparse=sparse, dtype=dtype) |
We want to keep your test test_get_dummies_with_string_values
, which is testing the code you've added in this PR.
based on the comments, i have removed the list-like objects check
As per the instructions, I have removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm as well - I just cleaned up some of the whatsnew whitespace ping on green
Any update on the PR? |
needs one more rebase, otherwise looks good to go cc @jreback |
Just fixed up the merge conflict. This should be good to go when CI is green |
pandas/tests/reshape/test_reshape.py
Outdated
@@ -608,6 +608,23 @@ def test_get_dummies_all_sparse(self): | |||
) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize("values", ["baz", "zoo"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these two values testing meaningfully distinct cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of adding 2 different values just for the check but they are testing the same case. I will remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one now!
looks like trailing whitespace is choking the linter |
I removed trailing white spaces. Now it should pass |
Thanks @R1j1t ! |
Thank everyone for the reviews. It was something new for me (adding test and updating docs) but I am sure this is my first of the many PR. I will start small but I love pandas and hence contributing in my free time. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The Test code is inspired from a similar tests in
RaiseError
: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/arrays/categorical/test_operators.py#L97Valid
list_like
: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/reshape/test_pivot.py#L706This is my first pull request here and so I have taken all the necessary precautions, also I read the contributing guidelines. Not much sure of
black pandas
.Problem:
If
columns
is not passed a list-like value a substantial decrease in performance was observed.Proposed Solution:
As mentioned in the issue by @WillAyd that function should raise and hence based on similar situation in the codebase, I have edited the file to raise
TypeError
.