-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Handle duplicate column names in select_dtypes and get_dummies #20839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20839 +/- ##
=========================================
Coverage ? 91.81%
=========================================
Files ? 153
Lines ? 49481
Branches ? 0
=========================================
Hits ? 45432
Misses ? 4049
Partials ? 0
Continue to review full report at Codecov.
|
does this have an open issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok, but need to have a look after you update comments
pandas/core/reshape/reshape.py
Outdated
else: | ||
with_dummies = [data.drop(columns_to_encode, axis=1)] | ||
with_dummies = [data.select_dtypes(exclude=['object', 'category'])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does the converse operation to https://github.com/pandas-dev/pandas/pull/20839/files/4a4f3093f89dfd9813fa176a7f2fa14d78affee6#diff-fef81b7e498e469973b2da18d19ff6f3R828.
Previously this was done via dropping on column names, but leads to extra columns being dropped when duplicates are present. This is a safe way to get all columns not in data_to_encode
(previously columns_to_encode
).
pandas/core/reshape/reshape.py
Outdated
if not len(item) == len(columns_to_encode): | ||
len_msg = len_msg.format(name=name, len_item=len(item), | ||
len_enc=len(columns_to_encode)) | ||
if not len(item) == columns_to_encode.shape[1]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure this is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check ensures that there is a prefix
and prefix_sep
specified for each columns to be encoded. We ensure that the lengths are the same for this.
pandas/core/reshape/reshape.py
Outdated
@@ -826,45 +826,49 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |||
|
|||
if columns is None: | |||
columns_to_encode = data.select_dtypes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is now not columns, so rename to something else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved.
pandas/tests/frame/test_dtypes.py
Outdated
@@ -287,6 +287,21 @@ def test_select_dtypes_include_exclude_mixed_scalars_lists(self): | |||
ei = df[['b', 'c', 'f', 'k']] | |||
assert_frame_equal(ri, ei) | |||
|
|||
def test_select_dtypes_duplicate_columns(self): | |||
df = DataFrame({'a': list('abc'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the PR number as a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved.
pandas/tests/frame/test_dtypes.py
Outdated
'c': np.arange(3, 6).astype('u1'), | ||
'd': np.arange(4.0, 7.0, dtype='float64'), | ||
'e': [True, False, True], | ||
'f': pd.date_range('now', periods=3).values}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this construction is actually not guaranteed under < py3.6, so use an OrderdDict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved.
pandas/tests/frame/test_dtypes.py
Outdated
'f': pd.date_range('now', periods=3).values}) | ||
df.columns = ['a', 'a', 'b', 'b', 'b', 'c'] | ||
|
||
e = DataFrame({'a': list(range(1, 4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e -> expected, r -> result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved.
pandas/tests/reshape/test_reshape.py
Outdated
@@ -465,6 +465,20 @@ def test_get_dummies_dont_sparsify_all_columns(self, sparse): | |||
|
|||
tm.assert_frame_equal(df[['GDP']], df2) | |||
|
|||
def test_get_dummies_duplicate_columns(self, df): | |||
df.columns = ["A", "A", "A"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used PR number instead as noted above.
Hi @jreback, thanks for the quick turn-around. I've addressed the comments above, opened an issue and linked it above. Let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. some doc comments
pandas/core/reshape/reshape.py
Outdated
columns_to_encode = data.select_dtypes( | ||
include=['object', 'category']).columns | ||
data_to_encode = data.select_dtypes( | ||
include=['object', 'category']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you define these at the top of the function as they are used in 2 places
pandas/core/reshape/reshape.py
Outdated
|
||
dummy = _get_dummies_1d(data[col], prefix=pre, prefix_sep=sep, | ||
dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a comment why are using col[1] here (IOW what is that)
|
||
if set(columns_to_encode) == set(data.columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add comments for these cases (IOW what they are for)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a whatsnew note as well, you can put in 0.23.0 (if you can update in next few days)
Thanks @jreback. I resolved comments and added the whatsnew note. |
6cfcbe7
to
c2d3cae
Compare
Rebased on master so checks pass now @jreback. |
thanks @kunalgosar nice patch! |
Functions
select_dtypes
andget_dummies
previously failed on DataFrames with duplicate column names and had strange behavior as shown below. This PR fixes that strange behavior.Previous behavior:
New behavior:
git diff upstream/master -u -- "*.py" | flake8 --diff