
Handle duplicate column names in select_dtypes and get_dummies #20839


Merged
merged 6 commits into pandas-dev:master on May 5, 2018

Conversation

@kunalgosar (Contributor) commented Apr 27, 2018

select_dtypes and get_dummies previously failed or behaved strangely on DataFrames with duplicate column names, as shown below. This PR fixes that behavior.

Previous behavior:

In [6]: df
Out[6]: 
  col1 col1
0    1    a
1    2    b

In [7]: df.select_dtypes(include=['int'])
Out[7]: 
Empty DataFrame
Columns: []
Index: [0, 1]

In [8]: pd.get_dummies(df)
Out[8]: 
   col1_('c', 'o', 'l', '1')  col1_('c', 'o', 'l', '1')
0                          1                          1
1                          1                          1

New behavior:

In [6]: df
Out[6]: 
  col1 col1
0    1    a
1    2    b

In [7]: df.select_dtypes(include=['int'])
Out[7]: 
   col1
0     1
1     2

In [8]: pd.get_dummies(df)
Out[8]: 
   col1  col1_a  col1_b
0     1       1       0
1     2       0       1
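
The construction of df is not shown above; a minimal way to build a frame with duplicate column names and mixed dtypes like the one in these examples (the exact construction used is an assumption) is:

In [1]: import pandas as pd

In [2]: df = pd.concat([pd.DataFrame({'col1': [1, 2]}),
   ...:                 pd.DataFrame({'col1': ['a', 'b']})], axis=1)

In [3]: df
Out[3]: 
  col1 col1
0    1    a
1    2    b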

codecov bot commented Apr 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@e8e6e89).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #20839   +/-   ##
=========================================
  Coverage          ?   91.81%           
=========================================
  Files             ?      153           
  Lines             ?    49481           
  Branches          ?        0           
=========================================
  Hits              ?    45432           
  Misses            ?     4049           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.21% <100%> (?)
#single     41.85% <0%> (?)

Impacted Files                    Coverage Δ
pandas/core/frame.py              97.13% <100%> (ø)
pandas/core/reshape/reshape.py    100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e8e6e89...9b72a83.

@jreback (Contributor) commented Apr 27, 2018

does this have an open issue?

@jreback (Contributor) left a comment:

looks ok, but need to have a look after you update comments

     else:
-        with_dummies = [data.drop(columns_to_encode, axis=1)]
+        with_dummies = [data.select_dtypes(exclude=['object', 'category'])]

Contributor:

why did you add this?

Contributor Author:

This does the converse operation to https://github.com/pandas-dev/pandas/pull/20839/files/4a4f3093f89dfd9813fa176a7f2fa14d78affee6#diff-fef81b7e498e469973b2da18d19ff6f3R828.

Previously this was done by dropping on column names, but that leads to extra columns being dropped when duplicates are present. This is a safe way to get all columns not in data_to_encode (previously columns_to_encode).
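
To illustrate the pitfall (a hypothetical example reusing the duplicate-column frame sketched after the PR description above, not code from this diff; the select_dtypes result assumes the fix in this PR): a label-based drop removes every column sharing the name, while selecting by dtype keeps the non-object duplicate.

In [4]: df.drop(['col1'], axis=1)    # label-based drop removes BOTH 'col1' columns
Out[4]: 
Empty DataFrame
Columns: []
Index: [0, 1]

In [5]: df.select_dtypes(exclude=['object', 'category'])    # keeps the integer 'col1'
Out[5]: 
   col1
0     1
1     2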

-        if not len(item) == len(columns_to_encode):
             len_msg = len_msg.format(name=name, len_item=len(item),
                                      len_enc=len(columns_to_encode))
+        if not len(item) == columns_to_encode.shape[1]:

Contributor:

are you sure this is correct?

Contributor Author:

This check ensures that a prefix and prefix_sep are specified for each column to be encoded; we verify that the lengths match.
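
As an illustration of what the check guards against (hypothetical data, not part of this diff), a list-like prefix whose length does not match the number of columns being encoded is rejected:

import pandas as pd

df = pd.DataFrame({'A': list('ab'), 'B': list('cd')})

# Both object columns are selected for encoding, but only one prefix is given,
# so the length check fails and get_dummies raises a ValueError.
pd.get_dummies(df, prefix=['only_one'])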

@@ -826,45 +826,49 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,

if columns is None:
columns_to_encode = data.select_dtypes(

Contributor:

this is now not columns, so rename to something else

Contributor Author:

resolved.

@@ -287,6 +287,21 @@ def test_select_dtypes_include_exclude_mixed_scalars_lists(self):
ei = df[['b', 'c', 'f', 'k']]
assert_frame_equal(ri, ei)

def test_select_dtypes_duplicate_columns(self):
df = DataFrame({'a': list('abc'),

Contributor:

add the PR number as a comment

Contributor Author:

resolved.

'c': np.arange(3, 6).astype('u1'),
'd': np.arange(4.0, 7.0, dtype='float64'),
'e': [True, False, True],
'f': pd.date_range('now', periods=3).values})

Contributor:

this construction (dict key order) is actually not guaranteed under Python < 3.6, so use an OrderedDict

Contributor Author:

resolved.
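
For reference, a minimal sketch of the suggested pattern, using only the columns visible in the snippet above (the test's remaining columns are omitted here):

from collections import OrderedDict

import numpy as np
import pandas as pd

# A plain dict literal does not guarantee key order before Python 3.6, so an
# OrderedDict is used to make the resulting column order deterministic.
df = pd.DataFrame(OrderedDict([
    ('a', list('abc')),
    ('c', np.arange(3, 6).astype('u1')),
    ('d', np.arange(4.0, 7.0, dtype='float64')),
    ('e', [True, False, True]),
    ('f', pd.date_range('now', periods=3).values),
]))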

'f': pd.date_range('now', periods=3).values})
df.columns = ['a', 'a', 'b', 'b', 'b', 'c']

e = DataFrame({'a': list(range(1, 4)),

Contributor:

e -> expected, r -> result

Contributor Author:

resolved.

@@ -465,6 +465,20 @@ def test_get_dummies_dont_sparsify_all_columns(self, sparse):

tm.assert_frame_equal(df[['GDP']], df2)

def test_get_dummies_duplicate_columns(self, df):
df.columns = ["A", "A", "A"]

Contributor:

issue number

Contributor Author:

used PR number instead as noted above.

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Apr 27, 2018

@kunalgosar (Contributor Author) commented:

Hi @jreback, thanks for the quick turn-around. I've addressed the comments above, opened an issue and linked it above.

Let me know what you think.

@kunalgosar changed the title from "Fix for handling duplicate column names in select_dtypes and get_dummies" to "Handle duplicate column names in select_dtypes and get_dummies" on Apr 28, 2018

@jreback (Contributor) left a comment:

looks good. some doc comments

-        columns_to_encode = data.select_dtypes(
-            include=['object', 'category']).columns
+        data_to_encode = data.select_dtypes(
+            include=['object', 'category'])

Contributor:

can you define these at the top of the function as they are used in 2 places


-            dummy = _get_dummies_1d(data[col], prefix=pre, prefix_sep=sep,
+            dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,

Contributor:

can you put a comment on why you are using col[1] here (IOW, what is that)?
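
For context, a sketch of the loop this line sits in, with the kind of comment being requested; the surrounding structure and keyword arguments are assumptions based on the visible diff, not a verbatim quote of the final code:

# data_to_encode.iteritems() yields (column_label, Series) pairs, so col[1] is
# the column's values.  Indexing with data[label] instead would return every
# column sharing that label when duplicate column names are present.
for col, pre, sep in zip(data_to_encode.iteritems(), prefix, prefix_sep):
    dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
                            dummy_na=dummy_na, sparse=sparse,
                            drop_first=drop_first)
    with_dummies.append(dummy)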


if set(columns_to_encode) == set(data.columns):

Contributor:

can you add comments for these cases (IOW what they are for)
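
For context, one way the requested comments could read, sketched around the branches visible in this diff (the exact conditions in the merged code may differ):

if set(columns_to_encode) == set(data.columns):
    # Every column is being dummy-encoded, so there is nothing to carry over
    # unchanged into the result.
    with_dummies = []
else:
    # Only a subset of columns is being encoded.  Keep the remaining columns
    # by dtype rather than by label, so that duplicate column names do not
    # cause extra columns to be dropped.
    with_dummies = [data.select_dtypes(exclude=['object', 'category'])]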

@jreback (Contributor) left a comment:

need a whatsnew note as well; you can put it in 0.23.0 (if you can update in the next few days)

@kunalgosar (Contributor Author) commented:

Thanks @jreback. I resolved the comments and added the whatsnew note.

@kunalgosar (Contributor Author) commented:

Rebased on master so checks pass now @jreback.

@jreback added this to the 0.23.0 milestone on May 5, 2018
@jreback merged commit bd4332f into pandas-dev:master on May 5, 2018

@jreback (Contributor) commented May 5, 2018

thanks @kunalgosar nice patch!

@kunalgosar deleted the get_dummies_fix branch on May 6, 2018 at 06:12

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: none yet

Development

Successfully merging this pull request may close these issues.

select_dtypes and get_dummies break on duplicate columns
2 participants