
Handle duplicate column names in select_dtypes and get_dummies #20839


Merged
merged 6 commits into pandas-dev:master on May 5, 2018

Conversation

@kunalgosar (Contributor) commented Apr 27, 2018

select_dtypes and get_dummies previously failed or behaved strangely on DataFrames with duplicate column names, as shown below. This PR fixes that behavior.

Previous behavior:

In [6]: df
Out[6]: 
  col1 col1
0    1    a
1    2    b

In [7]: df.select_dtypes(include=['int'])
Out[7]: 
Empty DataFrame
Columns: []
Index: [0, 1]

In [8]: pd.get_dummies(df)
Out[8]: 
   col1_('c', 'o', 'l', '1')  col1_('c', 'o', 'l', '1')
0                          1                          1
1                          1                          1

New behavior:

In [6]: df
Out[6]: 
  col1 col1
0    1    a
1    2    b

In [7]: df.select_dtypes(include=['int'])
Out[7]: 
   col1
0     1
1     2

In [8]: pd.get_dummies(df)
Out[8]: 
   col1  col1_a  col1_b
0     1       1       0
1     2       0       1
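
The construction of df is not shown above; a minimal way to build a frame with duplicate column names and mixed dtypes like the one in these examples (the exact construction used is an assumption) is:

In [1]: import pandas as pd

In [2]: df = pd.concat([pd.DataFrame({'col1': [1, 2]}),
   ...:                 pd.DataFrame({'col1': ['a', 'b']})], axis=1)

In [3]: df
Out[3]: 
  col1 col1
0    1    a
1    2    b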

codecov bot commented Apr 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@e8e6e89).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #20839   +/-   ##
=========================================
  Coverage          ?   91.81%           
=========================================
  Files             ?      153           
  Lines             ?    49481           
  Branches          ?        0           
=========================================
  Hits              ?    45432           
  Misses            ?     4049           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.21% <100%> (?)
#single     41.85% <0%> (?)

Impacted Files                    Coverage Δ
pandas/core/frame.py              97.13% <100%> (ø)
pandas/core/reshape/reshape.py    100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e8e6e89...9b72a83.

@jreback (Contributor) commented Apr 27, 2018

does this have an open issue?

@jreback (Contributor) left a comment:

looks ok, but need to have a look after you update comments

     else:
-        with_dummies = [data.drop(columns_to_encode, axis=1)]
+        with_dummies = [data.select_dtypes(exclude=['object', 'category'])]

Contributor:

why did you add this?

Contributor Author:

This does the converse operation to https://github.com/pandas-dev/pandas/pull/20839/files/4a4f3093f89dfd9813fa176a7f2fa14d78affee6#diff-fef81b7e498e469973b2da18d19ff6f3R828.

Previously this was done by dropping on column names, but that leads to extra columns being dropped when duplicates are present. This is a safe way to get all columns not in data_to_encode (previously columns_to_encode).
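
To illustrate the pitfall (a hypothetical example reusing the duplicate-column frame sketched after the PR description above, not code from this diff; the select_dtypes result assumes the fix in this PR): a label-based drop removes every column sharing the name, while selecting by dtype keeps the non-object duplicate.

In [4]: df.drop(['col1'], axis=1)    # label-based drop removes BOTH 'col1' columns
Out[4]: 
Empty DataFrame
Columns: []
Index: [0, 1]

In [5]: df.select_dtypes(exclude=['object', 'category'])    # keeps the integer 'col1'
Out[5]: 
   col1
0     1
1     2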

-        if not len(item) == len(columns_to_encode):
             len_msg = len_msg.format(name=name, len_item=len(item),
                                      len_enc=len(columns_to_encode))
+        if not len(item) == columns_to_encode.shape[1]:

Contributor:

are you sure this is correct?

Contributor Author:

This check ensures that a prefix and prefix_sep are specified for each column to be encoded; we verify that the lengths match.
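
As an illustration of what the check guards against (hypothetical data, not part of this diff), a list-like prefix whose length does not match the number of columns being encoded is rejected:

import pandas as pd

df = pd.DataFrame({'A': list('ab'), 'B': list('cd')})

# Both object columns are selected for encoding, but only one prefix is given,
# so the length check fails and get_dummies raises a ValueError.
pd.get_dummies(df, prefix=['only_one'])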

@@ -826,45 +826,49 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,

if columns is None:
columns_to_encode = data.select_dtypes(

Contributor:

this is now not columns, so rename to something else

Contributor Author:

resolved.

@@ -287,6 +287,21 @@ def test_select_dtypes_include_exclude_mixed_scalars_lists(self):
ei = df[['b', 'c', 'f', 'k']]
assert_frame_equal(ri, ei)

def test_select_dtypes_duplicate_columns(self):
df = DataFrame({'a': list('abc'),

Contributor:

add the PR number as a comment

Contributor Author:

resolved.

'c': np.arange(3, 6).astype('u1'),
'd': np.arange(4.0, 7.0, dtype='float64'),
'e': [True, False, True],
'f': pd.date_range('now', periods=3).values})

Contributor:

this construction (dict key order) is actually not guaranteed under Python < 3.6, so use an OrderedDict

Contributor Author:

resolved.
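
For reference, a minimal sketch of the suggested pattern, using only the columns visible in the snippet above (the test's remaining columns are omitted here):

from collections import OrderedDict

import numpy as np
import pandas as pd

# A plain dict literal does not guarantee key order before Python 3.6, so an
# OrderedDict is used to make the resulting column order deterministic.
df = pd.DataFrame(OrderedDict([
    ('a', list('abc')),
    ('c', np.arange(3, 6).astype('u1')),
    ('d', np.arange(4.0, 7.0, dtype='float64')),
    ('e', [True, False, True]),
    ('f', pd.date_range('now', periods=3).values),
]))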

'f': pd.date_range('now', periods=3).values})
df.columns = ['a', 'a', 'b', 'b', 'b', 'c']

e = DataFrame({'a': list(range(1, 4)),

Contributor:

e -> expected, r -> result

Contributor Author:

resolved.

@@ -465,6 +465,20 @@ def test_get_dummies_dont_sparsify_all_columns(self, sparse):

tm.assert_frame_equal(df[['GDP']], df2)

def test_get_dummies_duplicate_columns(self, df):
df.columns = ["A", "A", "A"]

Contributor:

issue number

Contributor Author:

used PR number instead as noted above.

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Apr 27, 2018

@kunalgosar (Contributor Author) commented:

Hi @jreback, thanks for the quick turn-around. I've addressed the comments above, opened an issue and linked it above.

Let me know what you think.

@kunalgosar changed the title from "Fix for handling duplicate column names in select_dtypes and get_dummies" to "Handle duplicate column names in select_dtypes and get_dummies" on Apr 28, 2018

@jreback (Contributor) left a comment:

looks good. some doc comments

-        columns_to_encode = data.select_dtypes(
-            include=['object', 'category']).columns
+        data_to_encode = data.select_dtypes(
+            include=['object', 'category'])

Contributor:

can you define these at the top of the function as they are used in 2 places


-            dummy = _get_dummies_1d(data[col], prefix=pre, prefix_sep=sep,
+            dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,

Contributor:

can you put a comment on why you are using col[1] here (IOW, what is that)?
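
For context, a sketch of the loop this line sits in, with the kind of comment being requested; the surrounding structure and keyword arguments are assumptions based on the visible diff, not a verbatim quote of the final code:

# data_to_encode.iteritems() yields (column_label, Series) pairs, so col[1] is
# the column's values.  Indexing with data[label] instead would return every
# column sharing that label when duplicate column names are present.
for col, pre, sep in zip(data_to_encode.iteritems(), prefix, prefix_sep):
    dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
                            dummy_na=dummy_na, sparse=sparse,
                            drop_first=drop_first)
    with_dummies.append(dummy)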


if set(columns_to_encode) == set(data.columns):

Contributor:

can you add comments for these cases (IOW what they are for)
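
For context, one way the requested comments could read, sketched around the branches visible in this diff (the exact conditions in the merged code may differ):

if set(columns_to_encode) == set(data.columns):
    # Every column is being dummy-encoded, so there is nothing to carry over
    # unchanged into the result.
    with_dummies = []
else:
    # Only a subset of columns is being encoded.  Keep the remaining columns
    # by dtype rather than by label, so that duplicate column names do not
    # cause extra columns to be dropped.
    with_dummies = [data.select_dtypes(exclude=['object', 'category'])]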

@jreback (Contributor) left a comment:

need a whatsnew note as well; you can put it in 0.23.0 (if you can update in the next few days)

@kunalgosar (Contributor Author) commented:

Thanks @jreback. I resolved the comments and added the whatsnew note.

@kunalgosar (Contributor Author) commented:

Rebased on master so checks pass now @jreback.

@jreback added this to the 0.23.0 milestone on May 5, 2018
@jreback merged commit bd4332f into pandas-dev:master on May 5, 2018

@jreback (Contributor) commented May 5, 2018

thanks @kunalgosar nice patch!

@kunalgosar deleted the get_dummies_fix branch on May 6, 2018 at 06:12

Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: none yet

Development

Successfully merging this pull request may close these issues.

select_dtypes and get_dummies break on duplicate columns
2 participants