BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

paul-mannino · 2016-10-10T01:02:27Z

closes BUG: DataFrame.insert with allow_duplicates=True fails when already duplicates present #14291
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

I've been sitting on this simple fix because it seems a little kludgy. SparseDataFrame doesn't do anything parallel to broadcasting in its _sanitize_column method... should it?

bkandel · 2016-10-10T02:08:35Z

doc/source/whatsnew/v0.19.0.txt

@@ -1563,3 +1563,4 @@ Bug Fixes
 - ``PeridIndex`` can now accept ``list`` and ``array`` which contains ``pd.NaT`` (:issue:`13430`)
 - Bug in ``df.groupby`` where ``.median()`` returns arbitrary values if grouped dataframe contains empty bins (:issue:`13629`)
 - Bug in ``Index.copy()`` where ``name`` parameter was ignored (:issue:`14302`)
+- Bug in ``DataFrame.insert`` where multiple calls with duplicate columns can fail (:issue:`14291`)


Should be in 0.19.1.

bkandel · 2016-10-10T02:19:10Z

pandas/core/frame.py

@@ -2590,7 +2590,7 @@ def assign(self, **kwargs):

        return data

-    def _sanitize_column(self, key, value):
+    def _sanitize_column(self, key, value, broadcast=False):


I think it would be better to do the opposite here -- keep the default behavior of _sanitize_column the same and override in the case of insert where you specify the column. The insert case where you don't want to broadcast the value is the special one. The more common case is:

df1 = pd.DataFrame({'a': [1, 2, 3]}) df2 = pd.DataFrame({'b': [2,3,4]}) df = pd.concat([df1, df1, df2], axis=1) df.index = [1, 2, 3] df.loc[1, 'a'] = 3

You would override the defaults less often if you kept the default as True and only overrode it in the special case of insert.

Thanks for the feedback.

bkandel · 2016-10-10T02:21:04Z

pandas/sparse/frame.py

@@ -302,7 +302,7 @@ def fillna(self, value=None, method=None, axis=0, inplace=False,
    # ----------------------------------------------------------------------
    # Support different internal representation of SparseDataFrame

-    def _sanitize_column(self, key, value):
+    def _sanitize_column(self, key, value, broadcast=False):


Looks like this option is never used?

I added this in because insert can call _sanitize_column for DataFrames and SparseDataFrames. If the signatures of the _sanitize_column functions don't match, an error gets thrown. Do you know of a cleaner workaround?

I think it is OK to leave it in here, but please add a comment explaining this

How does an insert with duplicates currently work for SparseDataFrames ? Does it have the same bug?

So this was surprising to me:

df1 = pd.DataFrame({'a': [1, 2, 3]}) df2 = pd.DataFrame({'b': [2,3,4]}) df = pd.concat([df1, df1, df2], axis=1).to_sparse() df.index = [1, 2, 3] df.loc[1, 'a'] = 3 ## -- End pasted text -- --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-11-e95308cf58e6> in <module>() 3 df = pd.concat([df1, df1, df2], axis=1).to_sparse() 4 df.index = [1, 2, 3] ----> 5 df.loc[1, 'a'] = 3 /Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in __setitem__(self, key, value) 138 key = com._apply_if_callable(key, self.obj) 139 indexer = self._get_setitem_indexer(key) --> 140 self._setitem_with_indexer(indexer, value) 141 142 def _has_valid_type(self, k, axis): /Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in _setitem_with_indexer(self, indexer, value) 545 # scalar 546 for item in labels: --> 547 setter(item, value) 548 549 else: /Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in setter(item, v) 453 454 def setter(item, v): --> 455 s = self.obj[item] 456 pi = plane_indexer[0] if lplane_indexer == 1 else plane_indexer 457 /Users/bkandel/code/forkz/pandas/pandas/sparse/frame.pyc in __getitem__(self, key) 345 return self._getitem_array(key) 346 else: --> 347 return self._get_item_cache(key) 348 349 @Appender(DataFrame.get_value.__doc__, indents=0) /Users/bkandel/code/forkz/pandas/pandas/core/generic.py in _get_item_cache(self, item) 1384 if res is None: 1385 values = self._data.get(item) -> 1386 res = self._box_item_values(item, values) 1387 cache[item] = res 1388 res._set_as_cached(item, self) /Users/bkandel/code/forkz/pandas/pandas/core/frame.py in _box_item_values(self, key, values) 2391 items = self.columns[self.columns.get_loc(key)] 2392 if values.ndim == 2: -> 2393 return self._constructor(values.T, columns=items, index=self.index) 2394 else: 2395 return self._box_col_values(values, items) AttributeError: 'BlockManager' object has no attribute 'T'

Looks like the sparse version doesn't deal with the broadcasting issue at all.

@bkandel Can you open a separate issue for this?

I checked the following, which worked as expected:

df1 = pd.DataFrame({'a': [1, 2, 3]}) df2 = pd.DataFrame({'b': [2,3,4]}) df = pd.concat([df1, df1, df2], axis=1).to_sparse() df['a'] = [1,2,5]

_sanitize_columns doesn't do broadcasting, but the SparseDataFrame setter seems to handle this case properly.

When playing around with SparseDataFrames, I did hit the same error in _box_item_values, but I was going to open a separate issue for that when I got home. Couldn't really figure out why that if/else block was there to begin with, even after looking back at the commit history.

@jorisvandenbossche sure. I think fixing the behavior of the sparse data frame here is out of scope for this PR.

jorisvandenbossche · 2016-10-10T12:00:11Z

@paul-mannino Already looking good, apart from the other comments of @bkandel :

Travis is failing due to some lint error: pandas/tests/frame/test_mutate_columns.py:171:80: E501 line too long (101 > 79 characters)
Can you add a docstring to _sanitize_columns where you document the new broadcast parameter?

codecov-io · 2016-10-11T01:47:16Z

Current coverage is 85.26% (diff: 100%)

Merging #14384 into master will not change coverage

@@             master     #14384   diff @@
==========================================
  Files           140        140          
  Lines         50634      50634          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          43173      43173          
  Misses         7461       7461          
  Partials          0          0

Powered by Codecov. Last update d98e982...a00f0fe

…s-dev#14291)

bkandel · 2016-10-13T15:26:21Z

New changes addressed all of my comments. Thanks!

jreback · 2016-10-13T21:06:56Z

pandas/core/frame.py

-        # blocks) are always copied
+    def _sanitize_column(self, key, value, broadcast=True):
+        """
+        Ensures new columns (which go into the BlockManager as new blocks) are


pls add a Parameters section

jreback · 2016-10-13T21:07:27Z

pandas/sparse/frame.py

+    def _sanitize_column(self, key, value, broadcast=True):
+        """
+        The "broadcast" parameter was added to match the method signature of
+        DataFrame._sanitize_column. However, this method does not make use of


same here (you can use a shared_doc to avoid repeating things)

Both of these files import shared_doc from pandas.core.generic, but pandas.core.generic doesn't have it's own definition for _sanitize_column. I can't find any other instances of someone adding to the shared_doc without immediately using the docstring, but I can do that if that's what you want. For now, I just updated the docs in both places.

jreback · 2016-10-13T21:08:07Z

pandas/tests/frame/test_mutate_columns.py

+        # GH 14291
+        df = DataFrame()
+        df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)
+        df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True)


use a different value each time so its clear that its not copying

this is more appropriate in test_nonunique_indexes file.

jorisvandenbossche · 2016-10-15T21:09:31Z

@paul-mannino There is going something wrong with the PRs that were open before we transferred to another repo. Can you therefore close this one and open a new PR? (just from the same branch)

bkandel reviewed Oct 10, 2016

View reviewed changes

jorisvandenbossche added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 10, 2016

paul-mannino force-pushed the master branch from a679d72 to d04b947 Compare October 11, 2016 01:47

BUG: Fix issue with inserting duplicate columns in a dataframe (panda…

a00f0fe

…s-dev#14291)

paul-mannino force-pushed the master branch from d04b947 to a00f0fe Compare October 11, 2016 02:49

jreback reviewed Oct 13, 2016

View reviewed changes

bkandel mentioned this pull request Oct 14, 2016

BUG: Assigning values in SparseDataFrame with duplicate columns fails #14427

Closed

paul-mannino force-pushed the master branch from a00f0fe to 81a57fc Compare October 15, 2016 15:12

paul-mannino closed this Oct 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

paul-mannino commented Oct 10, 2016

bkandel Oct 10, 2016

bkandel Oct 10, 2016

paul-mannino Oct 11, 2016

bkandel Oct 10, 2016

paul-mannino Oct 10, 2016 •

edited

Loading

jorisvandenbossche Oct 10, 2016

bkandel Oct 10, 2016

jorisvandenbossche Oct 10, 2016

paul-mannino Oct 10, 2016 •

edited

Loading

bkandel Oct 10, 2016

jorisvandenbossche commented Oct 10, 2016

codecov-io commented Oct 11, 2016 •

edited

Loading

bkandel commented Oct 13, 2016

jreback Oct 13, 2016

jreback Oct 13, 2016

paul-mannino Oct 15, 2016

jreback Oct 13, 2016

jreback Oct 13, 2016

jorisvandenbossche commented Oct 15, 2016

BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384

Conversation

paul-mannino commented Oct 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paul-mannino Oct 10, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paul-mannino Oct 10, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 10, 2016

codecov-io commented Oct 11, 2016 • edited Loading

Current coverage is 85.26% (diff: 100%)

bkandel commented Oct 13, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Oct 15, 2016

paul-mannino Oct 10, 2016 •

edited

Loading

paul-mannino Oct 10, 2016 •

edited

Loading

codecov-io commented Oct 11, 2016 •

edited

Loading