-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix issue with inserting duplicate columns in a dataframe (GH14291) #14384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -1563,3 +1563,4 @@ Bug Fixes | |||
- ``PeridIndex`` can now accept ``list`` and ``array`` which contains ``pd.NaT`` (:issue:`13430`) | |||
- Bug in ``df.groupby`` where ``.median()`` returns arbitrary values if grouped dataframe contains empty bins (:issue:`13629`) | |||
- Bug in ``Index.copy()`` where ``name`` parameter was ignored (:issue:`14302`) | |||
- Bug in ``DataFrame.insert`` where multiple calls with duplicate columns can fail (:issue:`14291`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be in 0.19.1.
@@ -2590,7 +2590,7 @@ def assign(self, **kwargs): | |||
|
|||
return data | |||
|
|||
def _sanitize_column(self, key, value): | |||
def _sanitize_column(self, key, value, broadcast=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to do the opposite here -- keep the default behavior of _sanitize_column
the same and override in the case of insert
where you specify the column. The insert
case where you don't want to broadcast the value is the special one. The more common case is:
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1)
df.index = [1, 2, 3]
df.loc[1, 'a'] = 3
You would override the defaults less often if you kept the default as True
and only overrode it in the special case of insert
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback.
@@ -302,7 +302,7 @@ def fillna(self, value=None, method=None, axis=0, inplace=False, | |||
# ---------------------------------------------------------------------- | |||
# Support different internal representation of SparseDataFrame | |||
|
|||
def _sanitize_column(self, key, value): | |||
def _sanitize_column(self, key, value, broadcast=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this option is never used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this in because insert can call _sanitize_column for DataFrames and SparseDataFrames. If the signatures of the _sanitize_column functions don't match, an error gets thrown. Do you know of a cleaner workaround?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is OK to leave it in here, but please add a comment explaining this
How does an insert with duplicates currently work for SparseDataFrames ? Does it have the same bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was surprising to me:
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1).to_sparse()
df.index = [1, 2, 3]
df.loc[1, 'a'] = 3
## -- End pasted text --
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-e95308cf58e6> in <module>()
3 df = pd.concat([df1, df1, df2], axis=1).to_sparse()
4 df.index = [1, 2, 3]
----> 5 df.loc[1, 'a'] = 3
/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in __setitem__(self, key, value)
138 key = com._apply_if_callable(key, self.obj)
139 indexer = self._get_setitem_indexer(key)
--> 140 self._setitem_with_indexer(indexer, value)
141
142 def _has_valid_type(self, k, axis):
/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in _setitem_with_indexer(self, indexer, value)
545 # scalar
546 for item in labels:
--> 547 setter(item, value)
548
549 else:
/Users/bkandel/code/forkz/pandas/pandas/core/indexing.pyc in setter(item, v)
453
454 def setter(item, v):
--> 455 s = self.obj[item]
456 pi = plane_indexer[0] if lplane_indexer == 1 else plane_indexer
457
/Users/bkandel/code/forkz/pandas/pandas/sparse/frame.pyc in __getitem__(self, key)
345 return self._getitem_array(key)
346 else:
--> 347 return self._get_item_cache(key)
348
349 @Appender(DataFrame.get_value.__doc__, indents=0)
/Users/bkandel/code/forkz/pandas/pandas/core/generic.py in _get_item_cache(self, item)
1384 if res is None:
1385 values = self._data.get(item)
-> 1386 res = self._box_item_values(item, values)
1387 cache[item] = res
1388 res._set_as_cached(item, self)
/Users/bkandel/code/forkz/pandas/pandas/core/frame.py in _box_item_values(self, key, values)
2391 items = self.columns[self.columns.get_loc(key)]
2392 if values.ndim == 2:
-> 2393 return self._constructor(values.T, columns=items, index=self.index)
2394 else:
2395 return self._box_col_values(values, items)
AttributeError: 'BlockManager' object has no attribute 'T'
Looks like the sparse version doesn't deal with the broadcasting issue at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkandel Can you open a separate issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the following, which worked as expected:
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [2,3,4]})
df = pd.concat([df1, df1, df2], axis=1).to_sparse()
df['a'] = [1,2,5]
_sanitize_columns doesn't do broadcasting, but the SparseDataFrame setter seems to handle this case properly.
When playing around with SparseDataFrames, I did hit the same error in _box_item_values, but I was going to open a separate issue for that when I got home. Couldn't really figure out why that if/else block was there to begin with, even after looking back at the commit history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche sure. I think fixing the behavior of the sparse data frame here is out of scope for this PR.
@paul-mannino Already looking good, apart from the other comments of @bkandel :
|
Current coverage is 85.26% (diff: 100%)@@ master #14384 diff @@
==========================================
Files 140 140
Lines 50634 50634
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 43173 43173
Misses 7461 7461
Partials 0 0
|
New changes addressed all of my comments. Thanks! |
# blocks) are always copied | ||
def _sanitize_column(self, key, value, broadcast=True): | ||
""" | ||
Ensures new columns (which go into the BlockManager as new blocks) are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a Parameters section
def _sanitize_column(self, key, value, broadcast=True): | ||
""" | ||
The "broadcast" parameter was added to match the method signature of | ||
DataFrame._sanitize_column. However, this method does not make use of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here (you can use a shared_doc to avoid repeating things)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of these files import shared_doc from pandas.core.generic, but pandas.core.generic doesn't have it's own definition for _sanitize_column. I can't find any other instances of someone adding to the shared_doc without immediately using the docstring, but I can do that if that's what you want. For now, I just updated the docs in both places.
# GH 14291 | ||
df = DataFrame() | ||
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True) | ||
df.insert(0, 'A', ['a', 'b', 'c'], allow_duplicates=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a different value each time so its clear that its not copying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is more appropriate in test_nonunique_indexes file.
@paul-mannino There is going something wrong with the PRs that were open before we transferred to another repo. Can you therefore close this one and open a new PR? (just from the same branch) |
git diff upstream/master | flake8 --diff
I've been sitting on this simple fix because it seems a little kludgy. SparseDataFrame doesn't do anything parallel to broadcasting in its _sanitize_column method... should it?