Data corruption when renaming to duplicate column names #7017

Closed
Gerenuk opened this issue May 1, 2014 · 1 comment

Gerenuk commented May 1, 2014

The following throws a very confusing error (pandas 0.13.1):

pd.DataFrame([[1,"abc", 1]], columns=["a", "b", "a"]).describe()

File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\frame.py", line 3790, in describe
numdata = self._get_numeric_data()
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\generic.py", line 1894, in _get_numeric_data
self._data.get_numeric_data()).finalize(self)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2596, in get_numeric_data
return self.get_data(**kwargs)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2610, in get_data
return self.combine(blocks)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2624, in combine
return self.__class__(new_blocks, new_axes, do_integrity_check=False)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2037, in __init__
self._set_ref_locs(do_refs=True)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2189, in _set_ref_locs
rl[loc] = (block, i)
IndexError: list assignment index out of range

pd.DataFrame([[1,"abc", 1]], columns=["a", "b", "a"]).info()
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\frame.py", line 1443, in info
counts = self.count()
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\frame.py", line 3862, in count
result = notnull(frame).sum(axis=axis)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\common.py", line 273, in notnull
res = isnull(obj)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\common.py", line 125, in isnull
return _isnull(obj)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\common.py", line 137, in _isnull_new
return obj._constructor(obj._data.apply(lambda x: isnull(x.values)))
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2384, in apply
do_integrity_check=do_integrity_check)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2037, in __init__
self._set_ref_locs(do_refs=True)
File "C:\Program Files\Python 3.3.3\lib\site-packages\pandas\core\internals.py", line 2177, in _set_ref_locs
'have _ref_locs set' % (block, labels))
AssertionError: Cannot create BlockManager._ref_locs because block [BoolBlock: [a, a], 2 x 1, dtype: bool] with duplicate items [Index(['a', 'b', 'a'], dtype='object')] does not have _ref_locs set

Note that there are multiple ways to avoid the error:

  • not having duplicate column names
  • changing the type of column "b" or deleting it!

In my program the duplicate names were accidental, but this behaviour, and the fact that it depends on an unrelated, non-duplicated column, still looks like a bug to me.
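
For anyone stuck on an affected version, the first workaround (not having duplicate column names) can be sketched as below. This is only an illustration: `dedupe_names` and `duplicated_names` are hypothetical helpers, not pandas functions, and the `name.1`, `name.2` suffix scheme is an arbitrary choice. They work on a plain list of names, so they do not depend on any particular pandas version.

```python
from collections import Counter


def duplicated_names(names):
    """Return the names that occur more than once.

    Hypothetical helper (not part of pandas), useful for spotting
    accidental duplicates before calling describe() or info().
    """
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def dedupe_names(names):
    """Return a list of names in which repeats get a numeric suffix.

    Hypothetical helper (not part of pandas): the first occurrence of a
    name is kept as-is, later ones become "name.1", "name.2", ...
    """
    seen = Counter()
    taken = set(names)
    result = []
    for name in names:
        count = seen[name]
        seen[name] += 1
        if count == 0:
            # First time this name appears: keep it unchanged.
            result.append(name)
            continue
        candidate = "{}.{}".format(name, count)
        # Skip suffixes that would collide with a name already in use.
        while candidate in taken:
            count += 1
            candidate = "{}.{}".format(name, count)
        taken.add(candidate)
        result.append(candidate)
    return result
```

With a DataFrame `df`, assigning `df.columns = dedupe_names(list(df.columns))` before calling `describe()` or `info()` should sidestep the error, at the cost of changing the column labels.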

jreback commented May 1, 2014

This was fixed in master/0.14 in #4421. Thanks for reporting.

jreback closed this as completed May 1, 2014