
BUG: GH9618 in read_msgpack where DataFrame has duplicate column names #10527


Merged: 1 commit, merged on Jul 18, 2015
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v0.17.0.txt
@@ -384,4 +384,6 @@ Bug Fixes

- Bug in operator equal on Index not being consistent with Series (:issue:`9947`)

- Reading "famafrench" data via ``DataReader`` results in HTTP 404 error because of the website url is changed (:issue:`10591`).

- Bug in `read_msgpack` where DataFrame to decode has duplicate column names (:issue:`9618`)
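
For context, a minimal sketch of the scenario this entry covers, assuming a pandas version that still ships the msgpack I/O (to_msgpack/read_msgpack were later deprecated in 0.25 and removed in 1.0):

import pandas as pd
import pandas.util.testing as tm

# A frame whose columns repeat a label -- the case GH 9618 is about.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'A', 'B'])

packed = df.to_msgpack()          # serialize to msgpack bytes
result = pd.read_msgpack(packed)  # decode again

# With this fix the duplicated 'A' columns survive the round trip intact;
# before it, the decoder could not place blocks unambiguously by label.
tm.assert_frame_equal(df, result)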
9 changes: 8 additions & 1 deletion pandas/io/packers.py
@@ -357,6 +357,7 @@ def encode(obj):
'klass': obj.__class__.__name__,
'axes': data.axes,
'blocks': [{'items': data.items.take(b.mgr_locs),
'locs': b.mgr_locs.as_array,
Contributor (author):
@jreback Is it better to add locs (like it is right now) so that new msgpack can be decoded in older versions, or replace items with locs in encode() so that the encoded msgpack is smaller?

Contributor:

No, like you have it now is ok. For back-compat we can only add attributes (and can't change the meaning at all), so this is how I wanted it. The new code will decode using .locs, as that avoids the duplicates problem, and can still read the older msgpacks. This is actually also forward-compat, in that a msgpack written by the current version can be read by an older version (so this is the best-case scenario). We don't have tests for forward-compat, but you can verify manually (as that would need the older code to run in the tests, which is 'odd').

Contributor:

Ok, you were looking at size. Well, that is a trivial addition to size (this is one array per block, almost negligible). All that said, I think we should remove this in the future (i.e. drop the forward compat and just use .locs instead of .items). So pls create an issue and I'll mark it as a future to-do.

'values': convert(b.values),
'shape': b.values.shape,
'dtype': b.dtype.num,
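
To make the compat discussion above concrete, here is a rough sketch of one encoded block as the diff suggests it now looks: both the label-based 'items' (kept so older readers still work) and the new position-based 'locs' are written. Illustrative values only; the real payload also carries compression and other metadata.

import numpy as np
import pandas as pd

# Schematic of one encoded block (illustrative values, not the exact wire format):
block_payload = {
    'items': pd.Index(['A', 'A']),  # column labels for this block (kept so older readers still work)
    'locs': np.array([0, 1]),       # integer positions -- unambiguous even when labels repeat
    'values': b'...',               # converted/compressed block values (placeholder)
    'shape': (2, 2),                # block shape
    'dtype': 7,                     # numpy dtype number
}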
@@ -485,9 +486,15 @@ def decode(obj):
def create_block(b):
values = unconvert(b['values'], dtype_for(b['dtype']),
b['compress']).reshape(b['shape'])

# locs handles duplicate column names, and should be used instead of items; see GH 9618
if 'locs' in b:
placement = b['locs']
else:
placement = axes[0].get_indexer(b['items'])
return make_block(values=values,
klass=getattr(internals, b['klass']),
placement=axes[0].get_indexer(b['items']))
placement=placement)

blocks = [create_block(b) for b in obj['blocks']]
return globals()[obj['klass']](BlockManager(blocks, axes))
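
A small sketch of why the position-based 'locs' path matters: get_indexer requires a unique index, so label-based lookup cannot recover block placement when column names repeat, while integer positions always can. This is roughly why the old label-only path could not handle duplicate column names; the exact error message varies across pandas versions.

import pandas as pd

cols = pd.Index(['A', 'A', 'B'])

# Position-based placement (what 'locs' stores) is always unambiguous:
print(cols.take([0, 1]))          # Index(['A', 'A'], dtype='object')

# Label-based lookup is not: get_indexer insists on a unique index and
# raises when labels repeat.
try:
    cols.get_indexer(['A', 'B'])
except Exception as exc:
    print(type(exc).__name__, exc)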
4 binary files not shown.
167 changes: 0 additions & 167 deletions pandas/io/tests/generate_legacy_pickles.py

This file was deleted.
