
BUG: DataFrame.equals should not care about block order (GH #9330) #9745


Closed


dsm054
Contributor

@dsm054 dsm054 commented Mar 28, 2015

Closes #9330, along with another version of the same problem reported on SO, by canonicalizing the block order during an equals comparison.

Tested at the block manager level and above it at the frame level.

@jreback jreback added the API Design and Internals (Related to non-user accessible pandas implementation) labels Mar 28, 2015
@jreback jreback added this to the 0.16.1 milestone Mar 28, 2015
@jreback
Contributor

jreback commented Mar 28, 2015

lgtm, pls add a release note!

@dsm054 dsm054 force-pushed the equals-disregard-block-order branch from 5e1da3c to cd6dc68 Compare March 29, 2015 01:46
@jreback
Contributor

jreback commented Mar 29, 2015

worth adding that test that round trips to hdf (in the example issue)?

@dsm054
Contributor Author

dsm054 commented Mar 29, 2015

Yeah, makes sense. [I'm hoping that using dtype.name sorts out the weird TypeError issue.] Where should that test go?

@jreback
Contributor

jreback commented Mar 29, 2015

put in io/tests/test_pytables.py

yes, I think you have to use dtype.name

the problem (and you may want to include a categorical in the test) is that a categorical dtype is not safe to compare to a numpy dtype (but the name is)

side issue: if you have 2 cat blocks (cats are block separated, 1 cat per block), then I am not sure how they sort
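
To illustrate the point about sorting on names rather than on the dtype objects themselves, here is a minimal sketch in plain Python. The two classes are hypothetical stand-ins, not the real pandas or NumPy types; they only model "objects that carry a `.name` but define no ordering against each other".

```python
class FakeNumpyDtype:
    """Hypothetical stand-in for a NumPy dtype (not the real class)."""

    def __init__(self, name):
        self.name = name


class FakeCategoricalDtype:
    """Hypothetical stand-in for a categorical dtype (not the real class)."""

    name = "category"


dtypes = [FakeNumpyDtype("int64"), FakeCategoricalDtype(), FakeNumpyDtype("float64")]

# Sorting the objects directly fails because they define no ordering.
try:
    sorted(dtypes)
    direct_sort_ok = True
except TypeError:
    direct_sort_ok = False

# Sorting on the string name is always well defined.
sorted_names = [d.name for d in sorted(dtypes, key=lambda d: d.name)]
print(direct_sort_ok, sorted_names)
```

Note that both numpy-like entries and the categorical entry end up in one deterministic order, but two entries with the same name would still tie, which is the side issue raised above.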

@dsm054
Contributor Author

dsm054 commented Mar 29, 2015

Urf. Getting the categorical stuff right is going to be a bit tricky.

Can I assume that NaN will never be an element in a category? I would have thought it would be disallowed entirely but I don't understand the output when you build one containing one.

@jreback
Contributor

jreback commented Apr 2, 2015

@dsm054 you can have a nan in a category

In [5]: df = DataFrame({'A' : Series(list('aabbca')).astype('category',categories=['c','a','b',np.nan]),
   ...:                 'B' : Series(list('aabbca')).astype('category',categories=['c','a','b'])})

In [6]: df
Out[6]: 
   A  B
0  a  a
1  a  a
2  b  b
3  b  b
4  c  c
5  a  a

In [7]: df._data
Out[7]: 
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
CategoricalBlock: slice(0, 1, 1), 1 x 6, dtype: category
CategoricalBlock: slice(1, 2, 1), 1 x 6, dtype: category

@dsm054
Contributor Author

dsm054 commented Apr 2, 2015

Blek. What's the right way to handle putting the categorical blocks -- which won't coalesce even if the categories are the same -- in a canonical order?

@jreback
Contributor

jreback commented Apr 2, 2015

so, I would compare all 'regular' blocks (e.g. non-Categorical/non-Sparse) like you are doing. But Categorical and Sparse blocks need special handling.

I think you can order by the mgr block order. The mgr locations map to how the data is represented in the actual DataFrame, and so will be the same. You could actually do this for all blocks, I think.

In [4]: df._data.blocks[0].mgr_locs.as_array
Out[4]: array([0])

In [5]: df._data.blocks[1].mgr_locs.as_array
Out[5]: array([1])

@dsm054 dsm054 force-pushed the equals-disregard-block-order branch 2 times, most recently from 518feaf to 8bde0f8 Compare April 2, 2015 23:51
@dsm054 dsm054 force-pushed the equals-disregard-block-order branch from 8bde0f8 to 18f25e4 Compare April 3, 2015 00:03
@dsm054
Contributor Author

dsm054 commented Apr 3, 2015

@jreback: okay, I've got a version which passes both my tests and the OP's original case, by sorting on a (block type name, mgr_locs) tuple. I don't understand how the mgr blocks work well enough to judge that side of things, unfortunately, so I don't know whether matching non-consolidated blocks can be in the wrong order even after this.

Currently passing everywhere except for what looks like an unrelated library build error; probably it'll work the next time it rebuilds.
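
The (block type name, mgr_locs) sort described above can be sketched roughly as follows. This is not the code in the PR: `Block`, `canonical_key`, and `blocks_equal` are hypothetical names, and the namedtuple is a simplified stand-in for an internal block that holds only a type name, the column positions it covers, and its values.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for an internal block:
# type_name ~ the block's dtype name, mgr_locs ~ the column positions it covers.
Block = namedtuple("Block", ["type_name", "mgr_locs", "values"])


def canonical_key(block):
    """Sort key: (type name, column positions as a tuple)."""
    return (block.type_name, tuple(block.mgr_locs))


def blocks_equal(left, right):
    """Compare two block lists while ignoring the order the blocks are stored in."""
    if len(left) != len(right):
        return False
    left = sorted(left, key=canonical_key)
    right = sorted(right, key=canonical_key)
    return all(
        canonical_key(a) == canonical_key(b) and a.values == b.values
        for a, b in zip(left, right)
    )


ints = Block("int64", [0], [[1, 2, 3]])
cat_b = Block("category", [1], [["a", "b", "a"]])
cat_c = Block("category", [2], [["x", "y", "x"]])

# Same blocks, stored in a different order.
same_data_other_order = blocks_equal([ints, cat_b, cat_c], [cat_c, ints, cat_b])

# Same layout, but one block holds different values.
changed = cat_c._replace(values=[["x", "x", "x"]])
different_data = blocks_equal([ints, cat_b, cat_c], [ints, cat_b, changed])

print(same_data_other_order, different_data)
```

In this sketch the two categorical blocks share a type name, so the column positions in the second tuple element are what keep their relative order deterministic.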

@jreback
Contributor

jreback commented Apr 3, 2015

hmm, can you add a test with 2 categoricals (to your mixed test case)

@dsm054
Contributor Author

dsm054 commented Apr 3, 2015

Beyond "a:i8;b:category;c:category2;d:category2", # categories, do you mean? Two categoricals only?

@jreback
Contributor

jreback commented Apr 3, 2015

sorry missed that part of the test

@jreback
Contributor

jreback commented Apr 5, 2015

merged via e9179fe

thanks @dsm054
always quality code from you!

Labels
API Design, Internals (Related to non-user accessible pandas implementation)
Development

Successfully merging this pull request may close these issues.

False negative on .equals() after read_hdf()
2 participants