BUG: DataFrame.equals should not care about block order (GH #9330) #9745

dsm054 · 2015-03-28T22:43:43Z

Closes #9330, and another version of the same problem reported on SO, by canonicalizing the block order during an equals comparison.

Tested at the block manager level and above it at the frame level.

jreback · 2015-03-28T23:38:58Z

lgtm, pls add a release note!

jreback · 2015-03-29T01:54:19Z

worth adding that test that round trips to hdf (in the example issue)?

dsm054 · 2015-03-29T01:56:02Z

Yeah, makes sense. [I'm hoping that using dtype.name sorts out the weird TypeError issue.] Where should that test go?

jreback · 2015-03-29T02:00:56Z

put in io/tests/test_pytables.py

yes I think u have to use dtype.name

the prob (and maybe want to include a categorical in the test) is that a categorical dtype is not safe to compare to a numpy dtype (but the name is)

side issue is that if u have 2 cat blocks (cats are block separated, 1 cat per) then I am not sure how they sort
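The point about comparing names rather than dtype objects can be illustrated with a minimal sketch. The two classes below are toy stand-ins invented for illustration, not pandas or NumPy types; the only assumption is that the real dtype objects, like these, do not define a mutual ordering, which may be what produced the TypeError mentioned above.

```python
class NumpyLikeDtype:
    """Toy stand-in for a NumPy dtype (hypothetical); defines no ordering."""
    def __init__(self, name):
        self.name = name


class CategoryLikeDtype:
    """Toy stand-in for a categorical dtype (hypothetical); defines no ordering."""
    name = "category"


dtypes = [NumpyLikeDtype("int64"), CategoryLikeDtype(), NumpyLikeDtype("float64")]

# Sorting the objects directly fails because they cannot be compared.
try:
    sorted(dtypes)
    direct_sort_ok = True
except TypeError:
    direct_sort_ok = False

# Sorting on the string name works, since str has a total order.
names_in_order = [d.name for d in sorted(dtypes, key=lambda d: d.name)]
```

Note that two blocks sharing the same name (for example two categorical blocks) tie under this key, which is the side issue raised above.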

dsm054 · 2015-03-29T02:20:59Z

Urf. Getting the categorical stuff right is going to be a bit tricky.

Can I assume that NaN will never be an element in a category? I would have thought it would be disallowed entirely but I don't understand the output when you build one containing one.

jreback · 2015-04-02T21:37:26Z

@dsm054 you can have a nan in a category

In [5]: df = DataFrame({'A' : Series(list('aabbca')).astype('category',categories=['c','a','b',np.nan]),
   ...:                 'B' : Series(list('aabbca')).astype('category',categories=['c','a','b'])})

In [6]: df
Out[6]: 
   A  B
0  a  a
1  a  a
2  b  b
3  b  b
4  c  c
5  a  a

In [7]: df._data
Out[7]: 
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
CategoricalBlock: slice(0, 1, 1), 1 x 6, dtype: category
CategoricalBlock: slice(1, 2, 1), 1 x 6, dtype: category

dsm054 · 2015-04-02T22:36:26Z

Blek. What's the right way to handle putting the categorical blocks -- which won't coalesce even if the categories are the same -- in a canonical order?

jreback · 2015-04-02T22:51:43Z

so, I would compare all 'regular' blocks (e.g. non-Categorical/non-Sparse) like you are doing. But these need special handling.

I think you can order by the mgr block order. These map to how it is represented in the actual DataFrame, and so will be the same. You could actually do this for all blocks I think.

In [4]: df._data.blocks[0].mgr_locs.as_array
Out[4]: array([0])

In [5]: df._data.blocks[1].mgr_locs.as_array
Out[5]: array([1])


dsm054 · 2015-04-03T01:58:01Z

@jreback: okay, I've got a version which passes both my tests and the OP's original case, by sorting on a (block type name, mgr_locs) tuple. I don't understand how the mgr blocks work well enough to judge that side of things, unfortunately, so I don't know whether matching non-consolidated blocks can be in the wrong order even after this.

Currently passing everywhere except for what looks like an unrelated library build error; probably it'll work the next time it rebuilds.
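The idea of sorting on a (block type name, mgr_locs) tuple before comparing can be sketched as follows. This is a toy model under stated assumptions, not the code merged in this PR: `Block`, `canonical_order`, and `blocks_equal` are hypothetical names, and `mgr_locs` is modeled as a plain tuple of column positions.

```python
from collections import namedtuple

# Hypothetical stand-in for an internal block: a type/dtype name, the column
# positions it occupies (the role mgr_locs plays), and its values.
Block = namedtuple("Block", ["type_name", "mgr_locs", "values"])


def canonical_order(blocks):
    """Sort blocks by (type name, column positions) so order is deterministic."""
    return sorted(blocks, key=lambda b: (b.type_name, tuple(b.mgr_locs)))


def blocks_equal(left, right):
    """Compare two block lists pairwise after canonicalizing their order."""
    if len(left) != len(right):
        return False
    pairs = zip(canonical_order(left), canonical_order(right))
    return all(l == r for l, r in pairs)


a = [
    Block("int64", (0,), (1, 2, 3)),
    Block("category", (1,), ("a", "b", "a")),
    Block("category", (2,), ("x", "y", "x")),
]
b = [a[2], a[0], a[1]]  # same blocks, stored in a different order
c = [a[0], a[1], Block("category", (2,), ("x", "y", "z"))]  # one value differs

naive_equal = a == b               # order-sensitive comparison
canonical_equal = blocks_equal(a, b)
different_equal = blocks_equal(a, c)
```

Because the two categorical blocks share a type name, the column positions break the tie, which is why the second element of the key matters for blocks that do not coalesce.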

jreback · 2015-04-03T03:31:41Z

hmm, can you add a test with 2 categoricals (to your mixed test case)

dsm054 · 2015-04-03T04:01:38Z

Beyond "a:i8;b:category;c:category2;d:category2", # categories, do you mean? Two categoricals only?

jreback · 2015-04-03T11:11:55Z

sorry missed that part of the test

jreback · 2015-04-05T23:10:10Z

merged via e9179fe

thanks @dsm054
always quality code from you!

jreback added the API Design and Internals (Related to non-user accessible pandas implementation) labels Mar 28, 2015

jreback added this to the 0.16.1 milestone Mar 28, 2015

dsm054 force-pushed the equals-disregard-block-order branch from 5e1da3c to cd6dc68 Compare March 29, 2015 01:46

dsm054 force-pushed the equals-disregard-block-order branch 2 times, most recently from 518feaf to 8bde0f8 Compare April 2, 2015 23:51

BUG: DataFrame.equals should not care about block order (GH pandas-de…

18f25e4

…v#9330)

dsm054 force-pushed the equals-disregard-block-order branch from 8bde0f8 to 18f25e4 Compare April 3, 2015 00:03

jreback closed this Apr 5, 2015

jreback mentioned this pull request Apr 5, 2015

False negative on .equals() after read_hdf() #9330

Closed

tui-rob mentioned this pull request Jul 19, 2016

False negative on .equals() if indexes not identically ordered #13708

Closed
