False negative on .equals() after read_hdf() #9330

wikiped · 2015-01-21T21:18:00Z

I get strange results from .equals() when a DataFrame is written to an HDF store and then read back:

import pandas as pd
df = pd.DataFrame({'B':[1,2], 'A':[str('x'), str('y')]})  # str() - just to be sure this is not linked to unicode 
print 'df:'
print df
df.to_hdf('hdf_file', 'key', format='t', mode='w')
df_out = pd.read_hdf('hdf_file', 'key')
print '\ndf_out:'
print df_out
print '\ndf equals df_out:', df.equals(df_out)
print '\ndf_out equals df:', df_out.equals(df)
print '\ndf.shape == df_out.shape:', df.shape == df_out.shape
print '\narray_equivalent(df.values, df_out.values):', pd.core.common.array_equivalent(df.values, df_out.values)
print '\ndf.index equals df_out.index:', df.index.equals(df_out.index)
print '\ndf.columns equals df_out.columns:', df.columns.equals(df_out.columns)
for col in df.columns:
    print '\ndf.{0} equals df_out.{0}: {1}'.format(col, df[col].equals(df_out[col]))

output:

df:
   A  B
0  x  1
1  y  2

df_out:
   A  B
0  x  1
1  y  2

df equals df_out: False

df_out equals df: False

df.shape == df_out.shape: True

array_equivalent(df.values, df_out.values): True

df.index equals df_out.index: True

df.columns equals df_out.columns: True

df.A equals df_out.A: True

df.B equals df_out.B: True

Interestingly, if the DataFrame is initialized with a different column "order" in the dictionary, the results are ALL True (i.e. correct):

df = pd.DataFrame({'A':[1,2], 'B':[str('x'),str('y')]})  # in the code above

will give:

df equals df_out: True
df_out equals df: True

I have seen similar issues (#8437 and #7605), which are marked as closed, but given these strange results, might this be something different?

python 2.7.9, pandas 0.15.2

My apologies in advance for a potential duplicate.


dalejung · 2015-01-21T22:13:33Z

@jreback is the block-type order part of the pandas data model? As in, should IntBlock always come before ObjectBlock in the blocks tuple? If so, maybe the ordering should be enforced in consolidation.

jreback · 2015-01-22T01:07:45Z

ordering / consolidation is an impl detail only and does not impact equals

wikiped · 2015-01-22T08:33:12Z

It also seems that only 'table' format is affected, because with format='fixed' there is no such issue.

jreback · 2015-01-22T11:33:49Z

cc @unutbu

I think this might be a bug/undefined situation. The equals comparator in core/internals.py should compare like blocks in order. But since there is no explicit guarantee of ordering in the consolidation, this may be somewhat undefined.

jreback · 2015-01-22T11:46:17Z

The order in which blocks are written to HDF is arbitrary when data_columns is not fully specified. In the example above, with df.to_hdf(....., data_columns=True), the columns are written individually and thus the order is preserved. When data_columns is not specified or is underspecified, the blocks are iterated, but there is no well-defined order.

So we could do one of the following:

- doc that equals is subject to block creation/consolidation order
- have hdf iterate on blocks in a well-defined order (not sure how to do this nicely though)
- have equals try harder to compare blocks that are potentially in a different order (but this could still fail, as the order within a block is subject to the order in which they are added)

brandon-rhodes · 2015-03-28T00:31:50Z

Note that this bug can affect data frames that have nothing to do with HDF — simply taking a data frame, doing a few rotations on its columns, and sticking it back together can result in an (invisible) change to the order in which its blocks are listed in _data and thus lead .equals() to return False despite == returning only True values. An example:

csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""

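import io

import pandas as pd

# Read the inline CSV text defined above rather than a separate file.
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = [name.lower() for name in df1.columns]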
print(df1)
print()

df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)
print()

print(df1 == df5)
print()

print(df1.equals(df5))

results in:

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

False

It should probably be the responsibility of .equals() to ignore the block order, since block order is not part of its definition.
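For anyone collapsing == down to a single answer as a stand-in for .equals(), here is a small sketch of how the two checks relate. The frame contents and the variable names are made up for illustration. One caveat it shows: NaN compares unequal to itself elementwise, while .equals() treats NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd

# Illustrative data only: one float column with a NaN, one string column.
left = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})
right = left.copy()

# Elementwise comparison reduced to one boolean. NaN != NaN, so the
# NaN cell makes this False even though the frames hold the same data.
elementwise_all = bool((left == right).all().all())

# .equals() considers NaNs in matching positions to be equal.
same_by_equals = left.equals(right)

print(elementwise_all, same_by_equals)
```

So the two checks can disagree in either direction: block order made .equals() stricter than == in the example above, and NaNs make == stricter than .equals().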

dsm054 · 2015-03-28T00:43:53Z

Just as a UI issue, I think DataFrame.equals shouldn't care at all about any of this behind-the-scenes stuff. If that means we replace it entirely with something which doesn't care -- something which checks index, columns, and data (with nan==nan) -- then so be it.
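A rough sketch of what such a check could look like using only public methods. The helper name frames_match is hypothetical and this is not how pandas implements .equals(); it just compares index, columns, and each column's data in turn.

```python
import numpy as np
import pandas as pd


def frames_match(left, right):
    """Return True when index, columns, and every column's data agree.

    Hypothetical helper for illustration; it relies only on public
    methods and never looks at how the data is stored internally.
    """
    if not left.index.equals(right.index):
        return False
    if not left.columns.equals(right.columns):
        return False
    # Series.equals treats NaNs in the same position as equal.
    return all(left[col].equals(right[col]) for col in left.columns)


# Same data, built through two different routes (illustrative values).
a = pd.DataFrame({"A": ["x", "y"], "B": [1.0, np.nan]})
b = pd.DataFrame({"B": [1.0, np.nan], "A": ["x", "y"]})[["A", "B"]]

print(frames_match(a, b))
```

This is slower than a block-wise comparison because it walks the columns one at a time, but the answer cannot depend on internal layout.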

jreback · 2015-03-28T01:31:25Z

this is pretty easy to fix, just consolidate then sort the blocks (on the block dtype) and then compare

dsm054 · 2015-03-28T01:51:10Z

Yeah, I was thinking of something like

        # canonicalize block order
        self_blocks = sorted(self.blocks, key=lambda x: x.dtype)
        other_blocks = sorted(other.blocks, key=lambda x: x.dtype)
        return all(block.equals(oblock) for block, oblock in
                   zip(self_blocks, other_blocks))

since it's already consolidating.

Actually, now that I think of it, do we know by this point that the number of blocks is the same? Otherwise the zip will stop too early if len(self.blocks) < len(other.blocks).
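To make the length concern concrete, here is a toy sketch of the "canonicalize, then compare" idea with an explicit length guard. ToyBlock and blocks_equal are made-up stand-ins, not the pandas Block class or the actual fix. The sort key is an assumption as well: it uses the dtype name plus the column positions, so that two blocks of the same dtype still sort deterministically.

```python
from collections import namedtuple

# Toy stand-in for an internal block: a dtype name, the column
# positions it covers, and its values. Not the pandas Block class.
ToyBlock = namedtuple("ToyBlock", ["dtype_name", "positions", "values"])


def blocks_equal(self_blocks, other_blocks):
    """Compare two block lists without depending on their stored order."""
    # Guard first: zip() silently stops at the shorter input.
    if len(self_blocks) != len(other_blocks):
        return False

    def canonical_key(block):
        # A string plus a tuple of ints sorts with a total order.
        return (block.dtype_name, block.positions)

    self_sorted = sorted(self_blocks, key=canonical_key)
    other_sorted = sorted(other_blocks, key=canonical_key)
    return all(a == b for a, b in zip(self_sorted, other_sorted))


int_block = ToyBlock("int64", (0,), ((1, 2),))
obj_block = ToyBlock("object", (1,), (("x", "y"),))

# Same blocks listed in a different order.
print(blocks_equal([int_block, obj_block], [obj_block, int_block]))
# Different number of blocks.
print(blocks_equal([int_block], [int_block, obj_block]))
```

Without the length check, the second call would compare only the first pair and report a match.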


jreback · 2015-04-05T23:10:37Z

closed by #9745

jreback added the Internals (related to non-user accessible pandas implementation) label Jan 22, 2015

jreback added the API Design and Bug labels Jan 22, 2015

jreback modified the milestones: Next Major Release, 0.16.1 Mar 28, 2015

dsm054 mentioned this issue Mar 28, 2015

BUG: DataFrame.equals should not care about block order (GH #9330) #9745

Closed

dsm054 added a commit to dsm054/pandas that referenced this issue Apr 3, 2015

BUG: DataFrame.equals should not care about block order (GH pandas-de…

18f25e4

…v#9330)

jreback pushed a commit that referenced this issue Apr 5, 2015

BUG: DataFrame.equals should not care about block order (GH #9330)

e9179fe

jreback closed this as completed Apr 5, 2015

tui-rob mentioned this issue Jul 19, 2016

False negative on .equals() if indexes not identically ordered #13708

Closed
