False negative on .equals() after read_hdf() #9330

Closed
wikiped opened this issue Jan 21, 2015 · 10 comments
Labels
API Design, Bug, Internals (Related to non-user accessible pandas implementation)

@wikiped

wikiped commented Jan 21, 2015

I get strange results from .equals() when a DataFrame is written to an HDF store and then read back:

import pandas as pd
df = pd.DataFrame({'B':[1,2], 'A':[str('x'), str('y')]})  # str() - just to be sure this is not linked to unicode 
print 'df:'
print df
df.to_hdf('hdf_file', 'key', format='t', mode='w')
df_out = pd.read_hdf('hdf_file', 'key')
print '\ndf_out:'
print df_out
print '\ndf equals df_out:', df.equals(df_out)
print '\ndf_out equals df:', df_out.equals(df)
print '\ndf.shape == df_out.shape:', df.shape == df_out.shape
print '\narray_equivalent(df.values, df_out.values):', pd.core.common.array_equivalent(df.values, df_out.values)
print '\ndf.index equals df_out.index:', df.index.equals(df_out.index)
print '\ndf.columns equals df_out.columns:', df.columns.equals(df_out.columns)
for col in df.columns:
    print '\ndf.{0} equals df_out.{0}: {1}'.format(col, df[col].equals(df_out[col]))

output:

df:
   A  B
0  x  1
1  y  2

df_out:
   A  B
0  x  1
1  y  2

df equals df_out: False

df_out equals df: False

df.shape == df_out.shape: True

array_equivalent(df.values, df_out.values): True

df.index equals df_out.index: True

df.columns equals df_out.columns: True

df.A equals df_out.A: True

df.B equals df_out.B: True

Interestingly, if the DataFrame is initialized with the columns in a different "order" in the dictionary, the results are ALL True (i.e. correct):

df = pd.DataFrame({'A':[1,2], 'B':[str('x'),str('y')]})  # in the code above

will give:

df equals df_out: True
df_out equals df: True

I have seen similar issues (#8437 and #7605), which are marked as closed, but given these strange results, this might be something different.

python 2.7.9, pandas 0.15.2

My apologies in advance for a potential duplicate.

@dalejung
Contributor

@jreback is the block-type order part of the pandas data model? That is, should IntBlock always come before ObjectBlock in the blocks tuple? If so, maybe the ordering should be enforced in consolidation.

@jreback
Contributor

jreback commented Jan 22, 2015

Ordering/consolidation is an implementation detail only and does not impact equals.

@wikiped
Author

wikiped commented Jan 22, 2015

It also seems that only the 'table' format is affected: with format='fixed' there is no such issue.

@jreback
Contributor

jreback commented Jan 22, 2015

cc @unutbu

I think this might be a bug/undefined situation. The equals comparator in core/internals.py should compare like blocks in order. But since there is no explicit guarantee of ordering in the consolidation, this may be somewhat undefined.

@jreback added the Internals (Related to non-user accessible pandas implementation) label Jan 22, 2015
@jreback
Contributor

jreback commented Jan 22, 2015

The order in which blocks are written to HDF is arbitrary when data_columns is not fully specified. If the example above used df.to_hdf(....., data_columns=True), the columns would be written individually and the order would be preserved. When data_columns is not specified or is underspecified, the blocks are iterated, but there is no well-defined order.

So we could do one of the following:

  • document that equals is subject to block creation/consolidation order
  • have hdf iterate on blocks in a well-defined order (not sure how to do this nicely though)
  • have equals try harder to compare blocks that are potentially in a different order (but this could still fail, as the order within a block is subject to the order in which they are added)

@brandon-rhodes
Contributor

Note that this bug can affect data frames that have nothing to do with HDF. Simply taking a data frame, doing a few rotations on its columns, and sticking it back together can result in an (invisible) change to the order in which its blocks are listed in _data, and thus lead .equals() to return False despite == returning only True values. An example:

csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""

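import io

import pandas as pd

# Read the inline CSV text defined above rather than a separate sample.csv file
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = [name.lower() for name in df1.columns]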
print(df1)
print()

df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)
print()

print(df1 == df5)
print()

print(df1.equals(df5))

results in:

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

False

It should probably be the responsibility of .equals() to ignore the block order, since block order is not part of its definition.

@dsm054
Contributor

dsm054 commented Mar 28, 2015

Just as a UI issue, I think DataFrame.equals shouldn't care at all about any of this behind-the-scenes stuff. If that means we replace it entirely with something which doesn't care -- something which checks index, columns, and data (with nan==nan) -- then so be it.
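A minimal sketch of what such a check could look like, using only public pandas calls. The name `frames_equal` is hypothetical (it is not a pandas function), and the sketch assumes both arguments are DataFrames. It leans on `Series.equals`, which treats NaNs in the same position as equal, so internal block layout never enters the comparison.

```python
import numpy as np
import pandas as pd


def frames_equal(left, right):
    """Hypothetical helper: compare labels and data, ignoring internal layout."""
    if left.shape != right.shape:
        return False
    if not left.index.equals(right.index):
        return False
    if not left.columns.equals(right.columns):
        return False
    # Compare column by column, by position, so duplicate labels also work.
    # Series.equals treats NaNs in the same location as equal.
    return all(
        left.iloc[:, i].equals(right.iloc[:, i])
        for i in range(left.shape[1])
    )


# Two frames with the same labels and data, built in different column orders.
a = pd.DataFrame({"B": [1.0, np.nan], "A": ["x", "y"]}, columns=["B", "A"])
b = pd.DataFrame({"A": ["x", "y"], "B": [1.0, np.nan]}, columns=["A", "B"])
b = b[["B", "A"]]

print(frames_equal(a, b))
```

Comparing one column at a time is slower than comparing whole blocks, which is presumably why the internal implementation works on blocks instead.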

@jreback
Contributor

jreback commented Mar 28, 2015

This is pretty easy to fix: just consolidate, then sort the blocks (on the block dtype), and then compare.

@jreback modified the milestones: Next Major Release, 0.16.1 Mar 28, 2015
@dsm054
Contributor

dsm054 commented Mar 28, 2015

Yeah, I was thinking of something like

        # canonicalize block order
        self_blocks = sorted(self.blocks, key=lambda x: x.dtype)
        other_blocks = sorted(other.blocks, key=lambda x: x.dtype)
        return all(block.equals(oblock) for block, oblock in
                   zip(self_blocks, other_blocks))

since it's already consolidating.

Actually, now that I think of it, do we know by this point that the number of blocks is the same? Otherwise the zip will stop too early whenever len(self.blocks) != len(other.blocks), silently ignoring the extra blocks on the longer side.
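To make the idea concrete outside of pandas internals, here is a self-contained sketch. `Block`, `canonical_key`, and `blocks_equal` are hypothetical stand-ins rather than the real pandas classes or the code merged for this issue. Each stand-in block records which column positions it holds plus a 2-D values array. The sort key is the dtype name plus the column positions, on the assumption that two blocks can share a dtype when a frame is not fully consolidated. The length check comes first, so `zip` cannot silently drop trailing blocks.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for an internal block: the column positions it
# covers and a 2-D array of values (one row per column in the block).
Block = namedtuple("Block", ["positions", "values"])


def canonical_key(block):
    # dtype name first, then column positions to break ties between
    # blocks that share a dtype.
    return (block.values.dtype.name, tuple(block.positions))


def blocks_equal(self_blocks, other_blocks):
    # Guard against zip() stopping early on unequal lengths.
    if len(self_blocks) != len(other_blocks):
        return False
    left = sorted(self_blocks, key=canonical_key)
    right = sorted(other_blocks, key=canonical_key)
    return all(
        tuple(a.positions) == tuple(b.positions)
        and a.values.dtype == b.values.dtype
        and np.array_equal(a.values, b.values)
        for a, b in zip(left, right)
    )


int_block = Block((1,), np.array([[1, 2]]))
obj_block = Block((0,), np.array([["x", "y"]], dtype=object))

# Same blocks listed in a different order still compare equal.
print(blocks_equal([int_block, obj_block], [obj_block, int_block]))
```

Sorting on the dtype name (a string) rather than the dtype object avoids relying on how NumPy orders dtypes, which reflects casting rules rather than a total order.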

@jreback
Contributor

jreback commented Apr 5, 2015

closed by #9745

5 participants