Skip to content

BUG: DataFrame.equals() wrongly returns True in case of identical blocks with different locations #28839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wings-xf opened this issue Oct 8, 2019 · 8 comments · Fixed by #29657
Labels
Milestone

Comments

@wings-xf
Copy link

wings-xf commented Oct 8, 2019

Code Sample, a copy-pastable example if possible

  version: 3.6.8
# Your code here
  df3 = pd.DataFrame({'a': [1, 2], 'b': ['s', 'd']})
  df4 = pd.DataFrame({'a': ['s', 'd'], 'b': [1, 2]})
  df3.equals(df4)

Problem description

image

When I read the source code, I did a simple test on it, and then failed.

Expected Output

I expected it return False

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.2
setuptools : 40.6.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@jorisvandenbossche
Copy link
Member

Yes, that's clearly a bug. Thanks for finding that!

We seem to check the equality block by block. But that way we ignore the block locations .. (and thus how they map to column names)

@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Oct 8, 2019
@jorisvandenbossche jorisvandenbossche changed the title I'm puzzled about DataFrame.equals() BUG: DataFrame.equals() wrongly returns True in case of identical blocks with different locations Oct 8, 2019
@bhuvanakundumani
Copy link
Contributor

I am interested in working on this .

@jorisvandenbossche
Copy link
Member

@bhuvanakundumani Super, that would be very welcome! If you have any question or need a pointer, let me know.

@bhuvanakundumani
Copy link
Contributor

bhuvanakundumani commented Oct 15, 2019

@jorisvandenbossche - I am reading more about BlockManger here : https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst.
Hope am in the right direction.

@jorisvandenbossche
Copy link
Member

@bhuvanakundumani that's a document describing a potential refactor of the internal BlockManager, so although it contains some description of the current sate, I think it is not the best document to read to understand the current implementation. I think in general we don't really have a good documentation of this, sorry ..

The code where I would start looking is at

def equals(self, other):
self_axes, other_axes = self.axes, other.axes
if len(self_axes) != len(other_axes):
return False
if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
return False
self._consolidate_inplace()
other._consolidate_inplace()
if len(self.blocks) != len(other.blocks):
return False
# canonicalize block order, using a tuple combining the type
# name and then mgr_locs because there might be unconsolidated
# blocks (say, Categorical) which can only be distinguished by
# the iteration order
def canonicalize(block):
return (block.dtype.name, block.mgr_locs.as_array.tolist())
self_blocks = sorted(self.blocks, key=canonicalize)
other_blocks = sorted(other.blocks, key=canonicalize)
return all(
block.equals(oblock) for block, oblock in zip(self_blocks, other_blocks)
)

We might need an additional check that the block locations are also equal.

@bhuvanakundumani
Copy link
Contributor

@jorisvandenbossche thanks for the info. will look into it.

@TomAugspurger
Copy link
Contributor

Pushing off the 1.0 milestone. Would certainly take a PR for 1.0 if anyone is able to get to it in time.

@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Nov 12, 2019
@Reksbril
Copy link
Contributor

I will look into it

Reksbril pushed a commit to Reksbril/pandas that referenced this issue Nov 16, 2019
…andas-dev#28839)

The function was returning True in case shown in added test. The cause
of the problem was sorting Blocks of DataFrame by type, and then
mgr_locs before comparison. It resulted in arranging the identical blocks
in the same way, which resulted in having the same two lists of blocks.
Changing sorting order to (mgr_locs, type) resolves the problem, while not
interrupting the other aspects of comparison.
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Nov 16, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this issue Nov 18, 2019
…andas-dev#28839)

The function was returning True in case shown in added test. The cause
of the problem was sorting Blocks of DataFrame by type, and then
mgr_locs before comparison. It resulted in arranging the identical blocks
in the same way, which resulted in having the same two lists of blocks.
Changing sorting order to (mgr_locs, type) resolves the problem, while not
interrupting the other aspects of comparison.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants