BUG: DataFrame.equals() wrongly returns True in case of identical blocks with different locations #28839

wings-xf · 2019-10-08T10:18:37Z

Code Sample, a copy-pastable example if possible

  # Python version: 3.6.8
  import pandas as pd

  df3 = pd.DataFrame({'a': [1, 2], 'b': ['s', 'd']})
  df4 = pd.DataFrame({'a': ['s', 'd'], 'b': [1, 2]})
  df3.equals(df4)  # returns True on pandas 0.25.0; False was expected

Problem description

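While reading the source code, I ran a simple test on `DataFrame.equals()`, and it failed: the call above returns True even though the two frames hold different data under each column name.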

Expected Output

I expected it to return False.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.2
setuptools : 40.6.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None


jorisvandenbossche · 2019-10-08T13:32:46Z

Yes, that's clearly a bug. Thanks for finding that!

We seem to check equality block by block, but that way we ignore the block locations (and thus how the blocks map to column names).

bhuvanakundumani · 2019-10-10T05:54:01Z

I am interested in working on this.

jorisvandenbossche · 2019-10-10T06:15:44Z

@bhuvanakundumani Super, that would be very welcome! If you have any question or need a pointer, let me know.

bhuvanakundumani · 2019-10-15T13:36:10Z

@jorisvandenbossche - I am reading more about the BlockManager here: https://github.com/pydata/pandas-design/blob/master/source/internal-architecture.rst.
I hope I am heading in the right direction.

jorisvandenbossche · 2019-10-18T18:17:32Z

@bhuvanakundumani that's a document describing a potential refactor of the internal BlockManager, so although it contains some description of the current state, I think it is not the best document to read to understand the current implementation. In general, I don't think we really have good documentation of this, sorry.

The code where I would start looking is at

pandas/pandas/core/internals/managers.py

Lines 1398 to 1420 in 58d34d9

    
    def equals(self, other):
        self_axes, other_axes = self.axes, other.axes
        if len(self_axes) != len(other_axes):
            return False
        if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
            return False
        self._consolidate_inplace()
        other._consolidate_inplace()
        if len(self.blocks) != len(other.blocks):
            return False

        # canonicalize block order, using a tuple combining the type
        # name and then mgr_locs because there might be unconsolidated
        # blocks (say, Categorical) which can only be distinguished by
        # the iteration order
        def canonicalize(block):
            return (block.dtype.name, block.mgr_locs.as_array.tolist())

        self_blocks = sorted(self.blocks, key=canonicalize)
        other_blocks = sorted(other.blocks, key=canonicalize)
        return all(
            block.equals(oblock) for block, oblock in zip(self_blocks, other_blocks)
        )

We might need an additional check that the block locations are also equal.
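To make the idea of an additional location check concrete, here is a minimal pure-Python sketch. It does not use pandas internals: `Block`, `equals_values_only`, and `equals_with_locs` are hypothetical stand-ins invented for illustration, and the block layout assumed for `df3` and `df4` (one `int64` block and one `object` block each) is an assumption about how the frames from the report are stored.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Toy stand-in for an internal block (hypothetical, not the pandas class)."""
    dtype: str             # dtype name, e.g. "int64"
    mgr_locs: List[int]    # column positions the block occupies
    values: List[list]     # one list of values per column in the block


def canonicalize(block: Block):
    # Same ordering as the quoted method: dtype name first, then locations.
    return (block.dtype, block.mgr_locs)


def equals_values_only(left: List[Block], right: List[Block]) -> bool:
    # Mimics the quoted logic: sort both sides, then compare values pairwise.
    if len(left) != len(right):
        return False
    pairs = zip(sorted(left, key=canonicalize), sorted(right, key=canonicalize))
    return all(a.values == b.values for a, b in pairs)


def equals_with_locs(left: List[Block], right: List[Block]) -> bool:
    # Same as above, but additionally requires matching block locations.
    if len(left) != len(right):
        return False
    pairs = zip(sorted(left, key=canonicalize), sorted(right, key=canonicalize))
    return all(a.mgr_locs == b.mgr_locs and a.values == b.values for a, b in pairs)


# Assumed layout of df3 = {'a': [1, 2], 'b': ['s', 'd']}
df3_blocks = [Block("int64", [0], [[1, 2]]), Block("object", [1], [["s", "d"]])]
# Assumed layout of df4 = {'a': ['s', 'd'], 'b': [1, 2]}
df4_blocks = [Block("object", [0], [["s", "d"]]), Block("int64", [1], [[1, 2]])]

print(equals_values_only(df3_blocks, df4_blocks))  # True (the reported bug)
print(equals_with_locs(df3_blocks, df4_blocks))    # False
```

In this toy model, sorting by dtype pairs the two `int64` blocks and the two `object` blocks with each other even though they sit at different column positions, which is why a values-only comparison succeeds.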

bhuvanakundumani · 2019-10-21T05:25:37Z

@jorisvandenbossche thanks for the info. will look into it.

TomAugspurger · 2019-11-12T15:15:13Z

Pushing off the 1.0 milestone. Would certainly take a PR for 1.0 if anyone is able to get to it in time.

Reksbril · 2019-11-15T15:21:25Z

I will look into it

…andas-dev#28839) The function was returning True in the case shown in the added test. The cause was that the blocks of each DataFrame were sorted by type and then by mgr_locs before comparison. Identical blocks were therefore arranged in the same order in both frames, which produced two identical lists of blocks. Changing the sort order to (mgr_locs, type) resolves the problem without affecting the other aspects of the comparison.
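The effect of swapping the sort key can be illustrated with a small pure-Python sketch. This is not the code from the pull request: blocks are modelled as plain `(dtype, mgr_locs, values)` tuples, and `old_key`, `new_key`, and `blocks_equal` are hypothetical names chosen for the illustration.

```python
def old_key(block):
    # Original ordering: dtype name first, then block locations.
    dtype, locs, _values = block
    return (dtype, locs)


def new_key(block):
    # Ordering described in the commit message: locations first, then dtype.
    dtype, locs, _values = block
    return (locs, dtype)


def blocks_equal(left, right, key):
    # Sort both block lists with the given key, then compare pairwise on
    # dtype and values only (locations are deliberately not compared here).
    if len(left) != len(right):
        return False
    pairs = zip(sorted(left, key=key), sorted(right, key=key))
    return all(a[0] == b[0] and a[2] == b[2] for a, b in pairs)


# Assumed block layout for the two frames from the report.
df3_blocks = [("int64", [0], [[1, 2]]), ("object", [1], [["s", "d"]])]
df4_blocks = [("object", [0], [["s", "d"]]), ("int64", [1], [[1, 2]])]

print(blocks_equal(df3_blocks, df4_blocks, old_key))  # True: blocks line up by dtype
print(blocks_equal(df3_blocks, df4_blocks, new_key))  # False: blocks line up by position
print(blocks_equal(df3_blocks, df3_blocks, new_key))  # True: identical frames still match
```

With locations leading the key, the block at column position 0 in one frame is compared with the block at position 0 in the other, so an `int64` block meets an `object` block and the comparison fails as expected.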

…das-dev#29657)

jorisvandenbossche added the Bug label Oct 8, 2019

jorisvandenbossche added this to the 1.0 milestone Oct 8, 2019

jorisvandenbossche changed the title ~~I'm puzzled about DataFrame.equals()~~ BUG: DataFrame.equals() wrongly returns True in case of identical blocks with different locations Oct 8, 2019

TomAugspurger modified the milestones: 1.0, Contributions Welcome Nov 12, 2019

Reksbril mentioned this issue Nov 16, 2019

BUG: resolved problem with DataFrame.equals() (#28839) #29657

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.0 Nov 16, 2019

jreback closed this as completed in #29657 Nov 19, 2019

jreback pushed a commit that referenced this issue Nov 19, 2019

BUG: resolved problem with DataFrame.equals() (#28839) (#29657)

3005908

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

BUG: resolved problem with DataFrame.equals() (pandas-dev#28839) (pan…

e38d176

…das-dev#29657)

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

BUG: resolved problem with DataFrame.equals() (pandas-dev#28839) (pan…

274ec11

…das-dev#29657)
