Skip to content

REF: dont consolidate in BlockManager.equals #34962

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 6, 2020

Conversation

jbrockmendel
Copy link
Member

Avoid side-effects.

Side-note: with EA.equals recently added to the interface, should we update array_equivalent to use it? ATM array_equivalent starts by casting both inputs to ndarray.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, we no longer use the Block.equals methods?

return False
elif isinstance(left, ExtensionArray):
if not left.equals(right):
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be return left.equals(right) ?

And same for array_equivalent below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, will update

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, wait, not here. this is an "all" check, so we only return on False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, it's inside the for loop ..

@jorisvandenbossche
Copy link
Member

Side-note: with EA.equals recently added to the interface, should we update array_equivalent to use it?

Probably, yes

@jbrockmendel
Copy link
Member Author

With this change, we no longer use the Block.equals methods?

I think this is correct, will double-check and update.

return all(
block.equals(oblock) for block, oblock in zip(self_blocks, other_blocks)
)
if self.ndim == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give some comments here (or at least a comment about the ordering of the checks)

else:
return array_equivalent(blk.values, rblk.values)

for i in range(len(self.items)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment added

@jreback jreback added Internals Related to non-user accessible pandas implementation Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 24, 2020
@jbrockmendel
Copy link
Member Author

gentle ping

@@ -389,7 +389,7 @@ def test_copy(self, mgr):

# copy assertion we either have a None for a base or in case of
# some blocks it is an array (e.g. datetimetz), but was copied
assert cp_blk.equals(blk)
tm.assert_equal(cp_blk.values, blk.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should equals also work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Block.equals is removed

@jreback jreback added this to the 1.1 milestone Jul 6, 2020
@jreback jreback merged commit 77f4a4d into pandas-dev:master Jul 6, 2020
@jbrockmendel jbrockmendel deleted the ref-mgr-equals-2 branch July 6, 2020 00:51
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Jul 7, 2020
* REF: dont consolidate in BlockManager.equals

* doctest fixup

* Remove Block.equals

* simplify, comments
@TomAugspurger
Copy link
Contributor

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 8, 2020

Some slowdown can probably expected for this case since it went from block-wise to column-wise (while having a consolidated block).

But I think a large part of the slowdown is due to array_equivalent being much slower for certain dtypes right now. For example, the FloatBlock had a custom equals method not using array_equivalent but:

def equals(self, other) -> bool:
        if self.dtype != other.dtype or self.shape != other.shape:
            return False
        left, right = self.values, other.values
        return ((left == right) | (np.isnan(left) & np.isnan(right))).all()

and this is quite a bit faster:

In [5]: a = np.random.randn(N)      

In [6]: %timeit ((a == a) | (np.isnan(a) & np.isnan(a))).all()    
5.28 µs ± 366 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit pd.core.dtypes.missing.array_equivalent(a, a)    
57.7 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

@jorisvandenbossche
Copy link
Member

I suppose that a similar "fastpath" could be added to array_equivalent, though?

Also for DatetimeTZBlock, there was an override of equals for performance:

    def equals(self, other) -> bool:
        # override for significant performance improvement
        if self.dtype != other.dtype or self.shape != other.shape:
            return False
        return (self.values.view("i8") == other.values.view("i8")).all()

@jbrockmendel
Copy link
Member Author

I think the thing to do here is optimize array_equivalent.

@jbrockmendel
Copy link
Member Author

see also #32339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants