BUG: assert_frame_equal can be very slow #38183


Closed
2 of 3 tasks
ivirshup opened this issue Nov 30, 2020 · 11 comments · Fixed by #38202
Labels: Bug; Performance (memory or execution speed performance); Testing (pandas testing functions or related to the test suite)
Milestone

Comments

@ivirshup
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

Comparisons that each take less than a second take 50 seconds when they are combined:

import pandas as pd
import numpy as np
from string import ascii_letters

idx = np.random.choice(np.array(list(ascii_letters)), size=10000).astype(object)
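a = np.random.randint(0, 101, (10000, 1000))  # random_integers is deprecated; randint's upper bound is exclusive
b = a.copy()  # `b` is used in the last comparison below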

# This is fine
%time pd.testing.assert_frame_equal(pd.DataFrame(a), pd.DataFrame(a.copy()))
# CPU times: user 409 ms, sys: 20.8 ms, total: 429 ms
# Wall time: 429 ms

# As is this
%time pd.testing.assert_index_equal(pd.Index(idx), pd.Index(idx.copy()))
# CPU times: user 1.06 ms, sys: 26 µs, total: 1.08 ms
# Wall time: 1.08 ms

# This is weird:
%time pd.testing.assert_frame_equal(pd.DataFrame(a, index=idx), pd.DataFrame(b, index=idx))
# CPU times: user 50.5 s, sys: 62.6 ms, total: 50.6 s
# Wall time: 50.6 s

Problem description

It seems weird that the whole process slows down when the index has strings in it. It's especially weird since comparing the index itself is fairly fast. This may be a regression, since I've never noticed this before, and this seems very noticeable.

This might have something to do with #38091

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.5.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Oct 29 22:56:45 PDT 2020; root:xnu-6153.141.2.2~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.2
Cython           : 0.29.21
pytest           : 6.1.2
hypothesis       : None
sphinx           : 3.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.4
fastparquet      : 0.4.1
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 2.0.0
pytables         : None
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.5.4
sqlalchemy       : 1.3.18
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.1
xlrd             : 1.2.0
xlwt             : None
numba            : 0.51.2
@ivirshup added the Bug and Needs Triage labels Nov 30, 2020
@jreback
Contributor

jreback commented Nov 30, 2020

interesting! thanks. can you see if you can pin down where this is slow?

@jreback added the Performance and Testing labels and removed the Needs Triage label Nov 30, 2020
@jreback added this to the 1.2 milestone Nov 30, 2020
@jreback
Contributor

jreback commented Nov 30, 2020

cc @jbrockmendel

@leonarduschen
Contributor

leonarduschen commented Nov 30, 2020

I believe this is the issue:

# TODO: fastpath for pandas' StringDtype

Prun results below. Out of 62s total runtime, 61s of cumtime comes from _array_equivalent_object in missing.py

%prun time pd.testing.assert_frame_equal(pd.DataFrame(a, index=idx), pd.DataFrame(a, index=idx))
PRUN RESULTS

all time: 1min 2s
 
         110939878 function calls (110925852 primitive calls) in 62.198 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10011003   13.073    0.000   13.073    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     1001    9.925    0.010   61.384    0.061 missing.py:456(_array_equivalent_object)
 10010000    9.884    0.000   26.166    0.000 fromnumeric.py:70(_wrapreduction)
10019015/10018015    5.749    0.000    5.751    0.000 {built-in method numpy.array}
 10010000    5.307    0.000   31.473    0.000 fromnumeric.py:2249(any)
 10010000    5.210    0.000   41.858    0.000 <__array_function__ internals>:2(any)
 10011001    3.869    0.000   35.531    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
10019013/10018013    2.320    0.000    8.071    0.000 _asarray.py:14(asarray)
 10010000    2.146    0.000    2.146    0.000 fromnumeric.py:71(<dictcomp>)
 10148884    1.593    0.000    1.632    0.000 {built-in method builtins.isinstance}
 10010000    1.308    0.000    1.308    0.000 fromnumeric.py:2245(_any_dispatcher)
 10010000    1.070    0.000    1.070    0.000 {method 'items' of 'dict' objects}
     1001    0.176    0.000    0.189    0.000 numeric.py:2317(array_equal)
     1000    0.029    0.000   61.856    0.062 _testing.py:1217(assert_series_equal)
    87153    0.024    0.000    0.033    0.000 generic.py:10(_check)
   120236    0.022    0.000    0.031    0.000 {built-in method builtins.getattr}
     2002    0.018    0.000   61.699    0.031 {built-in method pandas._libs.testing.assert_almost_equal}
        1    0.018    0.018   62.196   62.196 _testing.py:1419(assert_frame_equal)
     1002    0.016    0.000   61.496    0.061 _testing.py:660(assert_index_equal)
     2000    0.015    0.000    0.166    0.000 indexing.py:757(_getitem_lowerdim)
     2000    0.014    0.000    0.041    0.000 managers.py:984(iget)
    23043    0.013    0.000    0.062    0.000 base.py:256(is_dtype)
     2002    0.013    0.000   61.676    0.031 missing.py:358(array_equivalent)
     4005    0.012    0.000    0.049    0.000 _testing.py:844(assert_attr_equal)
34078/22054    0.012    0.000    0.019    0.000 {built-in method builtins.len}
     4000    0.011    0.000    0.022    0.000 generic.py:5141(__setattr__)
     2000    0.010    0.000    0.241    0.000 indexing.py:864(__getitem__)
    11000    0.009    0.000    0.017    0.000 managers.py:1602(dtype)
     8026    0.009    0.000    0.016    0.000 common.py:1600(_is_dtype_type)
    11745    0.009    0.000    0.009    0.000 {built-in method _abc._abc_instancecheck}
     2000    0.008    0.000    0.033    0.000 series.py:201(__init__)
     7013    0.008    0.000    0.026    0.000 dtypes.py:906(is_dtype)
     8019    0.007    0.000    0.034    0.000 common.py:530(is_categorical_dtype)
     6004    0.007    0.000    0.074    0.000 common.py:1180(needs_i8_conversion)
     2000    0.007    0.000    0.123    0.000 indexing.py:1474(_getitem_axis)
     2000    0.007    0.000    0.100    0.000 frame.py:2816(_ixs)
     6000    0.007    0.000    0.029    0.000 common.py:97(is_bool_indexer)
    10000    0.007    0.000    0.014    0.000 {pandas._libs.lib.is_list_like}
     6004    0.007    0.000    0.022    0.000 common.py:1025(is_datetime_or_timedelta_dtype)
     4000    0.007    0.000    0.041    0.000 indexing.py:1334(_validate_key)
    11000    0.006    0.000    0.024    0.000 series.py:427(dtype)
     2002    0.006    0.000    0.015    0.000 blocks.py:124(__init__)
     1002    0.006    0.000    0.006    0.000 frozen.py:76(__eq__)
     6000    0.006    0.000    0.008    0.000 generic.py:377(_get_axis)
     7015    0.006    0.000    0.033    0.000 common.py:456(is_period_dtype)
     2000    0.005    0.000    0.041    0.000 frame.py:3184(_box_col_values)
     4000    0.005    0.000    0.017    0.000 indexing.py:1419(_validate_integer)
     2002    0.005    0.000    0.005    0.000 blocks.py:237(mgr_locs)
     6000    0.005    0.000    0.007    0.000 indexing.py:866(<genexpr>)
     2000    0.005    0.000    0.045    0.000 indexing.py:694(_has_valid_tuple)
     4000    0.005    0.000    0.006    0.000 range.py:697(__getitem__)
     2000    0.005    0.000    0.009    0.000 indexing.py:1400(_is_scalar_access)
     4000    0.004    0.000    0.008    0.000 series.py:442(name)
    13000    0.004    0.000    0.004    0.000 managers.py:1575(_block)
     2002    0.004    0.000    0.004    0.000 generic.py:195(__init__)
     6006    0.004    0.000    0.019    0.000 common.py:381(is_datetime64tz_dtype)
     2000    0.004    0.000    0.015    0.000 indexing.py:709(_is_nested_tuple_indexer)
    11745    0.004    0.000    0.013    0.000 abc.py:96(__instancecheck__)
     3003    0.004    0.000    0.031    0.000 common.py:566(is_string_dtype)
    11002    0.004    0.000    0.004    0.000 blocks.py:315(dtype)
     2004    0.004    0.000    0.005    0.000 base.py:1213(_get_names)
     2000    0.004    0.000    0.019    0.000 blocks.py:256(make_block_same_class)
     1000    0.004    0.000    0.006    0.000 range.py:341(__contains__)
     3001    0.004    0.000    0.027    0.000 {built-in method builtins.any}
     2011    0.004    0.000    0.012    0.000 common.py:492(is_interval_dtype)
     2000    0.003    0.000    0.003    0.000 generic.py:5123(__getattr__)
     2000    0.003    0.000    0.003    0.000 series.py:398(_set_axis)
     2009    0.003    0.000    0.009    0.000 dtypes.py:1119(is_dtype)
     2000    0.003    0.000    0.004    0.000 managers.py:1532(__init__)
     2000    0.003    0.000    0.214    0.000 indexing.py:1441(_getitem_tuple)
     3003    0.003    0.000    0.023    0.000 common.py:595(condition)
     7011    0.003    0.000    0.004    0.000 common.py:1565(_get_dtype)
     2000    0.003    0.000    0.008    0.000 generic.py:3180(_set_as_cached)
     2002    0.003    0.000    0.005    0.000 common.py:608(is_dtype_equal)
     4000    0.003    0.000    0.006    0.000 indexing.py:2258(is_label_like)
     1018    0.003    0.000    0.005    0.000 common.py:1460(is_extension_array_dtype)
     3003    0.003    0.000    0.027    0.000 common.py:1541(_is_dtype)
     8016    0.003    0.000    0.004    0.000 common.py:180(<lambda>)
     1000    0.003    0.000    0.012    0.000 generic.py:1719(__contains__)
     1745    0.003    0.000    0.012    0.000 inference.py:31(is_number)
     1002    0.003    0.000    0.005    0.000 _testing.py:711(_check_types)
     2000    0.002    0.000    0.002    0.000 blocks.py:319(iget)
     6000    0.002    0.000    0.002    0.000 generic.py:365(_get_axis_number)
     2000    0.002    0.000    0.004    0.000 series.py:492(name)
     1001    0.002    0.000    0.193    0.000 <__array_function__ internals>:2(array_equal)
     2000    0.002    0.000    0.002    0.000 common.py:283(is_null_slice)
     8016    0.002    0.000    0.002    0.000 common.py:178(classes)
    10000    0.002    0.000    0.002    0.000 {pandas._libs.lib.is_integer}
     2002    0.002    0.000    0.002    0.000 _testing.py:818(assert_class_equal)
     6000    0.002    0.000    0.007    0.000 indexing.py:715(<genexpr>)
     2003    0.002    0.000    0.002    0.000 _testing.py:458(_check_isinstance)
     1001    0.002    0.000    0.029    0.000 common.py:1123(is_datetimelike_v_numeric)
     2000    0.002    0.000    0.004    0.000 generic.py:471(ndim)
     6000    0.002    0.000    0.012    0.000 inference.py:185(is_array_like)
     1001    0.002    0.000    0.011    0.000 {method 'all' of 'numpy.ndarray' objects}
     9070    0.002    0.000    0.002    0.000 {built-in method builtins.issubclass}
     2000    0.002    0.000    0.005    0.000 managers.py:306(__len__)
     4014    0.002    0.000    0.003    0.000 range.py:687(__len__)
     1000    0.002    0.000    0.003    0.000 base.py:573(__array__)
     2000    0.002    0.000    0.003    0.000 managers.py:1613(internal_values)
     2000    0.002    0.000    0.002    0.000 frame.py:568(axes)
     4004    0.002    0.000    0.017    0.000 common.py:603(<genexpr>)
     2004    0.002    0.000    0.003    0.000 base.py:3838(values)
     4010    0.002    0.000    0.002    0.000 base.py:567(__len__)
     1018    0.002    0.000    0.002    0.000 base.py:413(find)
     2000    0.002    0.000    0.002    0.000 indexing.py:100(iloc)
     4000    0.002    0.000    0.002    0.000 common.py:329(apply_if_callable)
     1007    0.002    0.000    0.003    0.000 common.py:1296(is_float_dtype)
     2002    0.002    0.000    0.002    0.000 blocks.py:135(_check_ndim)
     2000    0.001    0.000    0.004    0.000 series.py:540(_values)
     2004    0.001    0.000    0.001    0.000 base.py:1175(name)
     4000    0.001    0.000    0.005    0.000 indexers.py:52(is_list_like_indexer)
     2004    0.001    0.000    0.001    0.000 {method 'view' of 'numpy.ndarray' objects}
     1004    0.001    0.000    0.002    0.000 common.py:150(ensure_python_int)
     1001    0.001    0.000    0.020    0.000 common.py:598(is_excluded_dtype)
     2010    0.001    0.000    0.002    0.000 common.py:1733(pandas_dtype)
     1000    0.001    0.000    0.003    0.000 generic.py:447(_info_axis)
     2008    0.001    0.000    0.002    0.000 inference.py:322(is_hashable)
     2002    0.001    0.000    0.002    0.000 managers.py:216(ndim)
     1001    0.001    0.000    0.002    0.000 common.py:1509(is_complex_dtype)
     3008    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     1004    0.001    0.000    0.001    0.000 {pandas._libs.lib.is_scalar}
     2000    0.001    0.000    0.007    0.000 series.py:595(__len__)
     2000    0.001    0.000    0.001    0.000 managers.py:163(blknos)
     4000    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
     4027    0.001    0.000    0.001    0.000 {built-in method builtins.hasattr}
     1001    0.001    0.000    0.009    0.000 _methods.py:56(_all)
     2002    0.001    0.000    0.001    0.000 managers.py:259(items)
     2000    0.001    0.000    0.001    0.000 managers.py:179(blklocs)
     2000    0.001    0.000    0.001    0.000 {pandas._libs.lib.item_from_zerodim}
     3006    0.001    0.000    0.001    0.000 base.py:1378(nlevels)
     2000    0.000    0.000    0.000    0.000 blocks.py:201(internal_values)
     2000    0.000    0.000    0.000    0.000 base.py:637(ndim)
        4    0.000    0.000    0.000    0.000 {pandas._libs.lib.infer_dtype}
     2006    0.000    0.000    0.000    0.000 blocks.py:233(mgr_locs)
     1001    0.000    0.000    0.000    0.000 numeric.py:2313(_array_equal_dispatcher)
       31    0.000    0.000    0.000    0.000 tokenize.py:429(_tokenize)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.compile}
        3    0.000    0.000    0.000    0.000 socket.py:432(send)
      4/2    0.000    0.000    0.001    0.000 base.py:293(__new__)
       33    0.000    0.000    0.000    0.000 {method 'match' of 're.Pattern' objects}
        1    0.000    0.000   62.198   62.198 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:468(make_tokens_by_line)
        2    0.000    0.000    0.000    0.000 managers.py:238(_rebuild_blknos_and_blklocs)
        1    0.000    0.000   62.198   62.198 execution.py:1200(time)
       30    0.000    0.000    0.000    0.000 <string>:1(<lambda>)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:585(transform_cell)
       29    0.000    0.000    0.000    0.000 re.py:289(_compile)
        1    0.000    0.000   62.197   62.197 <timed eval>:1(<module>)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:21(leading_empty_lines)
       29    0.000    0.000    0.000    0.000 tokenize.py:98(_compile)
        2    0.000    0.000    0.001    0.000 frame.py:441(__init__)
        4    0.000    0.000    0.000    0.000 {built-in method numpy.arange}
        2    0.000    0.000    0.001    0.000 construction.py:143(init_ndarray)
       29    0.000    0.000    0.000    0.000 types.py:171(__get__)
        1    0.000    0.000   62.198   62.198 interactiveshell.py:2286(run_line_magic)
        1    0.000    0.000    0.001    0.001 interactiveshell.py:3165(transform_cell)
       34    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x00007FFAA0E05C60}
        3    0.000    0.000    0.000    0.000 iostream.py:197(schedule)
       29    0.000    0.000    0.000    0.000 re.py:250(compile)
       24    0.000    0.000    0.000    0.000 traitlets.py:564(__get__)
        2    0.000    0.000    0.000    0.000 range.py:86(__new__)
        2    0.000    0.000    0.000    0.000 managers.py:1651(create_block_manager_from_blocks)
       24    0.000    0.000    0.000    0.000 traitlets.py:533(get)
        1    0.000    0.000    0.000    0.000 execution.py:1475(_format_time)
        4    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
        2    0.000    0.000    0.000    0.000 iostream.py:386(write)
        4    0.000    0.000    0.000    0.000 _dtype.py:321(_name_get)
        4    0.000    0.000    0.000    0.000 inputtransformer2.py:106(_find_assign_op)
        2    0.000    0.000    0.000    0.000 blocks.py:2655(get_block_type)
        5    0.000    0.000    0.000    0.000 range.py:153(_data)
        6    0.000    0.000    0.000    0.000 managers.py:212(shape)
        2    0.000    0.000    0.000    0.000 managers.py:321(_verify_integrity)
        2    0.000    0.000    0.000    0.000 base.py:5726(_maybe_cast_data_without_dtype)
        4    0.000    0.000    0.000    0.000 common.py:218(asarray_tuplesafe)
        2    0.000    0.000    0.000    0.000 managers.py:132(__init__)
        1    0.000    0.000   62.198   62.198 magic.py:187(<lambda>)
        2    0.000    0.000    0.000    0.000 {built-in method pandas._libs.lib.is_datetime_array}
        1    0.000    0.000    0.000    0.000 splitinput.py:53(split_user_input)
        6    0.000    0.000    0.001    0.000 base.py:5559(ensure_index)
        8    0.000    0.000    0.000    0.000 base.py:5656(maybe_extract_name)
        1    0.000    0.000    0.000    0.000 prefilter.py:255(find_handler)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:576(do_token_transforms)
        8    0.000    0.000    0.000    0.000 common.py:422(is_timedelta64_dtype)
        4    0.000    0.000    0.000    0.000 common.py:1330(is_bool_dtype)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:544(do_one_token_transform)
        1    0.000    0.000    0.000    0.000 prefilter.py:271(prefilter_line)
        2    0.000    0.000    0.000    0.000 range.py:134(_simple_new)
        2    0.000    0.000    0.000    0.000 blocks.py:2701(make_block)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1033(_handle_fromlist)
        2    0.000    0.000    0.000    0.000 base.py:463(_simple_new)
        3    0.000    0.000    0.000    0.000 threading.py:1089(is_alive)
        1    0.000    0.000   62.198   62.198 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 base.py:5650(default_index)
        2    0.000    0.000    0.000    0.000 common.py:190(all_none)
        2    0.000    0.000    0.000    0.000 base.py:2007(is_all_dates)
       29    0.000    0.000    0.000    0.000 enum.py:683(value)
       18    0.000    0.000    0.000    0.000 managers.py:214(<genexpr>)
       29    0.000    0.000    0.000    0.000 {method 'span' of 're.Match' objects}
        8    0.000    0.000    0.000    0.000 common.py:905(is_datetime64_any_dtype)
        2    0.000    0.000    0.000    0.000 construction.py:289(_prep_ndarray)
        2    0.000    0.000    0.001    0.000 construction.py:450(_get_axes)
        2    0.000    0.000    0.000    0.000 inputtransformer2.py:79(__call__)
        1    0.000    0.000    0.000    0.000 interactiveshell.py:2385(find_line_magic)
        1    0.000    0.000    0.000    0.000 prefilter.py:314(prefilter_lines)
       33    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        2    0.000    0.000    0.000    0.000 common.py:224(is_sparse)
        1    0.000    0.000    0.000    0.000 base.py:1032(__iter__)
        2    0.000    0.000    0.000    0.000 managers.py:683(_consolidate_check)
       28    0.000    0.000    0.000    0.000 {method 'isidentifier' of 'str' objects}
        3    0.000    0.000    0.000    0.000 threading.py:1035(_wait_for_tstate_lock)
        2    0.000    0.000    0.000    0.000 iostream.py:310(_is_master_process)
        4    0.000    0.000    0.000    0.000 {method 'fill' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:369(find)
        1    0.000    0.000    0.000    0.000 interactiveshell.py:2330(get_local_scope)
        4    0.000    0.000    0.000    0.000 common.py:750(is_signed_integer_dtype)
        4    0.000    0.000    0.000    0.000 common.py:806(is_unsigned_integer_dtype)
        2    0.000    0.000    0.000    0.000 base.py:5672(_maybe_cast_with_dtype)
        4    0.000    0.000    0.000    0.000 range.py:320(dtype)
        1    0.000    0.000    0.000    0.000 codeop.py:142(__call__)
        2    0.000    0.000    0.000    0.000 frame.py:585(shape)
        2    0.000    0.000    0.000    0.000 timing.py:63(clock2)
       10    0.000    0.000    0.000    0.000 common.py:188(<lambda>)
        2    0.000    0.000    0.000    0.000 base.py:2000(inferred_type)
        1    0.000    0.000    0.000    0.000 splitinput.py:110(__init__)
        2    0.000    0.000    0.000    0.000 common.py:696(is_integer_dtype)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:34(leading_indent)
        2    0.000    0.000    0.000    0.000 builtin_trap.py:39(__enter__)
        3    0.000    0.000    0.000    0.000 iostream.py:93(_event_pipe)
        1    0.000    0.000   62.197   62.197 {built-in method builtins.eval}
        4    0.000    0.000    0.000    0.000 base.py:544(_reset_identity)
        3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.lock' objects}
        6    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 compilerop.py:96(ast_parse)
        1    0.000    0.000    0.000    0.000 interactiveshell.py:2397(find_magic)
        2    0.000    0.000    0.000    0.000 prefilter.py:234(get_handler_by_name)
        2    0.000    0.000    0.000    0.000 builtin_trap.py:46(__exit__)
        1    0.000    0.000   62.198   62.198 <decorator-gen-55>:1(time)
        1    0.000    0.000    0.000    0.000 encoding.py:21(get_stream_enc)
       10    0.000    0.000    0.000    0.000 common.py:183(classes_and_not_datetimelike)
        2    0.000    0.000    0.000    0.000 base.py:1182(name)
        2    0.000    0.000    0.000    0.000 managers.py:138(<listcomp>)
        1    0.000    0.000    0.000    0.000 prefilter.py:458(check)
        4    0.000    0.000    0.000    0.000 _dtype.py:307(_name_includes_bit_suffix)
        2    0.000    0.000    0.000    0.000 common.py:194(is_object_dtype)
        2    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 prefilter.py:414(check)
        7    0.000    0.000    0.000    0.000 base.py:3870(_values)
        4    0.000    0.000    0.000    0.000 managers.py:323(<genexpr>)
        2    0.000    0.000    0.000    0.000 {built-in method builtins.sum}
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:94(cell_magic)
        1    0.000    0.000    0.000    0.000 prefilter.py:246(prefilter_line_info)
        4    0.000    0.000    0.000    0.000 _dtype.py:24(_kind_name)
        2    0.000    0.000    0.000    0.000 managers.py:675(is_consolidated)
        2    0.000    0.000    0.000    0.000 managers.py:684(<listcomp>)
        2    0.000    0.000    0.000    0.000 managers.py:977(_consolidate_inplace)
        3    0.000    0.000    0.000    0.000 {method 'splitlines' of 'str' objects}
        1    0.000    0.000    0.000    0.000 interactiveshell.py:3195(transform_ast)
        2    0.000    0.000    0.000    0.000 iostream.py:323(_schedule_flush)
        6    0.000    0.000    0.000    0.000 {built-in method time.perf_counter}
        1    0.000    0.000    0.000    0.000 prefilter.py:264(transform_line)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:214(find)
        1    0.000    0.000    0.000    0.000 prefilter.py:426(check)
        2    0.000    0.000    0.000    0.000 blocks.py:311(shape)
        1    0.000    0.000    0.000    0.000 prefilter.py:482(check)
        2    0.000    0.000    0.000    0.000 {built-in method time.time}
        2    0.000    0.000    0.000    0.000 {pandas._libs.algos.ensure_object}
        1    0.000    0.000    0.000    0.000 tokenize.py:612(generate_tokens)
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:429(find)
        4    0.000    0.000    0.000    0.000 common.py:194(<genexpr>)
        2    0.000    0.000    0.000    0.000 _methods.py:53(_any)
        2    0.000    0.000    0.000    0.000 {built-in method nt.getpid}
        1    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.all}
        1    0.000    0.000    0.000    0.000 inputtransformer2.py:246(find)
        3    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 py3compat.py:26(cast_unicode)
        3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        2    0.000    0.000    0.000    0.000 {pandas._libs.lib.is_iterator}
        3    0.000    0.000    0.000    0.000 threading.py:529(is_set)
        2    0.000    0.000    0.000    0.000 base.py:590(dtype)
        2    0.000    0.000    0.000    0.000 range.py:214(start)
        1    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 prefilter.py:440(check)
        3    0.000    0.000    0.000    0.000 numeric.py:237(inferred_type)
        1    0.000    0.000    0.000    0.000 {method 'groups' of 're.Match' objects}
        1    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 prefilter.py:147(transformers)
        1    0.000    0.000    0.000    0.000 prefilter.py:549(handle)
        2    0.000    0.000    0.000    0.000 numeric.py:81(_validate_dtype)
        2    0.000    0.000    0.000    0.000 range.py:237(stop)
        3    0.000    0.000    0.000    0.000 {method 'strip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'isspace' of 'str' objects}
        1    0.000    0.000    0.000    0.000 interactiveshell.py:705(get_ipython)
        2    0.000    0.000    0.000    0.000 range.py:260(step)
        1    0.000    0.000    0.000    0.000 {built-in method sys._getframe}
        1    0.000    0.000    0.000    0.000 prefilter.py:183(checkers)
        1    0.000    0.000    0.000    0.000 {method 'lstrip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}

Also, replacing idx with integers like so solves the issue, so it's not a matter of non-unique/non-sorted index

idx = np.random.choice(list(range(len(ascii_letters))), size=10000)
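
For anyone reproducing this outside IPython (no %time or %prun), the comparison between an object-dtype index and an integer index can be sketched roughly as below. This is only an illustration under assumptions: `time_frame_compare` is a hypothetical helper name, and the sizes are scaled down from the original report so it finishes quickly, so the absolute numbers will not match the timings above.

```python
import time
from string import ascii_letters

import numpy as np
import pandas as pd


def time_frame_compare(index, n_cols=50):
    """Hypothetical helper: seconds taken by assert_frame_equal on two equal frames."""
    data = np.random.randint(0, 101, (len(index), n_cols))
    left = pd.DataFrame(data, index=index)
    # Copy both the data and the index values so the frames share nothing
    right = pd.DataFrame(data.copy(), index=index.copy())
    start = time.perf_counter()
    pd.testing.assert_frame_equal(left, right)
    return time.perf_counter() - start


n_rows = 2000  # scaled down from the 10000 rows in the report
str_index = np.random.choice(np.array(list(ascii_letters)), size=n_rows).astype(object)
int_index = np.random.choice(len(ascii_letters), size=n_rows)

print(f"object-dtype index: {time_frame_compare(str_index):.4f} s")
print(f"integer index:      {time_frame_compare(int_index):.4f} s")
```

How large the gap is depends on the pandas version; the report above is against 1.1.4.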

@jbrockmendel
Member

it is looping over columns and doing assert_series_equal for each column, which means it is re-doing assert_index_equal 1000 times, which when you've attached an object-dtype index gets expensive.

@ivirshup
Contributor Author

ivirshup commented Dec 1, 2020

I'm not sure it's doing assert_index_equal in a straightforward way, since that could be much faster:

s = pd.Series(np.ones(10000), index=np.random.choice(list(ascii_letters), 10000))

%timeit pd.testing.assert_series_equal(s, s)
# 36.1 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.testing.assert_index_equal(s.index, s.index)
# 9.12 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It looks to me like there is an optimization for lhs is rhs in that assert_index_equal though, but it still doesn't make up for the discrepancy:

s_index = s.index.copy()

%timeit pd.testing.assert_index_equal(s.index, s_index)
# 268 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jbrockmendel
Member

Yes there is an optimization that assert_index_equal (really, Index.equals) does when it knows it shares the same data, or just is the same object. In the original example you created two separate Indexes that happen to share the same data, but Index.equals doesn't know that.
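
A small sketch of the identity distinction described above, using the public `Index.is_` method. That `Index.equals` short-circuits on this check is taken from the comment above rather than verified here, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array(list("abcde"), dtype=object)

idx = pd.Index(values)
same_view = idx.view()       # a new Python object that keeps idx's identity
separate = pd.Index(values)  # built from the same array, but a distinct Index

shares_identity = idx.is_(same_view)   # identity is preserved through views
separate_identity = idx.is_(separate)  # a separately constructed Index has its own identity
equal_by_value = idx.equals(separate)  # still equal, but only after comparing values

print(shares_identity, separate_identity, equal_by_value)
```

This matches the situation in the original example: both frames were built with `index=idx`, so each got its own Index, and only a value-by-value comparison can show they are equal.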

@jreback
Contributor

jreback commented Dec 1, 2020

yea i think for _array_equivalent_object we should infer_dtype on each side - if we have certain things (eg mixed) then take the slow path otherwise i think we use the fast path (which is in cython)

the slow path is really slow because it's materializing everything and doing a lot of comparisons
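
A rough sketch of the dispatch idea, assuming 1-D object arrays. `object_arrays_equal` and `_scalars_match` are hypothetical names, not pandas internals, and the vectorized NumPy comparison here merely stands in for the cython fast path mentioned above. `pandas.api.types.infer_dtype` is the public inference function.

```python
import numpy as np
from pandas.api.types import infer_dtype


def _scalars_match(x, y):
    """Slow-path scalar comparison that treats NaN as equal to NaN."""
    if x is y:
        return True
    both_nan = isinstance(x, float) and isinstance(y, float) and x != x and y != y
    return both_nan or bool(x == y)


def object_arrays_equal(left, right):
    """Hypothetical helper: pick a comparison strategy from the inferred dtype."""
    if left.shape != right.shape:
        return False
    # skipna=False so that any missing value pushes us onto the slow path
    kinds = {infer_dtype(left, skipna=False), infer_dtype(right, skipna=False)}
    if kinds == {"string"}:
        # Fast path: both sides hold only strings, so one vectorized == is enough
        return bool((left == right).all())
    # Slow path: element-by-element comparison for mixed contents
    return all(_scalars_match(x, y) for x, y in zip(left, right))


strings = np.array(list("abc"), dtype=object)
mixed = np.array(["a", float("nan"), 1], dtype=object)
print(object_arrays_equal(strings, strings.copy()))
print(object_arrays_equal(mixed, mixed.copy()))
```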

@jbrockmendel
Member

could just do assert_(ea|numpy)_array_equal on each column's values instead of assert_series_equal, so the index equality is only checked once, directly in assert_frame_equal
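
Until something like that lands, a user-side workaround along the same lines might look like the sketch below. `assert_frame_equal_index_once` is a hypothetical helper, and it swaps in the public `np.testing.assert_array_equal` for the pandas-internal array assertions named above, so it is looser than `assert_frame_equal` (for example, it does not check column dtypes).

```python
import numpy as np
import pandas as pd


def assert_frame_equal_index_once(left, right):
    """Hypothetical helper: compare the axes once, then compare raw column values."""
    pd.testing.assert_index_equal(left.index, right.index)
    pd.testing.assert_index_equal(left.columns, right.columns)
    for pos in range(left.shape[1]):
        # Only the values are compared here, so the index is never re-checked
        np.testing.assert_array_equal(
            left.iloc[:, pos].to_numpy(),
            right.iloc[:, pos].to_numpy(),
            err_msg=f"column {left.columns[pos]!r} differs",
        )


labels = np.array(list("abcd"), dtype=object)
df1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=labels)
df2 = pd.DataFrame(np.arange(8).reshape(4, 2), index=labels.copy())
assert_frame_equal_index_once(df1, df2)  # passes silently
```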

@ivirshup
Contributor Author

ivirshup commented Dec 1, 2020

> Yes there is an optimization that assert_index_equal (really, Index.equals) does when it knows it shares the same data, or just is the same object. In the original example you created two separate Indexes that happen to share the same data, but Index.equals doesn't know that.

Sure, but this shouldn't have much to do with comparisons taking a while, since pd.testing.assert_series_equal still takes two orders of magnitude longer than pd.testing.assert_index_equal even when the indexes don't share data.

arr = np.random.choice(np.array(list(ascii_letters)), size=10000).astype(object)

str_idx1 = pd.Index(arr.copy())
str_idx2 = pd.Index(arr.copy())

intidx_s1 = pd.Series(np.ones(10000))
intidx_s2 = pd.Series(np.ones(10000)) 

stridx_s1 = pd.Series(np.ones(10000), index=str_idx1) 
stridx_s2 = pd.Series(np.ones(10000), index=str_idx2)

%timeit pd.testing.assert_index_equal(str_idx1, str_idx2)
# 262 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.testing.assert_index_equal(str_idx1, str_idx2, exact=False)  # Like in assert_series_equal
# 257 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit pd.testing.assert_series_equal(intidx_s1, intidx_s2)
# 147 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit pd.testing.assert_series_equal(stridx_s1, stridx_s2)
# 35.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

According to prun, most of the time is spent in assert_index_equal, but I don't see how that's happening when running assert_index_equal by itself seems quite fast.

@ivanovmg
Member

ivanovmg commented Dec 1, 2020

Looks like inside assert_series_equal the check would go back and forth between itself and assert_almost_equal, while doing assert_index_equal check every time.

@ivirshup
Contributor Author

ivirshup commented Dec 1, 2020

Ohh, I see I made a mistake in my last comment, and used the wrong argument:

%timeit pd.testing.assert_index_equal(str_idx1, str_idx2, check_exact=False)
# 38.5 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Also, I think this added some confusion on my end:

pd._testing._testing.assert_almost_equal is pd._testing.assert_almost_equal
# False

@jreback modified the milestones: 1.2, Contributions Welcome Dec 7, 2020
@jreback modified the milestones: Contributions Welcome, 1.3 Dec 24, 2020