BUG: assert_frame_equal can be very slow #38183
Comments
Interesting, thanks! Can you see if you can pin down where this is slow?
I believe this is the issue: pandas/pandas/core/dtypes/missing.py Line 420 in 450384e
Prun results below. Out of 62s total runtime, 61s cumtime from
%prun time pd.testing.assert_frame_equal(pd.DataFrame(a, index=idx), pd.DataFrame(a, index=idx))
PRUN RESULTS
Wall time: 1min 2s
110939878 function calls (110925852 primitive calls) in 62.198 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
10011003 13.073 0.000 13.073 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1001 9.925 0.010 61.384 0.061 missing.py:456(_array_equivalent_object)
10010000 9.884 0.000 26.166 0.000 fromnumeric.py:70(_wrapreduction)
10019015/10018015 5.749 0.000 5.751 0.000 {built-in method numpy.array}
10010000 5.307 0.000 31.473 0.000 fromnumeric.py:2249(any)
10010000 5.210 0.000 41.858 0.000 <__array_function__ internals>:2(any)
10011001 3.869 0.000 35.531 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
10019013/10018013 2.320 0.000 8.071 0.000 _asarray.py:14(asarray)
10010000 2.146 0.000 2.146 0.000 fromnumeric.py:71(<dictcomp>)
10148884 1.593 0.000 1.632 0.000 {built-in method builtins.isinstance}
10010000 1.308 0.000 1.308 0.000 fromnumeric.py:2245(_any_dispatcher)
10010000 1.070 0.000 1.070 0.000 {method 'items' of 'dict' objects}
1001 0.176 0.000 0.189 0.000 numeric.py:2317(array_equal)
1000 0.029 0.000 61.856 0.062 _testing.py:1217(assert_series_equal)
87153 0.024 0.000 0.033 0.000 generic.py:10(_check)
120236 0.022 0.000 0.031 0.000 {built-in method builtins.getattr}
2002 0.018 0.000 61.699 0.031 {built-in method pandas._libs.testing.assert_almost_equal}
1 0.018 0.018 62.196 62.196 _testing.py:1419(assert_frame_equal)
1002 0.016 0.000 61.496 0.061 _testing.py:660(assert_index_equal)
2000 0.015 0.000 0.166 0.000 indexing.py:757(_getitem_lowerdim)
2000 0.014 0.000 0.041 0.000 managers.py:984(iget)
23043 0.013 0.000 0.062 0.000 base.py:256(is_dtype)
2002 0.013 0.000 61.676 0.031 missing.py:358(array_equivalent)
4005 0.012 0.000 0.049 0.000 _testing.py:844(assert_attr_equal)
34078/22054 0.012 0.000 0.019 0.000 {built-in method builtins.len}
4000 0.011 0.000 0.022 0.000 generic.py:5141(__setattr__)
2000 0.010 0.000 0.241 0.000 indexing.py:864(__getitem__)
11000 0.009 0.000 0.017 0.000 managers.py:1602(dtype)
8026 0.009 0.000 0.016 0.000 common.py:1600(_is_dtype_type)
11745 0.009 0.000 0.009 0.000 {built-in method _abc._abc_instancecheck}
2000 0.008 0.000 0.033 0.000 series.py:201(__init__)
7013 0.008 0.000 0.026 0.000 dtypes.py:906(is_dtype)
8019 0.007 0.000 0.034 0.000 common.py:530(is_categorical_dtype)
6004 0.007 0.000 0.074 0.000 common.py:1180(needs_i8_conversion)
2000 0.007 0.000 0.123 0.000 indexing.py:1474(_getitem_axis)
2000 0.007 0.000 0.100 0.000 frame.py:2816(_ixs)
6000 0.007 0.000 0.029 0.000 common.py:97(is_bool_indexer)
10000 0.007 0.000 0.014 0.000 {pandas._libs.lib.is_list_like}
6004 0.007 0.000 0.022 0.000 common.py:1025(is_datetime_or_timedelta_dtype)
4000 0.007 0.000 0.041 0.000 indexing.py:1334(_validate_key)
11000 0.006 0.000 0.024 0.000 series.py:427(dtype)
2002 0.006 0.000 0.015 0.000 blocks.py:124(__init__)
1002 0.006 0.000 0.006 0.000 frozen.py:76(__eq__)
6000 0.006 0.000 0.008 0.000 generic.py:377(_get_axis)
7015 0.006 0.000 0.033 0.000 common.py:456(is_period_dtype)
2000 0.005 0.000 0.041 0.000 frame.py:3184(_box_col_values)
4000 0.005 0.000 0.017 0.000 indexing.py:1419(_validate_integer)
2002 0.005 0.000 0.005 0.000 blocks.py:237(mgr_locs)
6000 0.005 0.000 0.007 0.000 indexing.py:866(<genexpr>)
2000 0.005 0.000 0.045 0.000 indexing.py:694(_has_valid_tuple)
4000 0.005 0.000 0.006 0.000 range.py:697(__getitem__)
2000 0.005 0.000 0.009 0.000 indexing.py:1400(_is_scalar_access)
4000 0.004 0.000 0.008 0.000 series.py:442(name)
13000 0.004 0.000 0.004 0.000 managers.py:1575(_block)
2002 0.004 0.000 0.004 0.000 generic.py:195(__init__)
6006 0.004 0.000 0.019 0.000 common.py:381(is_datetime64tz_dtype)
2000 0.004 0.000 0.015 0.000 indexing.py:709(_is_nested_tuple_indexer)
11745 0.004 0.000 0.013 0.000 abc.py:96(__instancecheck__)
3003 0.004 0.000 0.031 0.000 common.py:566(is_string_dtype)
11002 0.004 0.000 0.004 0.000 blocks.py:315(dtype)
2004 0.004 0.000 0.005 0.000 base.py:1213(_get_names)
2000 0.004 0.000 0.019 0.000 blocks.py:256(make_block_same_class)
1000 0.004 0.000 0.006 0.000 range.py:341(__contains__)
3001 0.004 0.000 0.027 0.000 {built-in method builtins.any}
2011 0.004 0.000 0.012 0.000 common.py:492(is_interval_dtype)
2000 0.003 0.000 0.003 0.000 generic.py:5123(__getattr__)
2000 0.003 0.000 0.003 0.000 series.py:398(_set_axis)
2009 0.003 0.000 0.009 0.000 dtypes.py:1119(is_dtype)
2000 0.003 0.000 0.004 0.000 managers.py:1532(__init__)
2000 0.003 0.000 0.214 0.000 indexing.py:1441(_getitem_tuple)
3003 0.003 0.000 0.023 0.000 common.py:595(condition)
7011 0.003 0.000 0.004 0.000 common.py:1565(_get_dtype)
2000 0.003 0.000 0.008 0.000 generic.py:3180(_set_as_cached)
2002 0.003 0.000 0.005 0.000 common.py:608(is_dtype_equal)
4000 0.003 0.000 0.006 0.000 indexing.py:2258(is_label_like)
1018 0.003 0.000 0.005 0.000 common.py:1460(is_extension_array_dtype)
3003 0.003 0.000 0.027 0.000 common.py:1541(_is_dtype)
8016 0.003 0.000 0.004 0.000 common.py:180(<lambda>)
1000 0.003 0.000 0.012 0.000 generic.py:1719(__contains__)
1745 0.003 0.000 0.012 0.000 inference.py:31(is_number)
1002 0.003 0.000 0.005 0.000 _testing.py:711(_check_types)
2000 0.002 0.000 0.002 0.000 blocks.py:319(iget)
6000 0.002 0.000 0.002 0.000 generic.py:365(_get_axis_number)
2000 0.002 0.000 0.004 0.000 series.py:492(name)
1001 0.002 0.000 0.193 0.000 <__array_function__ internals>:2(array_equal)
2000 0.002 0.000 0.002 0.000 common.py:283(is_null_slice)
8016 0.002 0.000 0.002 0.000 common.py:178(classes)
10000 0.002 0.000 0.002 0.000 {pandas._libs.lib.is_integer}
2002 0.002 0.000 0.002 0.000 _testing.py:818(assert_class_equal)
6000 0.002 0.000 0.007 0.000 indexing.py:715(<genexpr>)
2003 0.002 0.000 0.002 0.000 _testing.py:458(_check_isinstance)
1001 0.002 0.000 0.029 0.000 common.py:1123(is_datetimelike_v_numeric)
2000 0.002 0.000 0.004 0.000 generic.py:471(ndim)
6000 0.002 0.000 0.012 0.000 inference.py:185(is_array_like)
1001 0.002 0.000 0.011 0.000 {method 'all' of 'numpy.ndarray' objects}
9070 0.002 0.000 0.002 0.000 {built-in method builtins.issubclass}
2000 0.002 0.000 0.005 0.000 managers.py:306(__len__)
4014 0.002 0.000 0.003 0.000 range.py:687(__len__)
1000 0.002 0.000 0.003 0.000 base.py:573(__array__)
2000 0.002 0.000 0.003 0.000 managers.py:1613(internal_values)
2000 0.002 0.000 0.002 0.000 frame.py:568(axes)
4004 0.002 0.000 0.017 0.000 common.py:603(<genexpr>)
2004 0.002 0.000 0.003 0.000 base.py:3838(values)
4010 0.002 0.000 0.002 0.000 base.py:567(__len__)
1018 0.002 0.000 0.002 0.000 base.py:413(find)
2000 0.002 0.000 0.002 0.000 indexing.py:100(iloc)
4000 0.002 0.000 0.002 0.000 common.py:329(apply_if_callable)
1007 0.002 0.000 0.003 0.000 common.py:1296(is_float_dtype)
2002 0.002 0.000 0.002 0.000 blocks.py:135(_check_ndim)
2000 0.001 0.000 0.004 0.000 series.py:540(_values)
2004 0.001 0.000 0.001 0.000 base.py:1175(name)
4000 0.001 0.000 0.005 0.000 indexers.py:52(is_list_like_indexer)
2004 0.001 0.000 0.001 0.000 {method 'view' of 'numpy.ndarray' objects}
1004 0.001 0.000 0.002 0.000 common.py:150(ensure_python_int)
1001 0.001 0.000 0.020 0.000 common.py:598(is_excluded_dtype)
2010 0.001 0.000 0.002 0.000 common.py:1733(pandas_dtype)
1000 0.001 0.000 0.003 0.000 generic.py:447(_info_axis)
2008 0.001 0.000 0.002 0.000 inference.py:322(is_hashable)
2002 0.001 0.000 0.002 0.000 managers.py:216(ndim)
1001 0.001 0.000 0.002 0.000 common.py:1509(is_complex_dtype)
3008 0.001 0.000 0.001 0.000 {built-in method builtins.hash}
1004 0.001 0.000 0.001 0.000 {pandas._libs.lib.is_scalar}
2000 0.001 0.000 0.007 0.000 series.py:595(__len__)
2000 0.001 0.000 0.001 0.000 managers.py:163(blknos)
4000 0.001 0.000 0.001 0.000 {built-in method builtins.callable}
4027 0.001 0.000 0.001 0.000 {built-in method builtins.hasattr}
1001 0.001 0.000 0.009 0.000 _methods.py:56(_all)
2002 0.001 0.000 0.001 0.000 managers.py:259(items)
2000 0.001 0.000 0.001 0.000 managers.py:179(blklocs)
2000 0.001 0.000 0.001 0.000 {pandas._libs.lib.item_from_zerodim}
3006 0.001 0.000 0.001 0.000 base.py:1378(nlevels)
2000 0.000 0.000 0.000 0.000 blocks.py:201(internal_values)
2000 0.000 0.000 0.000 0.000 base.py:637(ndim)
4 0.000 0.000 0.000 0.000 {pandas._libs.lib.infer_dtype}
2006 0.000 0.000 0.000 0.000 blocks.py:233(mgr_locs)
1001 0.000 0.000 0.000 0.000 numeric.py:2313(_array_equal_dispatcher)
31 0.000 0.000 0.000 0.000 tokenize.py:429(_tokenize)
2 0.000 0.000 0.000 0.000 {built-in method builtins.compile}
3 0.000 0.000 0.000 0.000 socket.py:432(send)
4/2 0.000 0.000 0.001 0.000 base.py:293(__new__)
33 0.000 0.000 0.000 0.000 {method 'match' of 're.Pattern' objects}
1 0.000 0.000 62.198 62.198 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 inputtransformer2.py:468(make_tokens_by_line)
2 0.000 0.000 0.000 0.000 managers.py:238(_rebuild_blknos_and_blklocs)
1 0.000 0.000 62.198 62.198 execution.py:1200(time)
30 0.000 0.000 0.000 0.000 <string>:1(<lambda>)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:585(transform_cell)
29 0.000 0.000 0.000 0.000 re.py:289(_compile)
1 0.000 0.000 62.197 62.197 <timed eval>:1(<module>)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:21(leading_empty_lines)
29 0.000 0.000 0.000 0.000 tokenize.py:98(_compile)
2 0.000 0.000 0.001 0.000 frame.py:441(__init__)
4 0.000 0.000 0.000 0.000 {built-in method numpy.arange}
2 0.000 0.000 0.001 0.000 construction.py:143(init_ndarray)
29 0.000 0.000 0.000 0.000 types.py:171(__get__)
1 0.000 0.000 62.198 62.198 interactiveshell.py:2286(run_line_magic)
1 0.000 0.000 0.001 0.001 interactiveshell.py:3165(transform_cell)
34 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x00007FFAA0E05C60}
3 0.000 0.000 0.000 0.000 iostream.py:197(schedule)
29 0.000 0.000 0.000 0.000 re.py:250(compile)
24 0.000 0.000 0.000 0.000 traitlets.py:564(__get__)
2 0.000 0.000 0.000 0.000 range.py:86(__new__)
2 0.000 0.000 0.000 0.000 managers.py:1651(create_block_manager_from_blocks)
24 0.000 0.000 0.000 0.000 traitlets.py:533(get)
1 0.000 0.000 0.000 0.000 execution.py:1475(_format_time)
4 0.000 0.000 0.000 0.000 {built-in method numpy.empty}
2 0.000 0.000 0.000 0.000 iostream.py:386(write)
4 0.000 0.000 0.000 0.000 _dtype.py:321(_name_get)
4 0.000 0.000 0.000 0.000 inputtransformer2.py:106(_find_assign_op)
2 0.000 0.000 0.000 0.000 blocks.py:2655(get_block_type)
5 0.000 0.000 0.000 0.000 range.py:153(_data)
6 0.000 0.000 0.000 0.000 managers.py:212(shape)
2 0.000 0.000 0.000 0.000 managers.py:321(_verify_integrity)
2 0.000 0.000 0.000 0.000 base.py:5726(_maybe_cast_data_without_dtype)
4 0.000 0.000 0.000 0.000 common.py:218(asarray_tuplesafe)
2 0.000 0.000 0.000 0.000 managers.py:132(__init__)
1 0.000 0.000 62.198 62.198 magic.py:187(<lambda>)
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_datetime_array}
1 0.000 0.000 0.000 0.000 splitinput.py:53(split_user_input)
6 0.000 0.000 0.001 0.000 base.py:5559(ensure_index)
8 0.000 0.000 0.000 0.000 base.py:5656(maybe_extract_name)
1 0.000 0.000 0.000 0.000 prefilter.py:255(find_handler)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:576(do_token_transforms)
8 0.000 0.000 0.000 0.000 common.py:422(is_timedelta64_dtype)
4 0.000 0.000 0.000 0.000 common.py:1330(is_bool_dtype)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:544(do_one_token_transform)
1 0.000 0.000 0.000 0.000 prefilter.py:271(prefilter_line)
2 0.000 0.000 0.000 0.000 range.py:134(_simple_new)
2 0.000 0.000 0.000 0.000 blocks.py:2701(make_block)
4 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:1033(_handle_fromlist)
2 0.000 0.000 0.000 0.000 base.py:463(_simple_new)
3 0.000 0.000 0.000 0.000 threading.py:1089(is_alive)
1 0.000 0.000 62.198 62.198 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 base.py:5650(default_index)
2 0.000 0.000 0.000 0.000 common.py:190(all_none)
2 0.000 0.000 0.000 0.000 base.py:2007(is_all_dates)
29 0.000 0.000 0.000 0.000 enum.py:683(value)
18 0.000 0.000 0.000 0.000 managers.py:214(<genexpr>)
29 0.000 0.000 0.000 0.000 {method 'span' of 're.Match' objects}
8 0.000 0.000 0.000 0.000 common.py:905(is_datetime64_any_dtype)
2 0.000 0.000 0.000 0.000 construction.py:289(_prep_ndarray)
2 0.000 0.000 0.001 0.000 construction.py:450(_get_axes)
2 0.000 0.000 0.000 0.000 inputtransformer2.py:79(__call__)
1 0.000 0.000 0.000 0.000 interactiveshell.py:2385(find_line_magic)
1 0.000 0.000 0.000 0.000 prefilter.py:314(prefilter_lines)
33 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
2 0.000 0.000 0.000 0.000 common.py:224(is_sparse)
1 0.000 0.000 0.000 0.000 base.py:1032(__iter__)
2 0.000 0.000 0.000 0.000 managers.py:683(_consolidate_check)
28 0.000 0.000 0.000 0.000 {method 'isidentifier' of 'str' objects}
3 0.000 0.000 0.000 0.000 threading.py:1035(_wait_for_tstate_lock)
2 0.000 0.000 0.000 0.000 iostream.py:310(_is_master_process)
4 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 inputtransformer2.py:369(find)
1 0.000 0.000 0.000 0.000 interactiveshell.py:2330(get_local_scope)
4 0.000 0.000 0.000 0.000 common.py:750(is_signed_integer_dtype)
4 0.000 0.000 0.000 0.000 common.py:806(is_unsigned_integer_dtype)
2 0.000 0.000 0.000 0.000 base.py:5672(_maybe_cast_with_dtype)
4 0.000 0.000 0.000 0.000 range.py:320(dtype)
1 0.000 0.000 0.000 0.000 codeop.py:142(__call__)
2 0.000 0.000 0.000 0.000 frame.py:585(shape)
2 0.000 0.000 0.000 0.000 timing.py:63(clock2)
10 0.000 0.000 0.000 0.000 common.py:188(<lambda>)
2 0.000 0.000 0.000 0.000 base.py:2000(inferred_type)
1 0.000 0.000 0.000 0.000 splitinput.py:110(__init__)
2 0.000 0.000 0.000 0.000 common.py:696(is_integer_dtype)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:34(leading_indent)
2 0.000 0.000 0.000 0.000 builtin_trap.py:39(__enter__)
3 0.000 0.000 0.000 0.000 iostream.py:93(_event_pipe)
1 0.000 0.000 62.197 62.197 {built-in method builtins.eval}
4 0.000 0.000 0.000 0.000 base.py:544(_reset_identity)
3 0.000 0.000 0.000 0.000 {method 'acquire' of '_thread.lock' objects}
6 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
1 0.000 0.000 0.000 0.000 compilerop.py:96(ast_parse)
1 0.000 0.000 0.000 0.000 interactiveshell.py:2397(find_magic)
2 0.000 0.000 0.000 0.000 prefilter.py:234(get_handler_by_name)
2 0.000 0.000 0.000 0.000 builtin_trap.py:46(__exit__)
1 0.000 0.000 62.198 62.198 <decorator-gen-55>:1(time)
1 0.000 0.000 0.000 0.000 encoding.py:21(get_stream_enc)
10 0.000 0.000 0.000 0.000 common.py:183(classes_and_not_datetimelike)
2 0.000 0.000 0.000 0.000 base.py:1182(name)
2 0.000 0.000 0.000 0.000 managers.py:138(<listcomp>)
1 0.000 0.000 0.000 0.000 prefilter.py:458(check)
4 0.000 0.000 0.000 0.000 _dtype.py:307(_name_includes_bit_suffix)
2 0.000 0.000 0.000 0.000 common.py:194(is_object_dtype)
2 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 prefilter.py:414(check)
7 0.000 0.000 0.000 0.000 base.py:3870(_values)
4 0.000 0.000 0.000 0.000 managers.py:323(<genexpr>)
2 0.000 0.000 0.000 0.000 {built-in method builtins.sum}
1 0.000 0.000 0.000 0.000 inputtransformer2.py:94(cell_magic)
1 0.000 0.000 0.000 0.000 prefilter.py:246(prefilter_line_info)
4 0.000 0.000 0.000 0.000 _dtype.py:24(_kind_name)
2 0.000 0.000 0.000 0.000 managers.py:675(is_consolidated)
2 0.000 0.000 0.000 0.000 managers.py:684(<listcomp>)
2 0.000 0.000 0.000 0.000 managers.py:977(_consolidate_inplace)
3 0.000 0.000 0.000 0.000 {method 'splitlines' of 'str' objects}
1 0.000 0.000 0.000 0.000 interactiveshell.py:3195(transform_ast)
2 0.000 0.000 0.000 0.000 iostream.py:323(_schedule_flush)
6 0.000 0.000 0.000 0.000 {built-in method time.perf_counter}
1 0.000 0.000 0.000 0.000 prefilter.py:264(transform_line)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:214(find)
1 0.000 0.000 0.000 0.000 prefilter.py:426(check)
2 0.000 0.000 0.000 0.000 blocks.py:311(shape)
1 0.000 0.000 0.000 0.000 prefilter.py:482(check)
2 0.000 0.000 0.000 0.000 {built-in method time.time}
2 0.000 0.000 0.000 0.000 {pandas._libs.algos.ensure_object}
1 0.000 0.000 0.000 0.000 tokenize.py:612(generate_tokens)
1 0.000 0.000 0.000 0.000 inputtransformer2.py:429(find)
4 0.000 0.000 0.000 0.000 common.py:194(<genexpr>)
2 0.000 0.000 0.000 0.000 _methods.py:53(_any)
2 0.000 0.000 0.000 0.000 {built-in method nt.getpid}
1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 {built-in method builtins.all}
1 0.000 0.000 0.000 0.000 inputtransformer2.py:246(find)
3 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.000 0.000 0.000 0.000 py3compat.py:26(cast_unicode)
3 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
2 0.000 0.000 0.000 0.000 {pandas._libs.lib.is_iterator}
3 0.000 0.000 0.000 0.000 threading.py:529(is_set)
2 0.000 0.000 0.000 0.000 base.py:590(dtype)
2 0.000 0.000 0.000 0.000 range.py:214(start)
1 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 prefilter.py:440(check)
3 0.000 0.000 0.000 0.000 numeric.py:237(inferred_type)
1 0.000 0.000 0.000 0.000 {method 'groups' of 're.Match' objects}
1 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 prefilter.py:147(transformers)
1 0.000 0.000 0.000 0.000 prefilter.py:549(handle)
2 0.000 0.000 0.000 0.000 numeric.py:81(_validate_dtype)
2 0.000 0.000 0.000 0.000 range.py:237(stop)
3 0.000 0.000 0.000 0.000 {method 'strip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'isspace' of 'str' objects}
1 0.000 0.000 0.000 0.000 interactiveshell.py:705(get_ipython)
2 0.000 0.000 0.000 0.000 range.py:260(step)
1 0.000 0.000 0.000 0.000 {built-in method sys._getframe}
1 0.000 0.000 0.000 0.000 prefilter.py:183(checkers)
1 0.000 0.000 0.000 0.000 {method 'lstrip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.iter}
Also, replacing idx with integers like so solves the issue, so it's not a matter of a non-unique/non-sorted index:
idx = np.random.choice(list(range(len(ascii_letters))), size=10000)
It is looping over columns and doing assert_series_equal for each column, which means it is re-doing assert_index_equal 1000 times. That gets expensive when you've attached an object-dtype index.
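If the repeated per-column index check is indeed the cost, one possible workaround is to compare the row index once and then compare the frames with that index dropped. The sketch below is only an illustration: the helper name `assert_frame_equal_index_once` and the array sizes are assumptions, not part of pandas or of the original report.

```python
from string import ascii_letters

import numpy as np
import pandas as pd


def assert_frame_equal_index_once(left, right):
    """Hypothetical helper: check the row index once, then the data."""
    # One (possibly slow) object-dtype comparison instead of one per column.
    pd.testing.assert_index_equal(left.index, right.index)
    # With the row index replaced by a default RangeIndex, the per-column
    # index checks inside assert_frame_equal no longer touch the strings.
    pd.testing.assert_frame_equal(
        left.reset_index(drop=True), right.reset_index(drop=True)
    )


# Sizes are assumed for illustration and kept small so this runs quickly.
rng = np.random.default_rng(0)
idx = rng.choice(list(ascii_letters), size=1000)
a = rng.random((1000, 20))

assert_frame_equal_index_once(pd.DataFrame(a, index=idx), pd.DataFrame(a, index=idx))
```

This changes what is compared only in that the row index is checked a single time; the column labels and values still go through `assert_frame_equal`.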
I'm not sure it's doing
s = pd.Series(np.ones(10000), index=np.random.choice(list(ascii_letters), 10000))
%timeit pd.testing.assert_series_equal(s, s)
# 36.1 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.testing.assert_index_equal(s.index, s.index)
# 9.12 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
It looks to me like there is an optimization for
s_index = s.index.copy()
%timeit pd.testing.assert_index_equal(s.index, s_index)
# 268 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Yes, there is an optimization that assert_index_equal (really, Index.equals) does when it knows it shares the same data, or just is the same object. In the original example you created two separate Indexes that happen to share the same data, but Index.equals doesn't know that.
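A small sketch of the identity shortcut described here, using `Index.is_` to show when two indexes are known to be the same object. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array(list("abcabc"), dtype=object)
idx_a = pd.Index(arr.copy())
idx_b = pd.Index(arr.copy())  # equal contents, but a separately built Index

# Same object: identity is known, so an equality check can return early.
same_object = idx_a.is_(idx_a)
# Separately constructed: identity is not shared, so values must be compared.
shared_identity = idx_a.is_(idx_b)
equal_values = idx_a.equals(idx_b)

print(same_object, shared_identity, equal_values)
```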
Yeah, I think for _array_equivalent_object we should infer_dtype on each side. If we have certain things (e.g. mixed), then take the slow path; otherwise I think we use the fast path (which is in cython). The slow path is really slow because it's materializing everything and doing a lot of comparisons.
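A rough sketch of what such a dispatch could look like, assuming `pandas.api.types.infer_dtype` is used to classify each side. The helper `object_arrays_equal` is hypothetical and is not the pandas implementation. It assumes 1-D object arrays of scalars, and its slow path treats any two missing values as equal, which is a simplification.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def object_arrays_equal(left, right):
    """Hypothetical helper: pick a fast or slow comparison per inferred type."""
    if left.shape != right.shape:
        return False
    kinds = {infer_dtype(left, skipna=False), infer_dtype(right, skipna=False)}
    if kinds == {"string"}:
        # Fast path: homogeneous strings, one vectorised comparison.
        return bool(np.array_equal(left, right))
    # Slow path: element by element, treating missing values as equal.
    for lv, rv in zip(left, right):
        if pd.isna(lv) and pd.isna(rv):
            continue
        if lv != rv:
            return False
    return True


strings = np.array(["a", "b", "c"], dtype=object)
mixed = np.array(["a", np.nan, 1], dtype=object)
print(object_arrays_equal(strings, strings.copy()))
print(object_arrays_equal(mixed, mixed.copy()))
```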
could just do
Sure, but this shouldn't have much to do with comparisons taking a while, since
arr = np.random.choice(np.array(list(ascii_letters)), size=10000).astype(object)
str_idx1 = pd.Index(arr.copy())
str_idx2 = pd.Index(arr.copy())
intidx_s1 = pd.Series(np.ones(10000))
intidx_s2 = pd.Series(np.ones(10000))
stridx_s1 = pd.Series(np.ones(10000), index=str_idx1)
stridx_s2 = pd.Series(np.ones(10000), index=str_idx2)
%timeit pd.testing.assert_index_equal(str_idx1, str_idx2)
# 262 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.testing.assert_index_equal(str_idx1, str_idx2, exact=False) # Like in assert_series_equal
# 257 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.testing.assert_series_equal(intidx_s1, intidx_s2)
# 147 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.testing.assert_series_equal(stridx_s1, stridx_s2)
# 35.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
According to
Looks like inside
Ohh, I see I made a mistake in my last comment, and used the wrong argument:
%timeit pd.testing.assert_index_equal(str_idx1, str_idx2, check_exact=False)
# 38.5 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Also, I think this added some confusion on my end:
pd._testing._testing.assert_almost_equal is pd._testing.assert_almost_equal
# False
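For anyone who wants to repeat the check_exact comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module in place of `%timeit`. The helper `seconds_per_call` and the repeat count are assumptions, and the index is forced to object dtype to match the case discussed in this thread. Timings will vary by machine and pandas version.

```python
import timeit
from string import ascii_letters

import numpy as np
import pandas as pd

arr = np.random.choice(np.array(list(ascii_letters)), size=10000).astype(object)
str_idx1 = pd.Index(arr.copy(), dtype=object)
str_idx2 = pd.Index(arr.copy(), dtype=object)


def seconds_per_call(check_exact, number=20):
    """Average wall time of one assert_index_equal call, in seconds."""
    total = timeit.timeit(
        lambda: pd.testing.assert_index_equal(
            str_idx1, str_idx2, check_exact=check_exact
        ),
        number=number,
    )
    return total / number


print(f"check_exact=True:  {seconds_per_call(True):.6f} s per call")
print(f"check_exact=False: {seconds_per_call(False):.6f} s per call")
```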
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Comparisons that each take less than a second take 50 seconds when they are combined:
Problem description
It seems weird that the whole process slows down when the index has strings in it. It's especially weird since comparing the index itself is fairly fast. This may be a regression, since I've never noticed this before, and this seems very noticeable.
This might have something to do with #38091
Output of
pd.show_versions()