Improve DataFrame.select_dtypes scaling to wide data frames #28317
For up to 10k columns I saw the same behavior as the one you described: for 10k columns it took me 3 seconds. For 100k columns it takes 160 seconds (instead of the expected roughly 30 seconds). Profile
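The attached profile itself is not reproduced here; a rough sketch of how a comparable profile could be generated with the standard-library cProfile (the column counts and dtype mix below are assumptions, not the exact setup from this comment):

```python
# Illustrative only: build a wide two-dtype frame and profile select_dtypes.
import cProfile

import numpy as np
import pandas as pd

data = {f"i{k}": np.arange(3) for k in range(5_000)}      # int64 columns
data.update({f"f{k}": np.ones(3) for k in range(5_000)})  # float64 columns
df = pd.DataFrame(data)

# Sort by cumulative time so the expensive internals show up at the top.
cProfile.runctx('df.select_dtypes(include=["float64"])',
                globals(), locals(), sort="cumtime")
```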
So I was reading some portions of the code (starting from the top) and started thinking about this. Firstly, I don't think O(log n) is the right complexity :). We need to determine the dtype of every column, hence our complexity is O(n). Complexity-wise, the optimum we could achieve would be O(1), if we had a dictionary that contains a mapping of dtypes to columns. This, however, would essentially require something like static typing, or at least keeping track of type changes of columns after operations. I'm assuming we are not going to do this. So complexity-wise it's still O(n), but we can bring the constants down. However, I don't have real Cython experience here. Could somebody maybe provide some guidelines on how to tackle this issue?

A different approach: I'm assuming that to infer the dtype, a whole array is analyzed. One could maybe add an option
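A tiny illustration of the "mapping of dtypes to columns" idea; building the mapping is still O(n) in the number of columns, but each requested dtype is then a single lookup. The groupby one-liner below is just one possible way to build it, not how pandas does it internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# One pass over df.dtypes builds the mapping; keys are dtype objects,
# values are the column labels carrying that dtype.
by_dtype = df.columns.to_series().groupby(df.dtypes, sort=False).groups
print(by_dtype[np.dtype("int64")])   # Index(['a'], dtype='object')
```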
Thanks, I think you're right about the complexity stuff. Sorry if I led anyone astray there. I don't understand your comment about inference though. What exactly are we inferring? We shouldn't be passing the values of a Series / DataFrame to infer_dtype, as we already have the dtypes.
Maybe I don't know pandas internals well enough then. ;) (I was thinking too much in the direction of Python not having static typing, but that doesn't make much sense with e.g. numpy, I have to admit. ;)) The profile shows that we are wasting most of our time in infer_dtype. Why are we doing that if we know the dtypes? I mean, if we have the dtypes, e.g. in a list, it should take close to no time to get all the dtypes out. I think I'll try to investigate the code path to see where and why infer_dtype is called.
Thanks. Glancing at the implementation, we do … We may also call it inside the … FYI @datajanko, if you're looking into this I would recommend line_profiler.
gives
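The exact invocation and its output are not preserved above; a minimal sketch of pointing line_profiler at select_dtypes from a plain script might look like the following (in IPython, %load_ext line_profiler plus %lprun -f is the more common route). The frame below is a stand-in, not the one used in the thread:

```python
# Sketch, assuming the line_profiler package is installed.
import numpy as np
import pandas as pd
from line_profiler import LineProfiler

df = pd.DataFrame({f"c{i}": np.arange(3) for i in range(1_000)})

profiler = LineProfiler()
profiler.add_function(pd.DataFrame.select_dtypes)        # record per-line timings for this method
profiler.runcall(df.select_dtypes, include=["number"])   # run the call under the profiler
profiler.print_stats()                                   # hits / time per source line
```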
Thanks for the hint. I'll have a look into this.
From your example, we see that the include_these blocks (and probably exclude_these as well) take the longest. The starmap iteration over each column is inefficient. Actually, we only need to do this once per dtype in self.dtypes, so we would have at most something like 30 hits there. I'll work on the issue ASAP.
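A sketch of that "once per dtype in self.dtypes" idea, outside of the actual pandas code paths. The function name and the simplified include handling (exact dtype matches only, no "number"-style supertypes and no exclude) are assumptions for illustration:

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def select_dtypes_by_group(df, include):
    # Normalise the requested dtypes once.
    include = {pd.api.types.pandas_dtype(d) for d in include}

    # Group column labels by their dtype: one cheap pass over df.dtypes,
    # so the membership test below runs once per distinct dtype, not per column.
    groups = defaultdict(list)
    for col, dtype in df.dtypes.items():
        groups[dtype].append(col)

    keep = [col for dtype, cols in groups.items() if dtype in include for col in cols]
    return df[keep]

wide = pd.DataFrame({f"c{i}": np.ones(3) for i in range(100)})
wide["label"] = "text"
print(select_dtypes_by_group(wide, include=["float64"]).shape)   # (3, 100)
```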
Okay, for small data, this can be easily improved:
changes to
Note that the last line looks awful, and the second line looks nice. What did I do:
So obviously, rewriting the values of the dict and appending one item to a list does not scale well here. A different approach I'll try next is to just … On a slightly different note: I'm not able to install line_profiler.
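Just to make the scaling point concrete, here is a toy contrast (not the WIP code being discussed): repeatedly rebuilding an immutable pandas Index copies it on every step, while accumulating labels in a plain list and converting once at the end stays cheap.

```python
from timeit import timeit

import pandas as pd

cols = [f"c{i}" for i in range(5_000)]

def rebuild_index_each_time():
    acc = pd.Index([])
    for c in cols:
        acc = acc.append(pd.Index([c]))   # copies the whole Index each iteration -> quadratic
    return acc

def append_then_convert_once():
    acc = []
    for c in cols:
        acc.append(c)                     # amortised O(1) per column
    return pd.Index(acc)

print(timeit(rebuild_index_each_time, number=1))
print(timeit(append_then_convert_once, number=1))
```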
Hmm, I'm not sure. Are you pip- or conda-installing it? It does have a C extension; not sure if they have a wheel. A WIP PR is just fine. Make sure to include a new ASV benchmark with a wide-ish DataFrame.
I tried both ways to install it, without success. I'll add an ASV benchmark, probably tomorrow.
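For reference, an ASV benchmark in the style of pandas' asv_bench suite might look roughly like this; the class and method names are made up, and the benchmark actually added for this issue may differ:

```python
import numpy as np
import pandas as pd

class WideSelectDtypes:
    # Parametrise over the number of columns to expose the scaling behaviour.
    params = [100, 1_000, 10_000]
    param_names = ["n_cols"]

    def setup(self, n_cols):
        self.df = pd.DataFrame(np.random.randn(10, n_cols))

    def time_select_dtypes(self, n_cols):
        self.df.select_dtypes(include="number")
```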
Running select_dtypes for a variety of lengths.
This looks O(n) in the number of columns. I think that can be improved (to whatever the complexity of set intersection is).
Edit: maybe it's O(log(n)), I never took CS :)
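A rough reconstruction of the kind of timing run described above; the exact column counts and dtypes behind the original numbers are not shown here, so these are assumptions:

```python
import time

import numpy as np
import pandas as pd

for n_cols in [100, 1_000, 10_000]:
    df = pd.DataFrame(np.ones((2, n_cols)))        # wide, all-float frame
    start = time.perf_counter()
    df.select_dtypes(include=["float64"])
    print(n_cols, round(time.perf_counter() - start, 4))
```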