Skip to content

Ensure isinstance checks use ABC classes where available #27353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
WillAyd opened this issue Jul 12, 2019 · 15 comments
Closed

Ensure isinstance checks use ABC classes where available #27353

WillAyd opened this issue Jul 12, 2019 · 15 comments
Labels

Comments

@WillAyd
Copy link
Member

WillAyd commented Jul 12, 2019

Here's a quick little script I wrote to check for things like isinstance(obj, MultiIndex)

import ast
import pathlib


# Taken from pandas.core.dtypes.generic
BAD_NAMES = {'Index', 'Int64Index', 'UInt64Index', 'RangeIndex',
             'Float64Index', 'MultiIndex', 'DatetimeIndex', 'TimedeltaIndex',
             'PeriodIndex', 'CategoricalIndex', 'IntervalIndex', 'IndexClass',
             'Series', 'DataFrame', 'SparseDataFrame', 'SparseSeries',
             'SparseArray', 'Categorical', 'DatetimeArray', 'TimedeltaArray',
             'PeriodArray', 'Period', 'DateOffset', 'Interval', 'ExtensionArray',
             'PandasArray'}

def check_name_or_attr(filename, node):
    if isinstance(node, ast.Name):  # direct name
        if node.id in BAD_NAMES:
            print(f"{filename}:{node.lineno} isinstance check of {node.id}")
    elif isinstance(node, ast.Attribute):  # ex: pd.DataFrame
        if node.attr in BAD_NAMES:
            print(f"{filename}:{node.lineno} isinstance check of {node.attr}")
    

for path in pathlib.Path('.').glob('**/*.py'):
    filename = path.resolve()

    with open(path) as fh:
        contents = fh.read()
        tree = ast.parse(contents)

        for node in ast.walk(tree):
            try:
                isinstance_check = node.value.func.id == 'isinstance'
            except AttributeError:
                continue

            if isinstance_check:
                types = node.value.args[1]

                if isinstance(types, ast.Tuple):
                    for elt in types.elts:
                        check_name_or_attr(filename, elt)
                else:
                    check_name_or_attr(filename, types)

Running that from project root yields all of the following errors:

/Users/williamayd/clones/pandas/pandas/core/frame.py:4681 isinstance check of MultiIndex
/Users/williamayd/clones/pandas/pandas/core/indexing.py:2499 isinstance check of MultiIndex
/Users/williamayd/clones/pandas/pandas/core/indexing.py:564 isinstance check of MultiIndex
/Users/williamayd/clones/pandas/pandas/core/reshape/tile.py:540 isinstance check of Series
/Users/williamayd/clones/pandas/pandas/core/reshape/concat.py:373 isinstance check of DataFrame
/Users/williamayd/clones/pandas/pandas/core/reshape/concat.py:377 isinstance check of Series
/Users/williamayd/clones/pandas/pandas/core/computation/align.py:72 isinstance check of Series
/Users/williamayd/clones/pandas/pandas/core/arrays/period.py:931 isinstance check of Period
/Users/williamayd/clones/pandas/pandas/core/arrays/period.py:932 isinstance check of Period
/Users/williamayd/clones/pandas/pandas/core/indexes/base.py:3626 isinstance check of MultiIndex
/Users/williamayd/clones/pandas/pandas/core/indexes/base.py:3627 isinstance check of MultiIndex
/Users/williamayd/clones/pandas/pandas/io/formats/excel.py:453 isinstance check of Index
/Users/williamayd/clones/pandas/pandas/io/formats/excel.py:501 isinstance check of Index
/Users/williamayd/clones/pandas/pandas/io/formats/excel.py:552 isinstance check of Index
/Users/williamayd/clones/pandas/pandas/io/formats/excel.py:592 isinstance check of Index

So I think these need to be swapped out for their ABC equivalents

@jbrockmendel
Copy link
Member

If they aren't inducing otherwise-unnecessary runtime imports, the non-ABC versions are fine, aren't they?

FWIW I benchmarked these a long time ago and isinstance(x, ABCFoo) is about twice the time as isinstance(x, Foo), though both numbers are tiny.

@WillAyd
Copy link
Member Author

WillAyd commented Jul 12, 2019

Yea valid points but I'm not sure what percent of these they would apply to. Particular with pandas.core.frame (the first issue) we are only import MultiIndex for is instancechecks when we actually already have the ABCMultiIndex in there as well, so that at least is a very good candidate to change

Noted on performance but we are still in the range of nanoseconds so I think OK unless stuck in a performance critical loop

@jreback
Copy link
Contributor

jreback commented Jul 12, 2019

would be +1 in making imports simpler / more consistent

@WillAyd WillAyd added this to the Contributions Welcome milestone Jul 15, 2019
@souvik3333
Copy link
Contributor

Hi @WillAyd, can you explain bit more about it? I am new here and interested in it.

@WillAyd
Copy link
Member Author

WillAyd commented Jul 16, 2019

Sure - for any of the failures at the bottom of the OP there should be an ABC class defined in pandas.core.dtypes.generic that you could use instead, especially if we only import the objects in the OP for isinstance checks

@addisonlynch
Copy link
Contributor

@WillAyd I think I can tackle this...there may be a number of additional instances that the script is missing because it's not parsing multi-line statements. Consider pandas/core/frame.py (script has identified the below line):

multi_col = isinstance(self.columns, MultiIndex)

But since the ast parser can only handle simple statements, multi-line statements like the following are ignored:

pandas/pandas/core/frame.py

Lines 1737 to 1739 in 5d9fd7e

if isinstance(self.index, MultiIndex):
# array of tuples to numpy cols. copy copy copy
ix_vals = list(map(np.array, zip(*self.index.values)))

In this file, MultiIndex is indeed only imported for the isinstance checks. Just wanted to confirm with you before updating the script (or perhaps you could do so, I'm not an expert with ast so I'm not sure how to tokenize the multi-line statements such as the above)

@WillAyd
Copy link
Member Author

WillAyd commented Aug 26, 2019

Thanks for looking. The script is probably secondary to actually making the changes so if you’ve looked and would like to submit a PR can take things from there

@simonjayhawkins
Copy link
Member

when using the ABCs in isinstance checks the expressions resolves to Any.

so every PR citing this issue is reducing the type checking and not allowing mypy to narrow types.

should we have ABCs that are not created dynamically at runtime or should we only use the ABCs where necessary to eliminate the need for a cast.

the contributing guide would suggest the later in the short term until we have a solution.. "the use of cast is strongly discouraged. Where applicable a refactor of the code to appease static analysis is preferable"

@simonjayhawkins
Copy link
Member

should we have ABCs that are not created dynamically at runtime or should we only use the ABCs where necessary to eliminate the need for a cast.

also could add types to pandas.core.dtypes.generic using cast or add a stub file for pandas.core.dtypes.generic

adding a stub file would be inconsistent with our current approach for a standard internal python module, but is straightforward if we don't need to type check pandas.core.dtypes.generic itself. (it's mainly dynamic in nature).

@WillAyd
Copy link
Member Author

WillAyd commented Sep 17, 2019

FWIW I benchmarked these a long time ago and isinstance(x, ABCFoo) is about twice the time as isinstance(x, Foo), though both numbers are tiny.

By the way @jbrockmendel what you describe might have been resolved in Py37:

https://bugs.python.org/issue31333

Haven't benchmarked personally just sharing as I came across

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 12, 2019

By the way @jbrockmendel what you describe might have been resolved in Py37:

I don't think so. On py 37:

In [58]: %timeit isinstance(df, pd.DataFrame)
67 ns ± 1.15 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [59]: %timeit isinstance(df, ABCDataFrame)
261 ns ± 0.89 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

(so in this case even worse than the factor 2 Brock mentioned before)
For me that is good reason to not categorically always prefer or convert to ABCs. Although it is talking about nanoseconds, we do it very often and it can add up (no concrete example, but I think I see regularly profiles where isinstance check are a significant part of the time)

@kgoehner
Copy link

The current list of files returned by @WillAyd's script after my PR:

/Users/kgoehner/repos/pandas/pandas/core/reshape/tile.py:542 isinstance check of Series
/Users/kgoehner/repos/pandas/pandas/core/reshape/concat.py:372 isinstance check of DataFrame
/Users/kgoehner/repos/pandas/pandas/core/computation/align.py:73 isinstance check of Series
/Users/kgoehner/repos/pandas/pandas/core/arrays/period.py:999 isinstance check of Period
/Users/kgoehner/repos/pandas/pandas/core/arrays/period.py:1000 isinstance check of Period
/Users/kgoehner/repos/pandas/pandas/io/formats/excel.py:453 isinstance check of Index
/Users/kgoehner/repos/pandas/pandas/io/formats/excel.py:501 isinstance check of Index
/Users/kgoehner/repos/pandas/pandas/io/formats/excel.py:552 isinstance check of Index
/Users/kgoehner/repos/pandas/pandas/io/formats/excel.py:592 isinstance check of Index

In all of these files the non ABC classes references are also created or a class method is used so I didn't update them. Based on the data above I'm not sure we want to update these. What other action is left for this issue?

@WillAyd
Copy link
Member Author

WillAyd commented Oct 16, 2019

If that's the remaining items then your PR can close this once merged. Thanks for reviewing @Kazz47

@jbrockmendel
Copy link
Member

Although it is talking about nanoseconds, we do it very often and it can add up (no concrete example, but I think I see regularly profiles where isinstance check are a significant part of the time)

running cProfile over the test suite and sorting by "tottime", isinstance checks are our single biggest cost.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
329396359/329390647   69.347    0.000  107.594    0.000 {built-in method builtins.isinstance}
79506637/79497230   40.886    0.000   41.715    0.000 {built-in method builtins.hasattr}
17055279/17055270   39.755    0.000   39.756    0.000 {method 'reduce' of 'numpy.ufunc' objects}
22943824/22507082   30.802    0.000   39.418    0.000 {built-in method numpy.array}
172919761/171433042   28.380    0.000   41.216    0.000 {built-in method builtins.getattr}
84338244/84338225   26.088    0.000   35.814    0.000 pandas/core/dtypes/generic.py:10(_check)
  1241533   22.396    0.000  175.924    0.000 pandas/core/dtypes/missing.py:357(array_equivalent)
     8781   20.510    0.002   20.510    0.002 {built-in method builtins.compile}
2696852/54   20.450    0.000 1561.251   28.912 pluggy/callers.py:157(_multicall)
14446760/14446728   16.972    0.000   58.695    0.000 numpy/core/fromnumeric.py:73(_wrapreduction)
19009362/17375521   15.377    0.000  137.850    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
   511782   14.742    0.000   14.742    0.000 {method 'truncate' of '_io.FileIO' objects}
 26600241   14.092    0.000   28.120    0.000 ast.py:193(iter_child_nodes)
   136561   13.818    0.000   13.962    0.000 llvmlite/binding/ffi.py:112(__call__)
59616663/49294213   11.565    0.000   20.035    0.000 {built-in method builtins.len}
18484360/18479868   10.954    0.000   54.042    0.000 pandas/core/dtypes/base.py:254(is_dtype)
12306254/12306251   10.727    0.000   24.984    0.000 pandas/core/dtypes/common.py:1602(_is_dtype_type)
9830603/9830392   10.669    0.000   20.154    0.000 pandas/core/dtypes/dtypes.py:98(find)
2696652/2   10.114    0.000 1561.249  780.625 pluggy/hooks.py:275(__call__)
11147463/7260063    8.680    0.000  140.070    0.000 {built-in method builtins.next}
9685486/9685470    8.465    0.000   26.545    0.000 pandas/core/dtypes/common.py:1462(is_extension_array_dtype)
1585598/1553849    8.399    0.000  110.800    0.000 pandas/core/series.py:203(__init__)
773117/758965    8.133    0.000   15.753    0.000 pandas/tseries/offsets.py:94(wrapper)
634273/307348    7.904    0.000   51.150    0.000 pandas/core/indexes/base.py:284(__new__)

@simonjayhawkins
Copy link
Member

@WillAyd closing this due to performance and typing concerns. feel free to reopen if you want the discussion to continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants