PERF: conversion of long python objects to string when using numexpr #40848


Open
2 tasks
gabiteodoru opened this issue Apr 9, 2021 · 25 comments
Labels
Error Reporting (incorrect or improved errors from pandas), expressions (pd.eval, query), Performance (memory or execution speed performance)

Comments

@gabiteodoru

  • I have checked that this issue has not already been reported.

  • [ X ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here

Problem description

[this should explain why the current behaviour is a problem and why the expected output is a better solution]

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here leaving a blank line after the details tag]

@gabiteodoru gabiteodoru added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2021
@phofl
Member

phofl commented Apr 9, 2021

numexpr is an optional dependency and is listed here: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#recommended-dependencies

@gabiteodoru
Author

gabiteodoru commented Apr 9, 2021 via email

@phofl
Member

phofl commented Apr 9, 2021

I am not quite sure what you would expect here? How should we notify users that a dependency is used apart from documenting this?

@phofl
Member

phofl commented Apr 9, 2021

Could you provide something reproducible for the infinite loop?

@mzeitlin11
Member

#34044 and #27639 may be related.

@gabiteodoru
Author

gabiteodoru commented Apr 9, 2021 via email

@gabiteodoru
Author

gabiteodoru commented Apr 9, 2021 via email

@gabiteodoru
Author

As potential solutions: could you make numexpr appear in the output of pip show pandas?
If not, what would you think about printing "using numexpr" on import, the way tensorflow prints messages when it is imported?

Just thinking of ways to make people's lives easier, as it's a tricky bug.
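For readers who want to check this themselves, here is a minimal sketch of how to find out whether an optional package such as numexpr is importable in the current environment. It uses only the standard library; `optional_dependency_version` is a hypothetical helper name chosen for illustration, not a pandas function.

```python
import importlib.util
from importlib import metadata


def optional_dependency_version(name):
    """Return the installed version of `name`, or None if it cannot be imported."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(name) is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found under this name.
        return "unknown"


print("numexpr:", optional_dependency_version("numexpr"))
```

The output of pd.show_versions() also lists numexpr, as the report later in this thread shows.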

@rhshadrach
Member

> OK, so after further investigating the issue, here is reproducing code.
> Note that the issue will disappear by pip uninstall numexpr
>
> %time f(300000)
> Wall time: 11.1 s
>
> So it's actually not an infinite loop, but our table had 3 million rows,
> and it scales over-linearly

On my machine, running with 10_000_000 takes 0.58s; I'd be curious to understand why it takes 11.1s on yours for 300_000. If you remove the query line, what is the execution time for you then?

I agree the error message and the stack trace aren't very helpful here; is that what you mean when you say "tricky bug"? I don't think we can reliably tell when it's not helpful. If that is true, then intercepting and modifying the message may do more harm than good.

pip show is showing requirements, and numexpr is not a requirement, right? Not sure I understand what the suggestion is here.

Related: #32556

@rhshadrach
Member

Also - I don't think the long output is necessarily helpful for this issue, would you mind removing it?

@rhshadrach rhshadrach added Error Reporting Incorrect or improved errors from pandas expressions pd.eval, query Enhancement and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 10, 2021
@gabiteodoru
Author

gabiteodoru commented Apr 10, 2021 via email

@rhshadrach
Member

Yes - 2.7.3

@gabiteodoru
Author

gabiteodoru commented Apr 10, 2021

Hmm... OK, so I created a new conda environment, installed only numpy and pandas, and ran a .py file containing the code I showed above for 300,000 rows:

(myenv) C:\Users\user>conda list
# packages in environment at C:\Users\user\anaconda3\envs\myenv:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
ca-certificates           2021.1.19            haa95532_1
certifi                   2020.12.5        py38haa95532_0
intel-openmp              2020.2                      254
mkl                       2020.2                      256
mkl-service               2.3.0            py38h196d8e1_0
mkl_fft                   1.3.0            py38h46781fe_0
mkl_random                1.1.1            py38h47e9c7a_0
numpy                     1.19.2           py38hadc3359_0
numpy-base                1.19.2           py38ha3acd2a_0
openssl                   1.1.1k               h2bbff1b_0
pandas                    1.2.3            py38hf11a4ad_0
pip                       21.0.1           py38haa95532_0
python                    3.8.8                hdbf39b2_4
python-dateutil           2.8.1              pyhd3eb1b0_0
pytz                      2021.1             pyhd3eb1b0_0
setuptools                52.0.0           py38haa95532_0
six                       1.15.0           py38haa95532_0
sqlite                    3.35.4               h2bbff1b_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wheel                     0.36.2             pyhd3eb1b0_0
wincertstore              0.2                      py38_0

(myenv) C:\Users\user>python d:\temp\pandabug.py
0.04986834526062012

Then I installed numexpr and ran it again:

(myenv) C:\Users\user>conda list
# packages in environment at C:\Users\user\anaconda3\envs\myenv:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
ca-certificates           2021.1.19            haa95532_1
certifi                   2020.12.5        py38haa95532_0
intel-openmp              2020.2                      254
mkl                       2020.2                      256
mkl-service               2.3.0            py38h196d8e1_0
mkl_fft                   1.3.0            py38h46781fe_0
mkl_random                1.1.1            py38h47e9c7a_0
numexpr                   2.7.3            py38hcbcaa1e_0
numpy                     1.19.2           py38hadc3359_0
numpy-base                1.19.2           py38ha3acd2a_0
openssl                   1.1.1k               h2bbff1b_0
pandas                    1.2.3            py38hf11a4ad_0
pip                       21.0.1           py38haa95532_0
python                    3.8.8                hdbf39b2_4
python-dateutil           2.8.1              pyhd3eb1b0_0
pytz                      2021.1             pyhd3eb1b0_0
setuptools                52.0.0           py38haa95532_0
six                       1.15.0           py38haa95532_0
sqlite                    3.35.4               h2bbff1b_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wheel                     0.36.2             pyhd3eb1b0_0
wincertstore              0.2                      py38_0

(myenv) C:\Users\user>python d:\temp\pandabug.py
11.516265153884888

I'm on Windows 10, but I tried the same thing on Ubuntu and got the same result. If you still can't reproduce it, well, I guess I can give up :)

@gabiteodoru
Author

> Also - I don't think the long output is necessarily helpful for this issue, would you mind removing it?

Exactly which output? I have two comments with long error messages :)

@gabiteodoru
Author

gabiteodoru commented Apr 10, 2021

pandabug.py for completeness

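import time

import numpy as np
import pandas as pd

# Print arrays in full instead of summarizing them with "...".
np.set_printoptions(threshold=np.inf)

x = pd.DataFrame({'a': np.array([0, np.nan, None])})

def f(n):
    try:
        x2 = x.sample(n, replace=True)
        # Slow when numexpr is installed.
        x2.query('a.isnull() == False')
    except Exception:
        pass  # any error is deliberately swallowed; only the timing matters

t = time.time()
f(300000)
print(time.time() - t)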
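A possible workaround, sketched here under the assumption that the `engine` keyword of `DataFrame.query` is forwarded to `pandas.eval` as documented: either force the pure-Python engine or avoid `query` altogether with a boolean mask. The row count and `random_state` are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": np.array([0, np.nan, None])})
x2 = x.sample(1000, replace=True, random_state=0)

# Option 1: keep the query string but bypass numexpr explicitly.
kept_query = x2.query("a.isnull() == False", engine="python")

# Option 2: skip query/eval entirely and filter with a boolean mask.
kept_mask = x2[x2["a"].notna()]

print(len(kept_query), len(kept_mask))
```

Both options should select the same rows; the mask version does not depend on which optional packages are installed.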

@rhshadrach
Member

Odd, I do not think it should take that much time for the query to fail. If you're able to profile and find out where the time in query is being spent, that would be helpful, but I understand if that's too much effort. Perhaps someone else can reproduce it.
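One way to gather such a profile, assuming only the standard library's cProfile and pstats modules; `profile_top` is a hypothetical helper name used for illustration.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=10):
    """Run func(*args) under cProfile and return the top entries by own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    # "tottime" sorts by time spent inside each function, excluding callees.
    pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(limit)
    return buf.getvalue()


# Stand-in workload; replace with the function under investigation.
print(profile_top(sorted, list(range(1000, 0, -1))))
```

For the script in this thread, the call would be `profile_top(f, 300000)`.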

@rhshadrach rhshadrach added Performance Memory or execution speed performance and removed Enhancement labels Apr 10, 2021
@jorisvandenbossche
Member

For me, the function is actually raising an error (but it's caught in f(n), so you don't directly see it). Is that correct?
If so, something might be going wrong while raising the error (it might be trying to print the values in the error message, which can blow up).

@jorisvandenbossche
Member

cc @simonjayhawkins in case you can reproduce this on Windows (in an env with numexpr installed)

@phofl
Member

phofl commented Apr 10, 2021

Can reproduce on Windows 10. Machine: 4 cores, 40 GB RAM.

Takes 16 seconds to run through.

profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   300000    9.662    0.000    9.721    0.000 arrayprint.py:695(_extendLine)
 300001/1    2.074    0.000   11.845   11.845 arrayprint.py:718(recurser)
1500278/1500234    0.070    0.000    0.070    0.000 {built-in method builtins.len}
   300000    0.033    0.000    0.033    0.000 arrayprint.py:1142(__call__)
        1    0.011    0.011    0.011    0.011 {built-in method pandas._libs.missing.isnaobj}
    33334    0.005    0.000    0.005    0.000 {method 'rstrip' of 'str' objects}
        1    0.003    0.003    0.004    0.004 {method 'choice' of 'numpy.random.mtrand.RandomState' objects}
        4    0.003    0.001    0.003    0.001 {method 'copy' of 'numpy.ndarray' objects}
        1    0.002    0.002    0.002    0.002 {method 'take' of 'numpy.ndarray' objects}
        1    0.002    0.002    0.002    0.002 {pandas._libs.algos.take_2d_axis0_object_object}
        7    0.001    0.000    0.001    0.000 {built-in method numpy.empty}
        1    0.001    0.001    0.001    0.001 indexers.py:225(maybe_convert_indices)
       20    0.001    0.000    0.001    0.000 {built-in method numpy.array}
        1    0.001    0.001    0.012    0.012 missing.py:244(_isna_string_dtype)
        3    0.001    0.000   11.847    3.949 ops.py:219(<genexpr>)
        1    0.001    0.001   11.846   11.846 arrayprint.py:1365(_array_repr_implementation)
        1    0.001    0.001   11.882   11.882 <ipython-input-4-7727d8aaa8d6>:4(f)
       16    0.001    0.000   11.847    0.740 {method 'join' of 'str' objects}
        1    0.001    0.001   11.883   11.883 <string>:1(<module>)
        1    0.001    0.001    0.008    0.008 managers.py:1426(take)

Everything else is more or less irrelevant.

INSTALLED VERSIONS

commit : 963cf2b
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0.dev0+4095.g963cf2b5a
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.52.0

@rhshadrach
Member

Thanks @jorisvandenbossche @phofl. Previously I made the mistake of not including np.set_printoptions(threshold = np.inf); with this included I can reproduce the long runtime.
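To illustrate why the print option matters, here is a small sketch using NumPy only. It compares the repr of the same array under the default threshold and with summarization disabled. The array size is arbitrary, and `sys.maxsize` is used in place of the thread's `np.inf` because that is what the NumPy documentation suggests for untruncated output.

```python
import sys

import numpy as np

arr = np.zeros(20_000, dtype=bool)

# Default threshold: large arrays are summarized with "...".
summarized = repr(arr)

# Summarization disabled, scoped to this block only.
with np.printoptions(threshold=sys.maxsize):
    full = repr(arr)

print(len(summarized), len(full))
```

The full repr formats every element, which matches the time spent in arrayprint.py in the profile above.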

@gabiteodoru
Author

gabiteodoru commented Apr 10, 2021 via email

@rhshadrach
Member

It's not formatting the error message, it's formatting the expression; in convert:

    def convert(self) -> str:
        """
        Convert an expression for evaluation.

        Defaults to return the expression as a string.
        """
        return printing.pprint_thing(self.expr)

When I run with the default NumPy print threshold, the string is

(array([False, False,  True, ...,  True, False, False])) == (False)
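As a sketch of one way to keep such a string short regardless of the global print options (this is not what pandas does, and `bounded_array_repr` is a hypothetical helper), `np.array2string` accepts a per-call `threshold` that overrides the global setting:

```python
import sys

import numpy as np


def bounded_array_repr(arr, max_items=10):
    """Format an array, summarizing it whenever it has more than max_items elements."""
    # The per-call threshold takes precedence over np.set_printoptions.
    return np.array2string(np.asarray(arr), threshold=max_items, edgeitems=3)


# Even with summarization disabled globally, the helper stays short.
with np.printoptions(threshold=sys.maxsize):
    text = bounded_array_repr(np.arange(100_000))

print(text)
```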

@gabiteodoru
Author

gabiteodoru commented Apr 10, 2021 via email

@rhshadrach
Member

For this example, the a.isnull() gets parsed as an instance of pandas.core.computation.ops.Constant. Is it true that when using numexpr, any instance of Constant must be hashable? We could catch that here, which occurs before any conversion to string.
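A minimal sketch of the kind of hashability check being discussed, using only built-ins. `is_hashable` here is a stand-alone illustration, and whether numexpr actually requires hashable constants remains the open question above.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(False))          # scalars such as bool are hashable
print(is_hashable([False, True]))  # lists are not
```

NumPy arrays are likewise unhashable, so a check like this would reject the boolean array in the example before any string conversion takes place.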

@rhshadrach rhshadrach changed the title from "BUG: pandas uses numexpr if it exists, numexpr is buggy and enters infinite loop on .query('col.isna()'), Anaconda ships with numexpr installed by default, numexpr isn't listed as a requirement of pandas" to "PERF: conversion of long python objects to string when using numexpr" Apr 11, 2021
@rhshadrach
Member

Also, are there hashable python objects that would have long string reps that we can do something about?
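Long tuples and long strings are two examples of hashable objects with long string representations. As one illustration of what could be done about them (not a proposal from the thread), the standard library's reprlib module produces size-limited reprs:

```python
import reprlib

long_tuple = tuple(range(100_000))
long_string = "x" * 100_000

# Both objects are hashable, so a hashability check alone would not filter them out.
hash(long_tuple)
hash(long_string)

# reprlib.repr abbreviates containers and strings to a bounded length.
short_tuple_text = reprlib.repr(long_tuple)
short_string_text = reprlib.repr(long_string)

print(len(repr(long_tuple)), len(short_tuple_text))
print(len(repr(long_string)), len(short_string_text))
```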


5 participants