PERF: conversion of long python objects to string when using numexpr #40848
Comments
Numexpr is an optional dependency and is listed here: https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#recommended-dependencies |
Thanks a lot for the reply! I was just looking at pip show pandas for
dependencies, was not aware they are listed there.
I guess the main point I wanted to raise was -- is it a good idea to use a
non-compulsory dependency library without notifying the user that it's
being used? It makes bugs introduced by optional dependencies really hard
to track, especially when the user hasn't manually installed the
pandas-optional dependency because it was a required dependency of another
library.
But if you guys think it's the best way to go :) Thanks again for all your
work!
|
I am not quite sure what you would expect here? How should we notify users that a dependency is used apart from documenting this? |
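For readers who want to check whether numexpr is present in their environment, here is a minimal sketch. It assumes only the standard library plus pandas' documented `compute.use_numexpr` option; the helper names `optional_dependency_installed` and `numexpr_status` are hypothetical and not part of pandas. `pd.show_versions()` also lists numexpr among the optional dependencies.

```python
import importlib.util


def optional_dependency_installed(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def numexpr_status() -> dict:
    """Report whether numexpr is importable and what the pandas option says.

    Hypothetical helper for illustration only.
    """
    status = {
        "numexpr_installed": optional_dependency_installed("numexpr"),
        "use_numexpr_option": None,
    }
    if optional_dependency_installed("pandas"):
        import pandas as pd

        # The option defaults to True; it only has an effect
        # when numexpr is actually installed.
        status["use_numexpr_option"] = bool(pd.get_option("compute.use_numexpr"))
    return status


if __name__ == "__main__":
    print(numexpr_status())
```

If the report shows numexpr installed even though you never installed it yourself, another package most likely pulled it in as a dependency.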
Could you provide something reproducible for the infinite loop? |
OK, so after further investigating the issue, here is reproducing code.
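Note that the issue disappears after pip uninstall numexpr
import sys
import numpy as np
import pandas as pd

# Disable NumPy's repr summarization so large arrays are printed in full
np.set_printoptions(threshold=sys.maxsize)
x = pd.DataFrame({'a': np.array([0, np.nan, None])})

def f(n):
    try:
        x2 = x.sample(n, replace=True)
        x2.query('a.isnull() == False')
    except Exception:
        # The TypeError raised by query is deliberately suppressed here
        pass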
%time f(100000)
Wall time: 341 ms
%time f(200000)
Wall time: 3.25 s
%time f(300000)
Wall time: 11.1 s
So it's actually not an *infinite* loop, but our table had 3 million rows,
and the runtime scales super-linearly with the number of rows.
The suppressed error is:
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\frame.py", line 3469, in query
    res = self.eval(expr, **kwargs)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\frame.py", line 3599, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\eval.py", line 347, in eval
    ret = eng_inst.evaluate()
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 73, in evaluate
    res = self._evaluate()
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 113, in _evaluate
    _check_ne_builtin_clash(self.expr)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 29, in _check_ne_builtin_clash
    names = expr.names
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 826, in names
    return frozenset(term.name for term in com.flatten(self.terms))
TypeError: unhashable type: 'numpy.ndarray'
Appears to be related to:
https://stackoverflow.com/questions/51878316/pandas-python-series-objects-are-mutable-thus-they-cannot-be-hashed-in-query-me
Hope this helps! :)
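As a possible workaround that avoids DataFrame.query altogether, and with it the expression-to-string conversion seen in the traceback, the same filter can be written as a boolean mask. This is a sketch under that assumption, not something proposed by the thread participants; `non_null_rows` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def non_null_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep the rows where `column` is neither NaN nor None.

    Plain boolean indexing: no query string is parsed, so the
    eval engine (numexpr or python) is never involved.
    """
    return df[df[column].notna()]


if __name__ == "__main__":
    x = pd.DataFrame({'a': np.array([0, np.nan, None])})
    x2 = x.sample(300_000, replace=True)
    print(len(non_null_rows(x2, 'a')))
```

Because no expression object is built, the NumPy print threshold has no influence on this code path.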
|
Forgot to add -- error on CTRL+C
Traceback (most recent call last):
  File "<ipython-input-15-371eac7162d6>", line 1, in <module>
    f(1000000)
  File "<ipython-input-9-8dd07e2969e1>", line 3, in f
    x2.query('a.isnull() == False')
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\frame.py", line 3469, in query
    res = self.eval(expr, **kwargs)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\frame.py", line 3599, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\eval.py", line 347, in eval
    ret = eng_inst.evaluate()
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 73, in evaluate
    res = self._evaluate()
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 109, in _evaluate
    s = self.convert()
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\engines.py", line 55, in convert
    return printing.pprint_thing(self.expr)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 235, in pprint_thing
    result = as_escaped_string(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 211, in as_escaped_string
    result = str(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\expr.py", line 808, in __repr__
    return printing.pprint_thing(self.terms)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 235, in pprint_thing
    result = as_escaped_string(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 211, in as_escaped_string
    result = str(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\ops.py", line 220, in __repr__
    return pprint_thing(f" {self.op} ".join(parened))
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\ops.py", line 219, in <genexpr>
    parened = (f"({pprint_thing(opr)})" for opr in self.operands)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 235, in pprint_thing
    result = as_escaped_string(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\io\formats\printing.py", line 211, in as_escaped_string
    result = str(thing)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\computation\ops.py", line 193, in __repr__
    return repr(self.name)
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 1413, in _array_repr_implementation
    lst = array2string(arr, max_line_width, precision, suppress_small,
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 692, in array2string
    return _array2string(a, options, separator, prefix)
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 468, in wrapper
    return f(self, *args, **kwargs)
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 501, in _array2string
    lst = _formatArray(a, format_function, options['linewidth'],
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 845, in _formatArray
    return recurser(index=(),
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 802, in recurser
    s, line = _extendLine_pretty(
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 715, in _extendLine_pretty
    return _extendLine(s, line, word, line_width, next_line_prefix, legacy)
  File "C:\Users\user\anaconda3\lib\site-packages\numpy\core\arrayprint.py", line 705, in _extendLine
    line += word
|
As a potential solution: could you make numexpr appear in pip show pandas? Just thinking of ways to make people's lives easier, as it's a tricky bug |
On my machine, running with 10_000_000 takes 0.58s. I'd be curious to understand why it's 11.1s on yours for 300_000. If you remove the query line, what is the execution time for you then?
I'd agree the error message and the stack aren't so helpful here; is that what you mean when you say "tricky bug"? I don't think we can reliably tell when it's not helpful. If that is true, then intercepting and modifying the message may do more harm than good.
pip show is showing requirements, and numexpr is not a requirement, right? Not sure I understand what the suggestion is here.
Related: #32556 |
Also - I don't think the long output is necessarily helpful for this issue, would you mind removing it? |
Do you have numexpr installed in the env that takes 0.58s for 10m rows?
|
Yes - 2.7.3 |
Hmm.... OK, so I create a new conda environment, install numpy and pandas only, and run a .py file containing the code I showed above for 300,000 rows
Then I just install numexpr and run it again
I'm on Windows 10, but I tried doing the same thing on ubuntu, and I get the same result. If you still can't reproduce, well, I guess I can give up :) |
Exactly which output? I have two comments with long error messages :) |
|
Odd, I do not think it should take that much time for the query to fail. If you're able to profile and find out where the time in query is being spent, it would be helpful, but I understand if that's too much effort. Perhaps someone else can reproduce. |
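For anyone wanting to do the profiling asked for above, a minimal sketch using only the standard library's cProfile and pstats modules follows. `profile_call` is a hypothetical helper name; the function `f` in the usage note refers to the reproducing code posted earlier in this thread.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile and return a text report.

    The report lists the `top` entries sorted by cumulative time,
    which is usually enough to see where a slow call spends its time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. profile_call(f, 300000)
    print(profile_call(sorted, list(range(100_000, 0, -1))))
```

With the reproducing code loaded, `print(profile_call(f, 300000))` should show which functions dominate the runtime.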
For me, the function is actually raising an error (but it's caught in |
cc @simonjayhawkins in case you can reproduce this on Windows (in an env with numexpr installed) |
Can reproduce on win10. Machine: 4 cores, 40 GB RAM. Takes 16 seconds to run through. profile:
everything else is more or less not relevant.
INSTALLED VERSIONS
commit : 963cf2b
pandas : 1.1.0.dev0+4095.g963cf2b5a |
Thanks @jorisvandenbossche @phofl. Previously I made the mistake of not including np.set_printoptions(threshold = np.inf); with this included I can reproduce the long runtime. |
Yep, it's getting stuck in formatting the error message; it's bouncing
between those top three functions in the profiling. It looks like the
number of calls scales linearly with the number of rows, and I bet the
object being acted on is also proportional in size to the number of rows, so
each function call takes more than constant time. That means the whole thing
scales super-linearly. Call it for 3,000,000 rows now, and it should take
forever.
It's not a big deal, but it's just a *really* hard bug to find, and can be
triggered by something as easy as another package installing numexpr
|
It's not formatting the error message, it's formatting the expression; in convert:
def convert(self) -> str:
    """
    Convert an expression for evaluation.
    Defaults to return the expression as a string.
    """
    return printing.pprint_thing(self.expr)
When I run with the default NumPy print threshold, the string is
(array([False, False, True, ..., True, False, False])) == (False)
|
Ah OK, I just assumed that since np.set_printoptions is necessary to make
it slow, it might have to do with what it's printing, and the only thing it
prints in the end is an error; but I guess I was wrong :)
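The effect of the NumPy print threshold on the size of that expression string can be seen in isolation. The sketch below uses only NumPy's `printoptions` context manager and `repr`; `repr_length` is a hypothetical helper, and the boolean array merely stands in for the array embedded in the expression.

```python
import sys

import numpy as np


def repr_length(n_rows: int, threshold: int) -> int:
    """Length of repr() for a boolean array of n_rows elements.

    NumPy summarizes the output with "..." once the array size
    exceeds `threshold`; otherwise every element is printed.
    """
    arr = np.zeros(n_rows, dtype=bool)
    with np.printoptions(threshold=threshold):
        return len(repr(arr))


if __name__ == "__main__":
    for n in (10_000, 20_000, 40_000):
        summarized = repr_length(n, threshold=1000)      # NumPy's default
        full = repr_length(n, threshold=sys.maxsize)     # as in the report
        print(n, summarized, full)
```

With the default threshold the string length stays constant as the array grows, whereas with `sys.maxsize` it grows with the number of rows, which matches why the slowdown only appears after calling `np.set_printoptions`.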
|
For this example, the |
Also, are there hashable python objects that would have long string reps that we can do something about? |
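On the general question of long string representations, one generic way to bound them, independent of pandas, is the standard library's reprlib module. The sketch below only illustrates that idea; it is not what pandas does nor something proposed in this thread, and `bounded_repr` is a hypothetical helper.

```python
import reprlib


def bounded_repr(obj, max_items: int = 6, max_chars: int = 80) -> str:
    """Return a size-limited repr of obj using reprlib.

    Containers are cut off after max_items elements; strings and
    other objects are shortened to at most max_chars characters.
    """
    limiter = reprlib.Repr()
    limiter.maxtuple = max_items
    limiter.maxfrozenset = max_items
    limiter.maxstring = max_chars
    limiter.maxother = max_chars
    return limiter.repr(obj)


if __name__ == "__main__":
    # A long tuple is hashable and has a long default repr
    print(bounded_repr(tuple(range(1000))))
    print(bounded_repr("x" * 10_000))
```

A bounded representation keeps the cost of building the string independent of the object's size, at the price of no longer showing the full value.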
I have checked that this issue has not already been reported.
[ X ] I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.