Optimization memory leak #51
Is the main code in your `optimize_test` guarded against re-entry? I.e., can you confirm your `optimize_test` is not a fork bomb?
This is the code I am using for testing and seeing the behavior. Very simple. I see it spawn a number of processes equal to the number of CPU cores (as you describe). HTOP shows them gradually consuming more memory as time passes, until the box crashes and I have to reboot it from the EC2 console.
Could be numpy/numpy#12122. Can you check and update your dependencies? What Python version?
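A quick way to report the versions in question (standard attributes only):

```python
import sys
import numpy
import pandas

# Report the interpreter and package versions relevant to numpy/numpy#12122
print(sys.version)
print('numpy:', numpy.__version__)
print('pandas:', pandas.__version__)
```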
Could be https://bugs.python.org/issue27144. You can try applying this workaround in the loop here: backtesting.py/backtesting/backtesting.py, lines 826 to 828 in 41b1ddf.
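For illustration, here is the bpo-27144 workaround applied in a generic `as_completed()` loop — a sketch, not the actual backtesting.py code at those lines (the `simulate` stand-in is hypothetical). Note it pokes at the future's private `_result` attribute:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def simulate(params):
    return sum(p * p for p in params)  # stand-in for one backtest run

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(simulate, (i, i + 1)) for i in range(100)]
        for future in as_completed(futures):
            result = future.result()
            # bpo-27144 workaround: completed futures kept alive in `futures`
            # would otherwise retain their results for the loop's lifetime
            future._result = None
```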
Could be https://bugs.python.org/issue29842. Maybe you can try swapping our executor for this one: https://github.com/mowshon/bounded_pool_executor. backtesting.py/backtesting/backtesting.py, line 823 in 41b1ddf.
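A sketch of that swap, assuming the package's documented API (install name `bounded-pool-executor`; the `simulate` stand-in is hypothetical):

```python
from bounded_pool_executor import BoundedProcessPoolExecutor

def simulate(params):
    return sum(p * p for p in params)  # stand-in for one backtest run

if __name__ == '__main__':
    # Unlike ProcessPoolExecutor, submit() here blocks once the queue is
    # full, so pending tasks (and their pickled arguments) stay bounded
    with BoundedProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(simulate, (i, i + 1)) for i in range(100)]
    for future in futures:
        print(future.result())
```

Bounding submission limits memory held by queued work items, though it can't fix a leak inside the workers themselves.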
Thanks for helping debug this. Per your file, you're searching for the best of 36,100 parameter configurations. You can reduce that by instead passing `short_sma_period=range(10, 200, 5)` (5-step increments). This won't prevent the inherent overfitting, but it should reduce the workload roughly 20-fold.
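For reference, the grid sizes involved:

```python
>>> len(range(10, 200)) ** 2      # both parameters swept in steps of 1
36100
>>> len(range(10, 200, 5)) ** 2   # both parameters in 5-step increments
1444
```

That's ~25× fewer combinations if both ranges are stepped, or ~5× if only one is.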
Hi Kernc, I am running pandas 1.0.2, numpy 1.18.1 and Python 3.7.4. https://bugs.python.org/issue27144 did not resolve the issue. Switching to the bounded_pool_executor slowed the memory leak down but didn't stop it. Interestingly, looking at HTOP, I initially see processes launch for the number of CPUs I have, and then after a period of time another pair of processes will spawn. This spawning of new processes continues indefinitely for as long as the routine is running. I am able to reproduce this on a non-EC2 machine too, by the way. Understood about limiting the parameters; I used those high values in my script for the purpose of reproducing the issue I am seeing. Anything else I can look at?
This, again, leaves me thinking the error is akin to a fork bomb. In your file, I don't see definitions for `get_initial_variables()` and `get_forex_data()`.
How short a period?
Did you happen to try with `future._result = None` applied? I'm not sure whether memory profiling or tracing will help in this multiprocess environment. What if you switch the multiprocessing start method to 'forkserver'?
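A minimal sketch of that suggestion, assuming a script with a `main()` entry point like the one posted in this thread:

```python
import multiprocessing

if __name__ == '__main__':
    # Must be called once, before any pools/executors are created
    multiprocessing.set_start_method('forkserver')
    main()
```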
Hi Kernc, can you show me the block of backtesting.py code with `future._result = None` correctly applied? `get_initial_variables()` is just a routine that takes arguments from the command line to determine how many days' worth of data to backtest, the period (hour, 4h, day), etc., so I don't have to modify the file every time I want to make a change. `get_forex_data()` basically runs `pd.read_csv()`, passing various filenames to the routine for me based on command-line args. I will remove the module references and statically define everything in my `main()` function to see if that fixes it.
Should be here: backtesting.py/backtesting/backtesting.py Lines 826 to 828 in 41b1ddf
`future._result = None` inserted into the loop did not fix it. Here's the full script I'm testing with:
```python
#!/home/ubuntu/venv3/bin/python
import talib
import pandas as pd

from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA, GOOG


class TestStrategy(Strategy):
    short_sma_period = 5
    short_close_period = 5

    def init(self):
        self.SMA = self.I(talib.SMA, pd.Series(self.data.Close), self.short_sma_period)
        self.CLOSE_SMA = self.I(talib.SMA, pd.Series(self.data.Close), self.short_close_period)

    def next(self):
        if crossover(self.data.Close, self.SMA):
            self.buy()
        if self.position and crossover(self.CLOSE_SMA, self.data.Close):
            self.position.close()


def main():
    bt = Backtest(GOOG, TestStrategy, cash=10000, commission=.000)
    stats, heatmap = bt.optimize(short_sma_period=range(10, 200),
                                 short_close_period=range(10, 200),
                                 return_heatmap=True)
    print(stats, heatmap)


if __name__ == "__main__":
    main()
```

I did use a different EC2 VM tonight; however, the HD snapshot is the same.
You can probably further strip the example down by noop-ing `next()`.
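A sketch of what such a stripped-down repro might look like — the dummy parameters are kept only so `optimize()` still has a grid to sweep, and the names are illustrative, not anything from this thread:

```python
from backtesting import Backtest, Strategy
from backtesting.test import GOOG


class NoopStrategy(Strategy):
    a = 1
    b = 1

    def init(self):
        pass

    def next(self):
        pass  # no indicators, no orders: isolates optimize()'s process pool


if __name__ == '__main__':
    bt = Backtest(GOOG, NoopStrategy, cash=10000)
    bt.optimize(a=range(10, 200), b=range(10, 200))
```

If memory still balloons with this, the leak is in the optimization machinery rather than in the strategy or indicator code.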
Does it also keep creating child processes? I think your best bet is to inquire into why so many child processes are created. How many CPU cores does your AWS instance have? There should never be more child processes at any one time than there are CPU cores available (the value of `os.cpu_count()`). I think the problem might be AWS-specific. You could ask on StackOverflow, tagging the question appropriately.
Thanks. I will try on another platform. I won't have time to test again this evening to confirm whether or not additional child processes are still being created. Do you have a known-good platform? Thanks again for the help.
I got the same memory leak problem on my Mac.

```python
import pandas as pd

from backtesting import Backtest, Strategy
from backtesting.lib import plot_heatmaps


def roll_max(arr: pd.Series, n: int) -> pd.Series:
    return pd.Series(arr).rolling(n).max().shift()


def roll_min(arr: pd.Series, n: int) -> pd.Series:
    return pd.Series(arr).rolling(n).min().shift()


class SmaCross(Strategy):
    n_long = 255
    n_short = 255

    def init(self):
        self.max_price = self.I(roll_max, self.data.High, self.n_long)
        self.min_price = self.I(roll_min, self.data.Low, self.n_short)

    def next(self):
        if self.data.Close[-1] > self.max_price[-1]:
            self.buy()
        elif self.data.Close[-1] < self.min_price[-1]:
            self.sell()


if __name__ == '__main__':
    BTC = pd.read_csv('adjusted_BTC_5m.csv', index_col=1,
                      parse_dates=True, infer_datetime_format=True)
    bt = Backtest(BTC, SmaCross, commission=.002)
    result, heatmap = bt.optimize(
        n_long=range(100, 500, 10),
        n_short=range(100, 500, 10),
        constraint=lambda p: abs(p.n_long - p.n_short) < 300,
        return_heatmap=True,
    )
    plot_heatmaps(heatmap)
```

It's just a test strategy, but it used up all my memory.
Right, so the issue seems to be forced parallelization: spawning many processes (not an issue), with each process loading its own separate copy of the input data (← the issue). The two alternative first ideas for workarounds are:
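The two options themselves didn't survive in this copy of the thread, but judging from the eventual fix referenced below (1faa7e8, "passing less objects around (less pickling)"), the gist can be sketched as follows: keep the input data at module level so that, under the 'fork' start method, workers inherit it copy-on-write instead of each task receiving its own pickled copy. All names here are illustrative, not backtesting.py internals:

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

# Module-level store: under the 'fork' start method, child processes
# inherit this mapping copy-on-write, so the DataFrame is not pickled
# and re-materialized once per submitted task.
SHARED = {}

def run_one(n):
    df = SHARED['ohlc']  # read-only access to the inherited data
    return n, df.Close.rolling(n).mean().iloc[-1]

if __name__ == '__main__':
    SHARED['ohlc'] = pd.read_csv('ohlc.csv')  # hypothetical input file
    with ProcessPoolExecutor() as executor:
        for n, sma in executor.map(run_one, range(10, 200)):
            print(n, sma)
```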
Thanks a lot, that seems like it would work; I'll give it a try.
I certainly more than welcome an exploration of the two options above (and a resulting PR), preferably in that sequence, if that's what you mean. 😃
With:

```python
# test.py:
import pandas as pd

from backtesting import Backtest
from backtesting.test._test import SmaCross

if __name__ == '__main__':
    df = pd.read_pickle('/tmp/ohlc_50k.pickle')
    bt = Backtest(df, SmaCross)
    res = bt.optimize(fast=range(10), slow=range(10))
    print(res)
```

Before:

```
% time python test.py
python test.py  216.97s user 2.55s system 682% cpu 800M mem 32.165s total
```

After 1faa7e8:

```
% time python test.py
python test.py  218.09s user 1.22s system 685% cpu 141M mem 32.015s total
```

Certainly might help. If not, reopen.
... by passing less objects around (less pickling) Fixes: kernc/backtesting.py#51
When running an optimize routine, it appears as though the child processes that are created aren't releasing memory after they finish. If I write a simple routine that just evaluates multiple SMA values for a closing-price crossover (it doesn't really matter what), the routine gradually consumes all memory available on the machine and then fails with a `BrokenProcessPool` error: "A process in the process pool was terminated abruptly while the future was running or pending." I am currently running in AWS on an Ubuntu box.
Thanks in advance.
Beginning of the routine:

After 5 minutes:

Backtesting 0.1.2