
Optimization memory leak #51


Closed
sumobull opened this issue Mar 16, 2020 · 15 comments
Labels
bug Something isn't working

Comments

@sumobull

When running an Optimize routine, it appears that the child processes it creates aren't releasing memory after they finish. If I write a simple routine that just evaluates multiple SMA values for a closing-price crossover (the specifics don't really matter), the routine gradually consumes all memory available on the machine and then fails with a BrokenProcessPool error: "A process in the process pool was terminated abruptly while the future was running or pending." I am currently running on an Ubuntu box in AWS.

Thanks in advance.

Beginning of the routine:
[screenshot of memory usage]

After 5 minutes:
[screenshot of memory usage]

  • Backtesting version:
    Backtesting 0.1.2
@kernc
Owner

kernc commented Mar 16, 2020

Is the main code in your optimize_test guarded against re-entry? I.e. can you confirm your optimize_test is not a fork bomb?

concurrent.futures.ProcessPoolExecutor(), used for backtesting optimization, defaults to os.cpu_count() processes, which should be 8 on your AWS instance. To get the strategy object into the child processes, the object is pickled and unpickled at the other end. Unpickling (re)imports the relevant modules, meaning any surrounding global code is (re-)executed.
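A minimal standalone sketch of the guard in question, using only the standard library (the `square` worker function is made up for illustration):

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

# Everything below the guard runs only in the parent process. When workers
# re-import this module (e.g. under the 'spawn' start method, or when
# unpickling objects defined here), the guarded block is skipped. Without
# the guard, each worker would create its own pool and resubmit the jobs,
# recursively spawning processes -- a fork bomb.
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:  # defaults to os.cpu_count() workers
        results = list(executor.map(square, range(8)))
    print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```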

@sumobull
Author

This is the code I am using for testing and seeing the behavior. Very simple.

I see it spawn up the number of processes equal to the number of CPU cores (as you describe). HTOP shows them gradually consuming more memory as time passes until the box crashes and I have to reboot it from the EC2 console.

test_strategy.txt

@kernc
Owner

kernc commented Mar 16, 2020

Well, Python's ProcessPoolExecutor, which we offload work to, is apparently not immune to memory leaks.

Could be numpy/numpy#12122. Can you check and update your dependencies? What Python version?

pip install -U pandas numpy backtesting

Could be https://bugs.python.org/issue27144. You can try applying this workaround in the loop here:

for future in _tqdm(as_completed(futures), total=len(futures)):
    for params, stats in future.result():
        heatmap[tuple(params.values())] = maximize(stats)

Could be https://bugs.python.org/issue29842. Maybe you can try swapping our executor for this one: https://github.com/mowshon/bounded_pool_executor.

with ProcessPoolExecutor() as executor:

Thanks for helping debug this.


Per your file, you're searching for the best of 36100 parameter configurations. You can reduce that by instead passing:

short_sma_period=range(10, 200, 5)  # 5-step increments

This won't prevent inherent overfitting, but should reduce the workload about 25-fold.
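The arithmetic behind the reduction:

```python
# Full sweep: both parameters range over 10..199 (190 values each).
full = len(range(10, 200)) ** 2        # 190 * 190 = 36100 combinations
# 5-step increments shrink each axis to 38 values (10, 15, ..., 195).
reduced = len(range(10, 200, 5)) ** 2  # 38 * 38 = 1444 combinations
print(full, reduced, full // reduced)  # → 36100 1444 25
```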

@kernc kernc added the upstream Issue affects a dependency of ours label Mar 16, 2020
@kernc kernc changed the title Optimize Routine Memory Leak Optimization memory leak Mar 16, 2020
@sumobull
Author

sumobull commented Mar 17, 2020

Hi Kernc,

I am running Pandas 1.0.2, numpy 1.18.1 and Python 3.7.4.

https://bugs.python.org/issue27144 did not resolve the issue.

Switching to the bounded_pool_executor did slow down the memory leak but didn't stop it.

Interestingly, looking at htop, I initially see as many processes launch as I have CPUs, and then after a period of time another pair of processes spawns. This spawning of new processes continues indefinitely as long as the routine is running.

I am able to reproduce this on a non-EC2 machine too btw.

Understood about limiting the parameters. I used those high values in my script to reproduce the issue I am seeing.

Is there something else I can look at?

@kernc
Owner

kernc commented Mar 17, 2020

Interestingly, looking at htop, I initially see as many processes launch as I have CPUs, and then after a period of time another pair of processes spawns. This spawning of new processes continues indefinitely as long as the routine is running.

This, again, leaves me thinking the error is akin to a fork bomb. In your file, I don't see definitions for get_initial_variables() and get_forex_data(). Could either of those be at fault? If I replace those lines with my own forex_data = pd.read_csv(...) loading routine and run the (somewhat reduced) example, it plays out fine on a Linux box, creating at most cpu_count processes, as expected. 🤔

after a period of time

How short a period?

Could be https://bugs.python.org/issue27144. You can try with applied this workaround in the loop here:

Did you happen to try with future._result = None too?

I'm not sure whether memory profiling or tracing will help in this multiprocess environment. What if you switch multiprocessing start method to 'forkserver'? I.e. ctx = mp.get_context('forkserver') and pass that as ProcessPoolExecutor(mp_context=ctx)? Kinda shooting in the dark here. 😓
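A minimal standalone sketch of that start-method suggestion (the `cube` worker function is made up for illustration; this is not backtesting.py's actual executor code):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def cube(x):
    return x ** 3

if __name__ == '__main__':
    # 'forkserver' launches workers from a clean server process instead of
    # forking the (possibly large) parent, which can change both memory
    # behavior and which module-level code gets re-executed in workers.
    ctx = mp.get_context('forkserver')  # Unix only; use 'spawn' elsewhere
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        print(list(executor.map(cube, range(4))))  # → [0, 1, 8, 27]
```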

@sumobull
Author

Hi Kernc, can you show me the block of backtesting.py code with "future._result = None" correctly applied?

get_initial_variables() is just a routine that takes arguments from the command line to determine how many days' worth of data to backtest, the period (hour, 4h, day), etc., so I don't have to modify the file every time I want to make a change. get_forex_data() basically runs pd.read_csv(), passing various filenames based on command-line args. I will remove the module references and statically define things in my main() function to see if that fixes it.

@kernc
Owner

kernc commented Mar 17, 2020

can you show me the block of backtesting.py code with "future._result = None" correctly applied?

Should be here:

for future in _tqdm(as_completed(futures), total=len(futures)):
    for params, stats in future.result():
        heatmap[tuple(params.values())] = maximize(stats)
    future._result = None

i.e. future._result = None inserted into the for future loop, as line 829.

@sumobull
Author

sumobull commented Mar 17, 2020

future._result = None inserted in the module. I also removed all references to my modules and just did from backtesting.test import SMA, GOOG, using the data provided. Same result: it keeps consuming memory until available memory is exhausted.

[screenshot of memory usage]

#!/home/ubuntu/venv3/bin/python

import pandas as pd
import talib

from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA, GOOG

class TestStrategy(Strategy):

    short_sma_period = 5
    short_close_period = 5

    def init(self):
        self.SMA = self.I(talib.SMA, pd.Series(self.data.Close), self.short_sma_period)
        self.CLOSE_SMA = self.I(talib.SMA, pd.Series(self.data.Close), self.short_close_period)

    def next(self):
        if crossover(self.data.Close, self.SMA):
            self.buy()
        if self.position and crossover(self.CLOSE_SMA, self.data.Close):
            self.position.close()

def main():
    bt = Backtest(GOOG, TestStrategy, cash=10000, commission=.000)
    stats, heatmap = bt.optimize(short_sma_period=range(10, 200),
                                 short_close_period=range(10, 200),
                                 return_heatmap=True)
    print(stats, heatmap)

if __name__ == "__main__":
    main()

I did use a different EC2 VM tonight; however, the HD snapshot is the same.

@kernc
Owner

kernc commented Mar 18, 2020

You can probably strip the example down further by no-op-ing the init and next methods, as they are likely not at issue.

Same result. Keeps consuming memory until available memory is exhausted.

Does it also keep creating child processes? I think your best bet is to inquire why there are so many child processes created. How many CPU cores does your AWS instance have? There should never be more child processes at any one time than there are CPU cores available (value of os.cpu_count()).

I think the problem might be AWS-specific. You could ask on StackOverflow, tagging the question amazon-web-services, to get insight from some of their support engineers. Or you could look into passing a different mp_context=, such as 'forkserver', into ProcessPoolExecutor.

@sumobull
Author

Thanks. I will try on another platform. I won't have time to test again this evening to confirm whether or not additional child processes are still being created.

Do you have a known-good platform?

Thanks again for the help.

@XieXiaonan

XieXiaonan commented Mar 31, 2020

I hit the same memory leak on my Mac.

from backtesting import Backtest, Strategy
from backtesting.lib import plot_heatmaps
import pandas as pd


def roll_max(arr: pd.Series, n: int) -> pd.Series:
    return pd.Series(arr).rolling(n).max().shift()


def roll_min(arr: pd.Series, n: int) -> pd.Series:
    return pd.Series(arr).rolling(n).min().shift()


class SmaCross(Strategy):
    n_long = 255
    n_short = 255

    def init(self):
        self.max_price = self.I(roll_max, self.data.High, self.n_long)
        self.min_price = self.I(roll_min, self.data.Low, self.n_short)

    def next(self):
        if self.data.Close[-1] > self.max_price[-1]:
            self.buy()
        elif self.data.Close[-1] < self.min_price[-1]:
            self.sell()


if __name__ == '__main__':
    BTC = pd.read_csv('adjusted_BTC_5m.csv', index_col=1, parse_dates=True, infer_datetime_format=True)
    bt = Backtest(BTC, SmaCross, commission=.002)
    result, heatmap = bt.optimize(
        n_long=range(100, 500, 10),
        n_short=range(100, 500, 10),
        constraint=lambda p: abs(p.n_long - p.n_short) < 300,
        return_heatmap=True,
    )
    plot_heatmaps(heatmap)

Just a test strategy, but it exhausted all my memory.

@kernc
Owner

kernc commented Mar 31, 2020

Right, so the issue seems to be forced parallelization: spawning many processes (not an issue in itself), with each process loading its own separate copy of the input data (← the issue). Two first ideas for workarounds:

  1. see if the DataFrame can be put in shared memory, or if something from here helps: https://stackoverflow.com/questions/22487296/multiprocessing-in-python-sharing-large-object-e-g-pandas-dataframe-between
  2. make parallelization optional.

@XieXiaonan

Thanks, that seems to work; I'll give it a try.
Would you consider adding this to the project?

@kernc
Owner

kernc commented Mar 31, 2020

I certainly more than welcome an exploration of the two options above (and a consequential PR), preferably in sequence. If that's what you mean. 😃

@kernc kernc closed this as completed in 1faa7e8 Apr 19, 2020
@kernc
Owner

kernc commented Apr 19, 2020

With:

# test.py:
import pandas as pd
from backtesting import Backtest
from backtesting.test._test import SmaCross


if __name__ == '__main__':
    df = pd.read_pickle('/tmp/ohlc_50k.pickle')

    bt = Backtest(df, SmaCross)
    res = bt.optimize(fast=range(10), slow=range(10))
    print(res)

Before:

% time python test.py
python test.py  216.97s user  2.55s system  682% cpu  800M mem  32.165s total

After 1faa7e8:

% time python test.py
python test.py  218.09s user  1.22s system  685% cpu  141M mem  32.015s total

Certainly might help. If not, reopen.

@kernc kernc added bug Something isn't working and removed upstream Issue affects a dependency of ours labels Apr 21, 2020
Goblincomet pushed a commit to Goblincomet/forex-trading-backtest that referenced this issue Jul 5, 2023
... by passing less objects around (less pickling)

Fixes: kernc/backtesting.py#51