CI: Use all available cores in the CI #24028

Closed
datapythonista opened this issue Dec 1, 2018 · 9 comments
Labels: CI Continuous Integration


@datapythonista (Member) commented Dec 1, 2018

If I'm not wrong, both Travis and Azure provide two cores for the execution of jobs (in both cases the number doesn't necessarily have to be 2, and could be higher, or I guess lower, depending on their load).

So far we are only parallelizing the execution of the tests, where we are using 2 processes in most cases (except for the tests marked as single).

We should be able to speed up the CI by detecting the number of CPUs available for the job, and parallelizing at least the following (assuming 2 CPUs):

  • Building pandas (adding -j2 to python setup.py build_ext --inplace)
  • Building the docs (adding --num-jobs=2 to ./doc/make.py html)
  • Running the tests with the detected number of CPUs, where we're not already using multiple processes
  • Not sure if there is anything else that can run on more than one core (conda, linting, benchmarks...); see the sketch after this list
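
A minimal sketch of the idea (assuming a Linux CI image; the NCORES variable and the exact flags on each command are illustrative, not the current pandas setup):

```bash
#!/usr/bin/env bash
# Detect how many cores the CI machine actually provides; fall back to 2.
NCORES=$(nproc 2>/dev/null || echo 2)

# Build the C extensions in parallel.
python setup.py build_ext --inplace -j "$NCORES"

# Build the docs with one parallel sphinx job per core.
./doc/make.py html --num-jobs="$NCORES"

# Run the parallel-safe tests with one worker per core.
pytest -n "$NCORES" pandas
```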

Travis specs: https://docs.travis-ci.com/user/reference/overview/#virtualisation-environment-vs-operating-system

UPDATE:

  • Based on tests on my local machine, it doesn't seem like the -j argument to setup.py build_ext has any effect. I'd say it's ignored, and the build is always single core.

Time to run the tests (not slow and not network) in the CI:

1st run:

|        | -j 0     | -j auto  |
|--------|----------|----------|
| azure  | 1443 sec | 804 sec  |
| travis | 1362 sec | 959 sec  |

2nd run:

|        | -j 0     | -j auto  |
|--------|----------|----------|
| azure  | 1461 sec | 839 sec  |
| travis | 1421 sec | 1009 sec |
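
If the -j values above correspond to pytest-xdist's -n option (an assumption on my part; the actual invocation lives in the CI scripts), the two configurations compared would be roughly:

```bash
# Serial run, no xdist workers (the "-j 0" column above).
pytest -m "not slow and not network" pandas

# One worker per detected core (the "-j auto" column above).
pytest -n auto -m "not slow and not network" pandas
```
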
datapythonista added the CI Continuous Integration label on Dec 1, 2018
@bhavaniravi (Contributor):

This sounds interesting, I can take this up next. But I'm not sure where to start exactly.

@datapythonista (Member, Author):

Afaik we're running 3 things in the CI that are worth parallelizing (and I don't think they were being parallelized when I wrote this issue):

  • The docs build
  • Compiling pandas
  • Running the tests

Parallelizing the test runs is actually very tricky; I've got #26949 to experiment with. But let's set that aside for now, focus on the others, and take care of it later.

The docs are built in the last job in Azure, if I'm not wrong, and the entry point is doc/make.py, which has a parameter to control the number of parallel sphinx jobs. I'm not sure what we're calling at the moment; let me know if you can't find it.

For the compilation of pandas, every job calls ci/setup_env.sh, and this is where it is compiled. Compiling is done by a command like python setup.py build_ext --inplace, which receives a -j parameter for the number of jobs. If I remember correctly, it wasn't working for me locally but was working for some other people, and I'm not sure what was happening in the CI (I think I tried).

Probably the first step is to pick one of them, see how many jobs are expected to be used, change the value, and see whether the CI runs faster. I think you set up Azure for your fork; feel free to experiment there, or directly in pandas, whichever you prefer.
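
For instance, a hypothetical way to check locally whether the -j flag makes any difference before touching the CI (note that git clean removes all untracked files under pandas/, including the built extensions, so use with care):

```bash
# Wipe previously built extensions so each build starts from scratch.
git clean -xfd pandas/
time python setup.py build_ext --inplace -j 1

git clean -xfd pandas/
time python setup.py build_ext --inplace -j 4
```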

@bhavaniravi (Contributor):

@jreback and @tomascassidy: #18052 updated the setup instructions in the contributing.rst documentation to python setup.py build_ext --inplace -j 4.

I am unable to find where -j is implemented in the codebase. Can you help me with this? I need to understand how it is implemented in order to work on improving the CI performance.

@datapythonista (Member, Author):

build_ext is a standard thing from setuptools. I'm not sure if there is much documentation about it, but it should be in their docs (see the mention here: https://pythonhosted.org/an_example_pypi_project/setuptools.html#using-setup-py).

setup.py build_ext --help should provide some information, but -j should be the number of cores to use.
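
For example, the option should be listed in the command's help output (the exact wording depends on the Python/setuptools version in use):

```bash
# Show build_ext's options; the parallel jobs flag (-j / --parallel)
# should appear in the list.
python setup.py build_ext --help | grep -i parallel
```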

@bhavaniravi (Contributor):

@datapythonista what's your take on adding caching to the build process? Have we tried anything along those lines before? matplotlib/matplotlib#1507

@datapythonista (Member, Author):

There was a bit of discussion about it before, and the preference was to keep things simple in the CI.

Personally, I think it would make sense to build a docker image or something every night with the latest version of the source code and the conda dependencies, so the build for every PR just downloads what's needed on top of that. But I'm not an expert, and I may be missing something.

The best thing would be to create an issue, so everybody interested can give their opinion and we can share our ideas and experience. Feel free to open it (probably better with a specific proposal, so the discussion is more focused).

@bhavaniravi (Contributor):

> build_ext is a standard thing from setuptools. Not sure if there is much documentation about it, but should be in their docs (see the mention here: https://pythonhosted.org/an_example_pypi_project/setuptools.html#using-setup-py)
>
> setup.py build_ext --help should provide some information, but -j should be the number of cores to use.

@datapythonista it is not part of setuptools; the only other place I see it on the whole internet is in the numpy docs. Looking through their code, they have a custom implementation for parallel processing. We would need to do the same for -j to work:

https://github.com/numpy/numpy/blob/42ac121a0b547bc7c218c7ea318caf6c0a420034/numpy/distutils/command/build_ext.py#L34

@datapythonista (Member, Author):

build_ext is a command defined in distutils (part of the Python standard library) and extended in setuptools. I don't think there is much documentation, but there is some in the Python docs: https://docs.python.org/3/distutils/configfile.html

Here is where -j is defined: https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/distutils/command/build_ext.py#L87

But if I'm not wrong, setuptools wraps it, so you may need to check there if this is not the code that's actually used.
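
Assuming the distutils code linked above is what ends up in effect, the option should be accepted in both its short and long spelling (a sketch, not verified against every setuptools version):

```bash
# Short spelling, as used in contributing.rst.
python setup.py build_ext --inplace -j 4

# Equivalent long spelling declared by distutils ("parallel=", "j").
python setup.py build_ext --inplace --parallel=4
```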

jreback added this to the Contributions Welcome milestone on Oct 6, 2019
@datapythonista (Member, Author):

I think this was already implemented with the PYTEST_WORKERS variable: https://github.com/pandas-dev/pandas/blob/master/ci/run_tests.sh#L23
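
Roughly, the relevant part of that script does something like the following (a simplified sketch; the default value and the marker expression here are assumptions, not the exact pandas code):

```bash
# "auto" makes pytest-xdist start one worker per available core.
PYTEST_WORKERS="${PYTEST_WORKERS:-auto}"

pytest -n "$PYTEST_WORKERS" -m "not slow and not network" pandas
```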
