MLP Tutorials v1.0 #85

nickjbrowning · 2025-04-10T13:27:08Z

No description provided.

github-actions · 2025-04-10T13:27:57Z

preview available: https://docs.tds.cscs.ch/85

github-actions · 2025-04-10T13:29:23Z

preview available: https://docs.tds.cscs.ch/85

boeschf · 2025-04-10T14:40:06Z

docs/guides/mlp_tutorials/llm-inference.md

+[env]
+FI_CXI_DISABLE_HOST_REGISTER = "1"
+FI_MR_CACHE_MONITOR = "userfaultfd"
+NCCL_DEBUG = "INFO"


These environment variables no longer need to be set manually. Instead, consider referring to [ref-communication-nccl] for more details.

boeschf · 2025-04-10T14:42:08Z

docs/guides/mlp_tutorials/llm-nanotron-training.md

+[env]
+FI_CXI_DISABLE_HOST_REGISTER = "1"
+FI_MR_CACHE_MONITOR = "userfaultfd"
+NCCL_DEBUG = "INFO"


henrique

lgtm

henrique · 2025-04-10T15:52:13Z

docs/guides/mlp_tutorials/llm-inference.md

+```
+FROM nvcr.io/nvidia/pytorch:24.01-py3
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
+```


Suggested change

```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

!!! example "Dockerfile"

FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*

Probably better to give the code block a title and file type:

Suggested change

```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

```dockerfile title="Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

henrique · 2025-04-10T15:53:32Z

docs/guides/mlp_tutorials/llm-inference.md

+###  Set up Permissions for the Nvidia NGC Catalog
+
+Some [Nvidia NGC](https://www.nvidia.com/en-us/gpu-cloud) containers can only be downloaded with a valid API token, so we need to set one up. Create an account and setup your API token in the [Nvidia NGC container catalog](https://catalog.ngc.nvidia.com). Then, use your favorite text editor to create a credentials file `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:
+
+```
+machine nvcr.io login $oauthtoken password <API-TOKEN>
+```
+
+Make sure to replace `<API-TOKEN>` with your actual token.
+


Suggested change

### Set up Permissions for the Nvidia NGC Catalog

Some [Nvidia NGC](https://www.nvidia.com/en-us/gpu-cloud) containers can only be downloaded with a valid API token, so we need to set one up. Create an account and setup your API token in the [Nvidia NGC container catalog](https://catalog.ngc.nvidia.com). Then, use your favorite text editor to create a credentials file `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:

```
machine nvcr.io login $oauthtoken password <API-TOKEN>
```

Make sure to replace `<API-TOKEN>` with your actual token.
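If this subsection is kept, a quick way to sanity-check the file is to parse it with Python's standard-library `netrc` module. This is a minimal sketch that assumes enroot's credentials file follows netrc syntax (which the `machine ... login ... password ...` line suggests); `check_credentials` is a hypothetical helper, not part of enroot or NGC tooling.

```python
import netrc
import os
import tempfile

# Example line from the tutorial; <API-TOKEN> is a placeholder, not a real token.
CREDENTIALS_LINE = "machine nvcr.io login $oauthtoken password <API-TOKEN>\n"


def check_credentials(path: str) -> tuple[str, str]:
    """Parse a credentials file (netrc syntax assumed) and return the
    (login, password) pair stored for nvcr.io."""
    entry = netrc.netrc(path).authenticators("nvcr.io")
    if entry is None:
        raise ValueError(f"no 'machine nvcr.io' entry in {path}")
    login, _account, password = entry
    return login, password


if __name__ == "__main__":
    path = os.path.expanduser("~/.config/enroot/.credentials")
    if not os.path.exists(path):
        # No real file yet: demonstrate the check on a temporary copy of the example line.
        with tempfile.NamedTemporaryFile("w", suffix=".credentials", delete=False) as f:
            f.write(CREDENTIALS_LINE)
        path = f.name
    login, _password = check_credentials(path)
    # Print only the login, never the token itself.
    print(f"nvcr.io entry found in {path}, login={login}")
```

A parse error or a missing `nvcr.io` entry here would suggest a typo in the file before enroot ever tries to pull an image.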

Do we actually need this? I was never able to set this up as described... I'd remove the whole subsection?

yeah, I think I never needed this either

henrique · 2025-04-10T15:58:26Z

docs/guides/mlp_tutorials/llm-inference.md

+FI_CXI_DISABLE_HOST_REGISTER = "1"
+FI_MR_CACHE_MONITOR = "userfaultfd"


Suggested change

FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

boeschf:
These environment variables no longer need to be set manually. Instead, consider referring to [ref-communication-nccl] for more details.

henrique · 2025-04-10T16:01:55Z

docs/guides/mlp_tutorials/llm-inference.md

+
+Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B.  We write a Python script `$SCRATCH/gemma-inference/gemma-inference.py` to load the model and prompt it with some custom text. The Python script should look like this:
+
+```


Suggested change

```

```python title="$SCRATCH/gemma-inference/gemma-inference.py"

henrique · 2025-04-10T16:02:49Z

docs/guides/mlp_tutorials/llm-nanotron-training.md

+FI_CXI_DISABLE_HOST_REGISTER = "1"
+FI_MR_CACHE_MONITOR = "userfaultfd"


Suggested change

FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

msimberg · 2025-04-11T07:00:16Z

docs/guides/mlp_tutorials/llm-finetuning.md

@@ -0,0 +1,170 @@
+[](){#ref-mlp-llm-finetuning-tutorial}
+
+# LLM Finetuning Tutorial


Have a look at https://eth-cscs.github.io/cscs-docs/contributing/#style-guide for some general guidelines on styling/formatting. We're not aiming for perfection and nothing here is blocking, but we're trying to move towards some consistency.

msimberg · 2025-04-11T07:37:54Z

docs/guides/mlp_tutorials/llm-finetuning.md

+```
+[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
+[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
+user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
+(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
+# ... pip output ...
+```


Minor: if you'd like this to be easily copy-pasteable, I'd recommend removing the prompts:

Suggested change

```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```

```bash
cd $SCRATCH/gemma-inference
srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
source ./gemma-venv/bin/activate
python -m pip install peft==0.11.1
```

though the prompts might be important in this case since they show the context. In that case:

Suggested change

```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```

```console
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```

for syntax highlighting of the prompts (though Pygments isn't smart enough to recognize the `[...]$` as a prompt, unfortunately; it does recognize simpler prompts, though... no best solution here 🤷).

Comment applies to all the code blocks. Apply as you see fit.

msimberg · 2025-04-11T07:40:06Z

docs/guides/mlp_tutorials/llm-inference.md

+```
+FROM nvcr.io/nvidia/pytorch:24.01-py3
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
+```


Probably better to give the code block a title and file type:

Suggested change

```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

```dockerfile title="Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

msimberg · 2025-04-11T07:41:49Z

docs/guides/mlp_tutorials/llm-inference.md

+# ... more output here ...
+```
+
+where you should replace `<ACCOOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:


Suggested change

where you should replace `<ACCOOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:

where you should replace `<ACCOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:

msimberg · 2025-04-11T07:42:11Z

docs/guides/mlp_tutorials/llm-inference.md

+
+### Set up an EDF
+
+We need to set up an EDF (Environment Definition File) which tells the Container Engine what container to load, where to mount it, and what plugins to load. Use your favorite text editor to create a file `~/.edf/gemma-pytorch.toml` for the container engine. The EDF should look like this:


Maybe link to https://eth-cscs.github.io/cscs-docs/software/container-engine/#concept (overview of EDF)?

msimberg · 2025-04-11T07:43:55Z

docs/guides/mlp_tutorials/llm-inference.md

+
+### Collaborating in Git
+
+In order to track and exchange your progress with colleagues, it is recommended to store the EDF, Dockerfile and your application code alongside in a Git repository in a directory on `$SCRATCH` and share it with colleagues.


directory on $SCRATCH and share it with colleagues.

Is this a good suggestion given the cleanup policy in place? Maybe `$PROJECT` instead? Not sure what the best place to share is... perhaps just recommend a Git repo without mentioning where to store it?

added mlp_tutorials

1e835ca

nickjbrowning requested review from bcumming, msimberg and RMeli as code owners April 10, 2025 13:27

typo.

2c9e5c5

boeschf reviewed Apr 10, 2025


henrique approved these changes Apr 10, 2025


msimberg reviewed Apr 11, 2025



