MLP Tutorials v1.0 #85

Open
wants to merge 2 commits into
base: main

nickjbrowning
Contributor

No description provided.

preview available: https://docs.tds.cscs.ch/85


Comment on lines +95 to +98
[env]
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"

These environment variables need not be set manually anymore. Instead maybe refer to [ref-communication-nccl] for more details.

Comment on lines +72 to +75
[env]
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"

same here

@henrique (Contributor) left a comment

lgtm

Comment on lines +38 to +44
```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

Suggested change (original, then replacement):
```
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
!!! example "Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*


Probably better to give the code block a title and file type:

Suggested change (original, then replacement):
```
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
```dockerfile title="Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```

Comment on lines +17 to +26
### Set up Permissions for the Nvidia NGC Catalog

Some [Nvidia NGC](https://www.nvidia.com/en-us/gpu-cloud) containers can only be downloaded with a valid API token, so we need to set one up. Create an account and setup your API token in the [Nvidia NGC container catalog](https://catalog.ngc.nvidia.com). Then, use your favorite text editor to create a credentials file `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:

```
machine nvcr.io login $oauthtoken password <API-TOKEN>
```

Make sure to replace `<API-TOKEN>` with your actual token.


Suggested change (delete these lines):
### Set up Permissions for the Nvidia NGC Catalog
Some [Nvidia NGC](https://www.nvidia.com/en-us/gpu-cloud) containers can only be downloaded with a valid API token, so we need to set one up. Create an account and setup your API token in the [Nvidia NGC container catalog](https://catalog.ngc.nvidia.com). Then, use your favorite text editor to create a credentials file `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:
```
machine nvcr.io login $oauthtoken password <API-TOKEN>
```
Make sure to replace `<API-TOKEN>` with your actual token.

Do we actually need this? I don't think I was ever able to set this up as described... I'd remove the whole subsection.


Yeah, I don't think I ever needed this either.

Comment on lines +96 to +97
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

Suggested change (delete these lines):
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

boeschf:
These environment variables need not be set manually anymore. Instead maybe refer to [ref-communication-nccl] for more details.


Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B. We write a Python script `$SCRATCH/gemma-inference/gemma-inference.py` to load the model and prompt it with some custom text. The Python script should look like this:

Suggested change: open the code block with `` ```python title="$SCRATCH/gemma-inference/gemma-inference.py" `` instead of a bare `` ``` ``.

Comment on lines +73 to +74
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

Suggested change (delete these lines):
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"

@@ -0,0 +1,170 @@
[](){#ref-mlp-llm-finetuning-tutorial}

# LLM Finetuning Tutorial

Have a look at https://eth-cscs.github.io/cscs-docs/contributing/#style-guide for some general guidelines on styling/formatting. We're not aiming for perfect, nothing is blocking, but we try to move towards some consistency.

Comment on lines +17 to +23
```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```

Minor: if you'd like this to be easily copy-pasteable, I'd recommend removing the prompts:

Suggested change (original, then replacement):
```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```
```bash
cd $SCRATCH/gemma-inference
srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
source ./gemma-venv/bin/activate
python -m pip install peft==0.11.1
```

though the prompts might be important in this case since they show the context. In that case:

Suggested change (original, then replacement):
```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```
```console
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```

for syntax highlighting of the prompts (though Pygments unfortunately isn't smart enough to recognize `[...]$` as a prompt; it does recognize simpler prompts... no best solution here 🤷).

Comment applies to all the code blocks. Apply as you see fit.
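If the prompt-free variant is preferred, the conversion could be scripted rather than done by hand. A minimal sketch, assuming every command line contains a prompt ending in `$ ` and that no output line contains that sequence; `strip_prompts` is a hypothetical helper, not an existing tool:

```python
def strip_prompts(transcript: str) -> list[str]:
    """Return only the commands from a console transcript.

    Assumption: a command line is any line containing "$ "; everything up
    to and including the first "$ " is treated as the prompt. Lines without
    it (such as "# ... pip output ...") are dropped as output.
    """
    commands = []
    for line in transcript.splitlines():
        _prompt, sep, command = line.partition("$ ")
        if sep and command.strip():
            commands.append(command.strip())
    return commands


# The transcript quoted in the comment above.
transcript = r"""
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
"""

print("\n".join(strip_prompts(transcript)))
```

Variables such as `$SCRATCH` and `$PWD` are left untouched because the `$` is not followed by a space. A command that itself contains `$ ` would not be split incorrectly either, since only the first occurrence on a line is treated as the prompt boundary.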

```
# ... more output here ...
```

where you should replace `<ACCOOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:

Suggested change (original, then replacement):
where you should replace `<ACCOOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:
where you should replace `<ACCOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:


### Set up an EDF

We need to set up an EDF (Environment Definition File) which tells the Container Engine what container to load, where to mount it, and what plugins to load. Use your favorite text editor to create a file `~/.edf/gemma-pytorch.toml` for the container engine. The EDF should look like this:

### Collaborating in Git

In order to track and exchange your progress with colleagues, it is recommended to store the EDF, Dockerfile and your application code alongside in a Git repository in a directory on `$SCRATCH` and share it with colleagues.

> directory on `$SCRATCH` and share it with colleagues.

Is this a good suggestion given the cleanup policy in place? Maybe `$PROJECT`? I'm not sure what the best place to share is... maybe just recommend a Git repo without mentioning where to store it?


4 participants