MLP Tutorials v1.0 #85
base: main
Conversation
preview available: https://docs.tds.cscs.ch/85
```
[env]
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"
```
These environment variables need not be set manually anymore. Instead maybe refer to [ref-communication-nccl] for more details.
```
[env]
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
NCCL_DEBUG = "INFO"
```
same here
lgtm
```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
Suggested change:

```
!!! example "Dockerfile"

    FROM nvcr.io/nvidia/pytorch:24.01-py3
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
Probably better to give the code block a title and file type:
Suggested change:

```dockerfile title="Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
### Set up Permissions for the Nvidia NGC Catalog

Some [Nvidia NGC](https://www.nvidia.com/en-us/gpu-cloud) containers can only be downloaded with a valid API token, so we need to set one up. Create an account and set up your API token in the [Nvidia NGC container catalog](https://catalog.ngc.nvidia.com). Then, use your favorite text editor to create a credentials file `~/.config/enroot/.credentials` for enroot. Enroot will be responsible for fetching the container image from NGC behind the scenes. The credentials file should look like this:

```
machine nvcr.io login $oauthtoken password <API-TOKEN>
```

Make sure to replace `<API-TOKEN>` with your actual token.
Do we actually need this? I think I was never able to set this up as described... I'd remove the whole subsection?
yeah, I think I never needed this either
```
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
```
Suggested change: remove these two lines.
boeschf:
These environment variables need not be set manually anymore. Instead maybe refer to [ref-communication-nccl] for more details.
Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B. We write a Python script `$SCRATCH/gemma-inference/gemma-inference.py` to load the model and prompt it with some custom text. The Python script should look like this: (the script follows in a plain `` ``` `` code block)
Suggested change: open the code block with `` ```python title="$SCRATCH/gemma-inference/gemma-inference.py" `` instead of a bare `` ``` `` fence.
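For reference, a minimal sketch of what such an inference script could look like (the model identifier, prompt, and generation settings below are illustrative and not taken from the tutorial):

```python
# Illustrative sketch only -- not the tutorial's actual gemma-inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # assumed Hugging Face model identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keep the 7B model within GPU memory
    device_map="auto",
)

prompt = "Write a short poem about the Alps."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```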
```
FI_CXI_DISABLE_HOST_REGISTER = "1"
FI_MR_CACHE_MONITOR = "userfaultfd"
```
Suggested change: remove these two lines.
```
[](){#ref-mlp-llm-finetuning-tutorial}

# LLM Finetuning Tutorial
```
Have a look at https://eth-cscs.github.io/cscs-docs/contributing/#style-guide for some general guidelines on styling/formatting. We're not aiming for perfect, nothing is blocking, but we try to move towards some consistency.
```
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```
Minor: if you'd like this to be easily copy-pasteable, I'd recommend removing the prompts:
Suggested change:

```bash
cd $SCRATCH/gemma-inference
srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
source ./gemma-venv/bin/activate
python -m pip install peft==0.11.1
```
though the prompts might be important in this case since they show the context. In that case:
Suggested change:

```console
[cluster][user@cluster-ln001 gemma-inference]$ cd $SCRATCH/gemma-inference
[cluster][user@cluster-ln001 gemma-inference]$ srun --environment=gemma-pytorch --container-workdir=$PWD --pty bash
user@nid001234:/bret/scratch/cscs/user/gemma-inference$ source ./gemma-venv/bin/activate
(gemma-venv) user@nid001234:/bret/scratch/cscs/user/gemma-inference$ python -m pip install peft==0.11.1
# ... pip output ...
```
for syntax highlighting of the prompts (though pygments isn't smart enough to recognize the `[...]$` as a prompt unfortunately; it does recognize simpler prompts... no best solution here 🤷).
Comment applies to all the code blocks. Apply as you see fit.
```
FROM nvcr.io/nvidia/pytorch:24.01-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
Probably better to give the code block a title and file type:
Suggested change:

```dockerfile title="Dockerfile"
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y python3.10-venv && apt-get clean && rm -rf /var/lib/apt/lists/*
```
```
# ... more output here ...
```

where you should replace `<ACCOOUNT>` with your project account ID. At this point, you can exit the SLURM allocation by typing `exit`. You should be able to see a new squashfile next to your Dockerfile:
Suggested change: fix the typo `<ACCOOUNT>` → `<ACCOUNT>`.
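For reference, a quick way to check for the squashfile once the allocation is exited (the file name below is illustrative; it depends on the name chosen when the image was built):

```console
$ ls
Dockerfile  pytorch-24.01-py3-venv.sqsh
```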
### Set up an EDF

We need to set up an EDF (Environment Definition File) which tells the Container Engine what container to load, where to mount it, and what plugins to load. Use your favorite text editor to create a file `~/.edf/gemma-pytorch.toml` for the container engine. The EDF should look like this:
Maybe link to https://eth-cscs.github.io/cscs-docs/software/container-engine/#concept (overview of EDF)?
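For context, a minimal EDF along those lines could look like the sketch below (the image path, mounts, and working directory are placeholders, not the tutorial's exact values):

```toml
# Illustrative sketch of ~/.edf/gemma-pytorch.toml -- paths are placeholders
image = "/path/to/pytorch-24.01-py3-venv.sqsh"    # the squashfile built earlier
mounts = ["/path/on/host:/path/in/container"]     # host:container bind mounts
workdir = "/path/in/container"                    # working directory inside the container

[env]
NCCL_DEBUG = "INFO"
```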
### Collaborating in Git

In order to track and exchange your progress with colleagues, it is recommended to store the EDF, Dockerfile, and your application code together in a Git repository in a directory on `$SCRATCH` and share it with colleagues.
> directory on `$SCRATCH` and share it with colleagues.

Is this a good suggestion with the cleanup policy in place? `$PROJECT`? Not sure what's the best place to share... just recommend a git repo without mentioning where to store it?
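Whatever location ends up being recommended, the Git workflow itself is the same; a minimal sketch (the directory, file names, and remote URL below are placeholders):

```console
$ cd <tutorial-directory>        # e.g. somewhere on $SCRATCH or $PROJECT, per the open question above
$ git init
$ git add Dockerfile gemma-pytorch.toml gemma-inference.py
$ git commit -m "Track tutorial EDF, Dockerfile and scripts"
$ git remote add origin git@github.com:<org>/<repo>.git
$ git push -u origin HEAD
```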