# IDT Office Hours Logs

This is a log of the questions / topics discussed in the weekly Office Hours of IDT.

**NOTE:** Make sure to first take a look at [the official Mila docs](https://docs.mila.quebec) and [the official Frequently Asked Questions list](https://docs.mila.quebec/Userguide.html#frequently-asked-questions-faqs) before this!

# NOTE: MOVED to Confluence page: https://mila-iqia.atlassian.net/wiki/external/1955266567/MjYyY2RkNDlkMzU5NDFjMTkwZTBjMTlhZDMyZjA5N2I?atlOrigin=eyJpIjoiZmEyZjJjNGU5MjM2NDg0ZmExZDcwZjlhOTYzY2UzNTYiLCJwIjoiYyJ9

------------------------

# November 22nd, 2022

## Trouble using `mila serve notebook` and `mila code` from a MacBook!

- When using the `mila serve notebook` command, the browser window doesn't connect. (Note: changing "localhost" to 127.0.0.1 doesn't fix it.)
- When using `mila code`, the VSCode window doesn't connect to the remote. (VSCode version 1.73.1 (universal).)
- NOTE: Using VSCode to connect to the cluster (login node) without `mila code` also doesn't seem to work!

Problem: There was a block of unrelated entries at the start of the `~/.ssh/config` file. Removing it fixed everything.

## Figure isn't showing up when using `plt.show()`

- Temporary solution: save the figure with `plt.savefig` and then open it!

## Using Python's `logging` package with Hydra, how do I only show logs from my package?

Not sure, but perhaps there's an answer here:

- https://hydra.cc/docs/tutorials/basic/running_your_app/logging/
- https://hydra.cc/docs/configure_hydra/logging/

------------------------

# November 15th, 2022

## Issue with the Hydra Submitit launcher plugin

Getting this error: `srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.`

Potential solution: Take a look at your sweep config to make sure you're not passing two or more flags that set those variables. You might want to look at https://slurm.schedmd.com/sbatch.html to find out which flags set which variables.

## How do I get started transferring my workflow from my local machine (MacBook) to the Mila cluster?

1. Read through https://docs.mila.quebec/Userguide.html at least once
2. `pip install milatools`
3. `mila init` (go through it)
4. Install VSCode on your local machine
5. `mila code (some_path) --alloc (some_resources)`
6. `mila serve notebook --alloc (some_resources)`

------------------------

# November 8th, 2022

## Getting "Unable to connect" when using `mila code` from a Windows machine

- Also getting a weird dialog window with "Select OS on host cn-..." (Linux / Windows / MacOS).

Part of the solution appears to be to open a folder that is not on the $HOME filesystem (and also to not open the entire $HOME folder with VSCode).

Inconclusive result: this worked for someone, but not for someone else!

- Login nodes are also getting hammered at the moment. That might have something to do with it.

## There is a bug in a package I'm using. How can I change what it is doing (i.e. "patch" it) from outside the package?

The solution: You can overwrite just about anything you want in Python, including classes, methods, functions and variables from all the libraries you use. Practically speaking, you can overwrite the buggy methods / functions / classes of a package with your own:

<details>
<summary>Replacing a class with your own version:</summary>

```python
import torch.optim


class MyPatchedAdamOptimizer(torch.optim.Adam):
    ...


torch.optim.Adam = MyPatchedAdamOptimizer
```

</details>
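The same trick works for a module-level function. A minimal sketch, where `some_library.utils.load_data` is purely hypothetical (substitute the actual module and function you need to patch):

<details>
<summary>Replacing a module-level function (hypothetical example):</summary>

```python
import some_library.utils  # hypothetical module containing a buggy `load_data` function

# Keep a reference to the original implementation so the patch can delegate to it.
_original_load_data = some_library.utils.load_data


def _patched_load_data(path):
    # Work around the bug here, then fall back to the original implementation.
    return _original_load_data(str(path))


some_library.utils.load_data = _patched_load_data
```

</details>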
<details>
<summary>Example for changing a method on a class (e.g. the HuggingFace Trainer):</summary>

```python
"""Patches the `get_optimizer_cls_and_kwargs` staticmethod of the Trainer class to return the
MuP variants of the SGD, Adam, or AdamW optimizers."""
import logging
from typing import Callable

from torch.optim import Optimizer
from transformers import Trainer, TrainingArguments

logger = logging.getLogger(__name__)

# Keep a reference to the original implementation.
_get_optimizer_cls_and_kwargs = Trainer.get_optimizer_cls_and_kwargs


def _custom_get_optimizer_cls_and_kwargs(
    args: TrainingArguments,
) -> tuple[Callable[..., Optimizer], dict]:
    # Use the base class to get the optimizer class and kwargs.
    optimizer_cls, kwargs = _get_optimizer_cls_and_kwargs(args)
    # Do whatever you want to do ...
    logger.info(f"Using MuP optimizer: {optimizer_cls} with kwargs: {kwargs}")
    return optimizer_cls, kwargs


Trainer.get_optimizer_cls_and_kwargs = _custom_get_optimizer_cls_and_kwargs
```

</details>

<details>
<summary>Fixing some backward-incompatible changes in Gym:</summary>

```python
# ======== Start of patch =========
# (Put this in your code, before any environments are created.)
import numpy
from numpy.random._generator import Generator


class MyGenerator(Generator):
    def randint(
        self: Generator,
        low: int,
        high: int,
        size=None,
        dtype="l",
        endpoint: bool = True,
    ):
        """Replacement for the deprecated `randint`, using the `Generator.integers` method
        instead."""
        return self.integers(low, high, size=size, dtype=dtype, endpoint=endpoint)


# Optional: Overwrite the attribute on the `numpy.random` module so that any
# `from numpy.random import Generator` returns the `MyGenerator` class.
numpy.random.Generator = MyGenerator

# Now, let's fix the ugly backward-incompatible change in gym.
# NOTE: This isn't actually necessary, because if you take a look at this file:
# https://github.com/openai/gym/blob/6a04d49722724677610e36c1f92908e72f51da0c/gym/utils/seeding.py
# the last line uses `np.random.Generator`, which we already replaced with our `MyGenerator` class
# above. So there's no need to actually do this here.

# import gym.utils.seeding
#
# def _patched_np_random(seed: int = None) -> tuple[MyGenerator, int]:
#     """Replacement for `gym.utils.seeding.np_random` that uses the `MyGenerator` class instead of
#     `numpy.random.Generator`. MyGenerator has a `.randint` method so the old code from marlenv
#     can still work."""
#     from gym import error
#     if seed is not None and not (isinstance(seed, int) and 0 <= seed):
#         raise error.Error(f"Seed must be a non-negative integer or omitted, not {seed}")
#     seed_seq = numpy.random.SeedSequence(seed)
#     np_seed = seed_seq.entropy
#     # rng = RandomNumberGenerator(numpy.random.PCG64(seed_seq))
#     rng = Generator(numpy.random.PCG64(seed_seq))
#     return rng, np_seed
#
# # Overwrite the function with your own!
# gym.utils.seeding.np_random = _patched_np_random
# ======== End of patch =========
```

</details>

------------------------

# November 1st, 2022

## Wandb asks me to log in during every run when using sbatch

-> Potential solution: add this to a `~/.bash_aliases` file (which gets loaded by `.bashrc`, which gets executed at the start of jobs):

```bash
export WANDB_API_KEY="YOUR_API_KEY_HERE"
```

This will "log you in" globally. You can find your API key in the "Settings" tab of your wandb profile.

## Having trouble setting up passwordless SSH to the Mila cluster from a Windows machine

Issue: `mila init` runs `ssh-copy-id`, which isn't available on Windows machines. We're trying to re-create the equivalent of `ssh-copy-id` by adding the contents of `~/.ssh/id_rsa.pub` from the local machine to the `~/.ssh/authorized_keys` file on the cluster.

SOLUTION: Remember to `chmod 600 ~/.ssh/authorized_keys` if not done already! This file should only be readable by you! This is mentioned here: https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes
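For reference, a rough PowerShell equivalent of `ssh-copy-id` (a sketch only, assuming the `mila` host alias created by `mila init` and the default `id_rsa.pub` key name):

```console
Get-Content ~\.ssh\id_rsa.pub | ssh mila "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"
```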
## Why does `srun --multi-prog` run much faster than `srun` with `--ntasks`?

------------------------

# October 25th, 2022

## Does anyone have experience deploying a SLURM cluster on Google Cloud?

- Not quite, but this tutorial looks interesting: https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine

## Having trouble getting *large* GPU jobs (runtime of 3 days) allocated on Cedar

- Hmmm.

------------------------

# October 18th, 2022

## Question: How do I connect to the cluster from a Windows machine?

1. Optional: Install Windows Subsystem for Linux (WSL) to get a Linux-like shell on your computer
2. `pip install milatools`
3. `mila init` --> go through the steps
4. Either `mila code some_folder_name` or `ssh mila`

------------------------

# October 11th, 2022

## libffi.so.6: cannot open shared object file: No such file or directory

This is a weird C++ library issue that has been happening for a while.

Fix: https://mila-umontreal.slack.com/archives/CFAS8455H/p1659965170663569?thread_ts=1659964343.092399&cid=CFAS8455H

- `module load libffi`

------------------------

# October 4th, 2022

## How to debug code that is running on the cluster?

Answer: Take a look at https://docs.mila.quebec/Userguide.html#mila-init

- `pip install milatools`
- `mila init`
- `mila code path/to/folder --alloc resources_you_need`

## How to monitor GPU utilization?

- Wandb
- `nvidia-smi -l (interval in seconds)`
- ...

## Problem with ComputeCanada: "The repo code I'm using tries to download a dataset, but the compute node doesn't have internet, so I get an error."

- No easy solution, unfortunately :(
- If you have the dataset already downloaded, does the code still try to access the internet?
  - Yes. (Weird, why? --> unclear)
- Your best bet is probably to edit the code of the library yourself (git clone it and pip install it in editable mode).

## Trouble using the VSCode debugger on the Mila cluster

Context: *not* using `mila code`. The debugger gives a weird error: it couldn't activate the conda environment, and it was on the wrong node.

### Job is "stuck" after a certain amount of time (e.g. 2 hours): certain plots in wandb (loss) stop updating, while others (resource utilization) keep updating

This code prints things to stdout. Inspecting the output file with `tail slurm-xxxx.out -n 10` revealed that the training process was being killed by the out-of-memory (OOM) killer.
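A quick way to check for this kind of failure (the job id below is a placeholder): look at the end of the job's log file, and ask SLURM's accounting database about the job's final state (an OOM-killed job typically shows `OUT_OF_MEMORY` or a non-zero exit code).

```console
tail -n 50 slurm-1234567.out
sacct -j 1234567 --format=JobID,State,ExitCode,MaxRSS,ReqMem
```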
## Using Orion, I feel like I had to code up a whole bunch of boilerplate on top of it just to have nicely encapsulated experiments / results / etc.

Context: Using Orion's command-line API with config files. Argument parsing is done with fastargs.

- Maybe use the `trial.working_dir`?
- Folder with a `.pickle` db file, a `slurm_out` folder, a `slurm_err` folder, a results folder, and the shell file.

## Having trouble using RLlib: it seems to be unable to allocate resources

Getting an error message saying that the cluster only has 1.0 CPU and 1.0 GPU, and to stop the tuning job and adjust the requested resources.

------------------------

# September 27th, 2022

## VSCode remote debugger doesn't work with the McGill cluster

Starting the 'debug' session: it starts, and then just closes. No real logs.

- Able to run the file with the "Run Python File" option. The "Debug Python File" option doesn't quite work.

Solution: `debugpy` isn't compatible with Python 3.6. Upgrading the Python version (via e.g. conda) solved it.

## How to work on a notebook on the cluster / How to select the environment for a given notebook

`mila serve notebook (path to notebook directory here)`

This will take you through the setup steps (choosing the modules / environment / installing Jupyter Notebook, etc.).

- Unable to install Jupyter Notebook in the virtualenv! (No resolution. We switched to conda and it worked.)

### Conda envs weren't being detected

--> Solution: The `conda` command wasn't available unless a module was loaded. Running `conda init` then added a block of code to `.bashrc`, which activates the `base` environment by default. Now VSCode is able to run `conda env list` (or similar) to get the list of conda environments that the user can then choose from in the "Select Interpreter" drop-down.

------------------------

# September 20th, 2022

(No questions today!)

------------------------

# August 30th, 2022

## Job running fine in interactive mode fails when using sbatch (dataloaders killed)

- Seems like the user wasn't passing the `--mem` flag to `mila code`, resulting in the default value for CPU memory (~2 GB or so).
- Also seems like the user was passing `--gpus=0` to the training script with PyTorch-Lightning, which caused the job to run out of CPU memory.

------------------------

# August 23rd, 2022

## Git: How to "merge" multiple commits into one?

I recommend you take a look at the `git rebase --interactive` command. Using it, you can drop, edit, squash, and reorder commits as you see fit.

Follow-up question (context: editing a notebook that pip-installs a repo): Can I avoid having to commit and push every time I make a change to the code?

- Short answer: no.

Follow-up suggestions: `git commit -s`, `git config core.editor`

## Using VSCode with `mila code`, editor state only persists if running on the same compute node or on a login node

When re-opening a directory with `mila code`, the VSCode editor state (open windows, etc.) doesn't persist.

Potential solutions (for the moment):

- Try creating a workspace from your directory.
- Write a little script to set up your editor (quite hacky!)

```bash
#!/bin/bash
# foo.sh
code src/train.py
code src/eval.py
```

Also, opening a workspace file inside a remote VSCode window doesn't seem to work properly.

## Simple-Parsing issue with nested subgroups

Created a new issue: https://github.com/lebrice/SimpleParsing/issues/160

------------------------

# August 16th, 2022

## Scaling up a network (e.g. to ImageNet)

- How to go about doing that?
- Adaptive optimizers?
- Fisher Information matrix interaction/link with Adam?
  - --> Ask on the Optimization Slack channel, maybe?
- How to do ImageNet on the cluster?
  - https://mila-umontreal.slack.com/archives/CFAS8455H/p1659464809056259?thread_ts=1659386269.124849&cid=CFAS8455H
  - https://gist.github.com/lebrice/4a67df47d9fca3e199d3e7686396240c
  - FFCV version (slower with PL, faster with plain PyTorch): https://gist.github.com/lebrice/37d89c29388d7fc9ce267eed1ba6dbda

Example:

```python
# from imagenet_datamodule import ImagenetDataModule
# datamodule = ImagenetDataModule()
from imagenet_ffcv import ImagenetFfcvDataModule

datamodule = ImagenetFfcvDataModule()
datamodule.prepare_data()
train_dataloader = datamodule.train_dataloader()
for batch in train_dataloader:
    ...
```

- How to scale a PyTorch model to multiple GPUs? Two main options IMO:
  - Use a framework like PyTorch-Lightning and let it take care of that for you (see the sketch below)
  - Directly use `DataParallel` / `DistributedDataParallel` / etc. from PyTorch
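A minimal sketch of the first option, assuming a reasonably recent PyTorch-Lightning version and placeholder `model` / `datamodule` objects:

```python
from pytorch_lightning import Trainer

# Let Lightning set up DistributedDataParallel across 4 GPUs on one node.
trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")
trainer.fit(model, datamodule=datamodule)  # `model` and `datamodule` are placeholders
```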
## Giving remote access to a local machine

-------------

# August 9th, 2022

## Having trouble understanding how to use the Python / command-line API of Orion

There's something confusing about the documentation ("scaling workers").

Points of confusion:

- Python API vs command-line API?
- `workon` vs `hunt` vs `suggest`/`observe`?
- Where does the parallelism happen?
  - Answer: The storage being the same is the key, but that isn't clear just from the docs.
- Pros and cons of the available algorithms.

<details>
<summary>Example of how to use the Python API</summary>

```python
# inside main.py
from orion.client import build_experiment


def main():
    experiment = build_experiment(
        name="foo",
        space={
            "x": "uniform(0, 1)",
        },
        storage={
            "type": "legacy",
            "database": {"type": "pickleddb", "host": "db.pkl"},
        },
        algorithms="tpe",
        max_trials=10,
    )
    while not experiment.is_done:
        trial = experiment.suggest()
        params = {}  # set fixed parameters.
        params.update(trial.params)  # update with the sampled parameters (here, "x").
        # execute the (user-defined) training function
        val_loss = train(**params)
        experiment.observe(
            trial, results=[dict(type="objective", name="foo_val_loss", value=val_loss)]
        )
```

</details>

```console
orion hunt -n some_name --lr~loguniform(1e-8, 1e-3, default=3e-4)
```

Weird workflow that came up (context: needs 2 CPUs, 128 GB of RAM, and 1 GPU for a large dataset):

- "I hold a compute node hostage"
- "I open a screen on the login node"
- "salloc inside the screen"

Suggestion: Could we add a `--no-cancel-on-close` option to `mila code`? The idea is to be able to re-connect to the same job, instead of having to re-schedule it when the connection is lost.

- Could even do something smarter: "If you ask for the same job with the same folder, alloc, etc., and we see there is already a job with those parameters, then just connect to that running job instead of creating a new one."

--------------

# August 2nd, 2022

## Can we get a static IP for our group's new computer?

-> Ask IT support.

Follow-up: Can we use an ethernet port in our Prof's room?

- Yes, just let IT know the port number, and if you don't have one in the room, IT will help you out.

Follow-up #2: Can we get a public hostname for accessing our computer from the outside?

## Any way to get Java on the cluster?

Option 1: Send an email to it-support@mila.quebec

Option 2: Use Apptainer (a.k.a. Singularity) to build an image (on your local machine, since you don't have root on the Mila cluster) and then run it on the cluster.

Follow-up: Why not use Docker on the cluster?

- Short answer: Docker recipes run as root, and can run arbitrary code, as root, on our cluster. This is too dangerous.
- We've looked (or will look) at the following alternatives to Docker:
  - Podman
  - Shifter?

## Using a lot of IO on $HOME, how can I diagnose the problem?

`strace -f -e trace=%file PROGRAM_TO_BE_WRAPPED`

https://dashboard.server.mila.quebec/d/Cko1h0z7z/beegfs-user-metrics-home-mila?orgId=1&refresh=5s&from=now-1h&to=now

--------------

# July 26th, 2022

## Sharing a server / How to port-forward to a compute node

## How to ask for a GPU with `mila code`? Can't get a CPU-only job to be scheduled without asking for a GPU.

Yes, there's currently an issue where you have to ask for a GPU in order for your job to get scheduled. You can add the usual sbatch flags after the `--alloc` option:

```console
mila code Tutorials --alloc --mem=8G --cpus-per-task=8 --gres=gpu:1
```

## Could we use the F-01 (tech lab?) for the RL summer school?

## Using FLWR (federated learning), getting a memory leak when computing model accuracy

- When storing things in a list, make sure that the tensors are detached (e.g. `predictions.detach()`), as in the sketch below.
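A minimal sketch of that pattern, with `model` and `dataloader` as placeholders. Keeping the raw outputs in a list also keeps every batch's autograd graph alive, which looks exactly like a memory leak:

```python
all_predictions = []
for batch in dataloader:  # `model` and `dataloader` are placeholders
    predictions = model(batch)
    # Keep only the values: `.detach()` drops the autograd graph (which would otherwise be
    # kept alive for every stored batch), and `.cpu()` frees the GPU copy.
    all_predictions.append(predictions.detach().cpu())
```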
### Part 2: How to know how much memory will be needed for each client?

## Problems with restoring a model using PyTorch-Lightning

Context: Using the Seasonal

-------------

# July 12th, 2022

IDT Attendees: @normandf and Guillaume Alain

## MLFlow questions

Context: Currently using both PyTorch-Lightning and MLFlow. PL puts its outputs into the `outputs` folder, but MLFlow puts things (run metadata, hyper-parameters, etc.) into an `mlruns` folder, which can be configured via an environment variable.

- How do you consolidate the things from MLFlow and PL, for instance model checkpoints?
  - Answer: I don't know. But perhaps using this? https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.loggers.mlflow.html#mlflow-logger
- How do you avoid having two people use the same MLFlow directory, given that MLFlow doesn't take everything into account when hashing the params to get the `mlruns` directory?
  - Maybe don't use a shared directory in /network/projects for your MLFlow runs (at first). First write to your "individual" MLFlow dir, and then copy over to the shared directory?
- Where to place datasets when we want to share them within a project?
  - Can we place them in our folder within /network/projects? (Yes)

## DataParallel question

Context: I'm using DataParallel with my PyTorch module. My batch size is fixed. I'd like to scale to more GPUs and/or more nodes. I'm having trouble managing devices and memory with DataParallel.

Suggestions:

- Consider using something like HuggingFace's accelerate, or PyTorch-Lightning, and letting it take care of the parallelization for you.
  - HuggingFace documentation: https://huggingface.co/docs/transformers/accelerate
  - PyTorch-Lightning documentation: https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu.html
- If you want to do things yourself, look at the DistributedDataParallel how-tos and guides in the PyTorch documentation.

--------------

# July 5th, 2022

IDT Attendees: @normandf and the one and only @obilaniu (Olexa)

## Calculating "accuracy" for an NLP model takes forever (~600 hours), how can I make it faster?

### Situation

Dataset of size D, number of patterns P. Currently it looks a bit like this:

```python
matcher = Matcher(patterns=P)
for pattern in P:
    for dataset_entry in D:
        predictions = model(pattern, dataset_entry)
        calculate_accuracy(predictions, dataset_entry)
```

### Solution: Parallelism

In order of priority:

1. Batch over the entire dataset for a given pattern (if possible).
2. Minimize the creation of objects etc. inside the double for-loop; move things up one level.
3. Use `multiprocessing.Pool`, starting from the simplest example here: https://docs.python.org/3/library/multiprocessing.html, and maybe passing the models / matchers as arguments to the function that you map (see the sketch after this list).
4. Use `multiprocessing` with custom `Process`es, so that you don't incur the overhead of copying the shared `Matcher`.
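A minimal sketch of option 3. Everything here (`Matcher`, `model`, `calculate_accuracy`, `D`, `P`) refers to the user's own code from the snippet above, so treat this as pseudocode with a `Pool` wrapped around the inner loop:

```python
import multiprocessing as mp
from functools import partial


def evaluate_entry(dataset_entry, pattern):
    # On Linux, worker processes inherit `model` via fork; otherwise, pass it as an argument too.
    predictions = model(pattern, dataset_entry)
    return calculate_accuracy(predictions, dataset_entry)


if __name__ == "__main__":
    results = {}
    with mp.Pool() as pool:
        for pattern in P:
            # Parallelize the inner loop over dataset entries, one pattern at a time.
            results[pattern] = pool.map(partial(evaluate_entry, pattern=pattern), D)
```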
## DGL error with a torch sparse operation

(https://mila-umontreal.slack.com/archives/CFAS8455H/p1656965418230929)

Seems like there are conflicts between multiple installations of DGL (pip + conda), and between different versions (0.7 and 0.8). Had a good chat with @obilaniu, and we'll see if using a single version / source (preferably conda) helps.

## Can't build / install / run mujoco-py on the ComputeCanada clusters

Do I need root access to run some GCC-related stuff?

Answer: No. Load the mujoco module; that should save you some trouble. Then, if you still have trouble installing mujoco-py, let us know! (There might be some people who've gotten MuJoCo to work on CC; ask around on #ComputeCanada.)

--------------

# June 21st, 2022

IDT Attendees: In person: Soline Blanc. Remote: @normandf, satya.ortiz-gagne@mila.quebec

## How to use the wandb sweep API on the ComputeCanada cluster?

A: I don't think that's possible. Asking Mattie Tesfaldet on Slack if they've used it before. Seems unlikely that it would work.

Q: How can I get the "sweep id" which is normally output by the `wandb sweep` command?

A: Not sure how to do that via the command line, but you can create a sweep via the GUI instead, and use the sweep ID from there.

## Using BERT, getting OOM

Context: Saving the outputs of the last 4 layers vs saving only the last layer's outputs. The first works fine, but the second gets OOM, even though it should in principle use less memory.

- When saving tensors in a list (e.g. model representations), make sure to call `.detach()` explicitly. Otherwise, the entire computational graph that created these tensors is also kept alive in memory, and your program will crash!

One very easy way to check if this is what's happening:

- Is your program crashing after at least one successful iteration/batch?
  - If yes, then this is likely the issue!
  - If not, then there is some other issue.

---------------

# June 21st, 2022

IDT Attendees: @normandf, Olivier Breleux, Soline Blanc

## Using Orion with MongoDB storage

- How to cleanly separate different projects?
- Very slow to get the newest experiment to show up with `orion status`.
- One level above "experiment": a "Project"?
  - Has already been suggested: https://github.com/Epistimio/orion/issues/815
- Current work-around idea: `orion db archive path/to/db_pickle.pkl`

## Can't connect to the Mila cluster: "Authentication failed"

- You need to wait 24h after the intro-to-the-Mila-cluster onboarding.

## Can't use remote debugging (breakpoints) on the Mila cluster when using `mila code`

Solved: The Python extension wasn't installed on the remote!

- One good way to go is to enable "Settings Sync" in VSCode, so that the extensions get synced automatically whenever a VSCode window is launched.
- It's also important to note that the SLURM-related environment variables (e.g. `SLURM_TMPDIR`) aren't readable from the VSCode terminals that you get with `mila code`. You do, however, get the `SLURM_JOB_ID` variable, which can be used to recover the `SLURM_TMPDIR`.

---------------

# June 14th, 2022

IDT Attendees: @normandf, Soline Blanc, Guillaume Alain

## Alpha metric for SSL representation learning

- Q: How to generalize the current setup to other models / etc.? How to make it easier for other researchers to also compute / use that metric?
- Hypothetical follow-up: How to do this kind of "sweep over types of models to find the ones with the best alpha, and then run a linear probe on only the most promising ones"?

Two-part answer:

1. Distribute it as a pip-installable GitHub gist, e.g. https://gist.github.com/lebrice/728f0e0218dcef6f74cbebd70af5857e
2. Use Orion as part of your sweeps, maybe?

## Using JAX, getting GPU OOM errors; memory usage grows much faster than expected

Q: Are you doing the equivalent of storing grad-enabled variables in a list or something? (A common pitfall in Torch.)
Solved!

> I was finally able to solve it following your suggestion about investigating a memory leak: it turns out JAX does not always clear the cache of compiled functions. The problem came from a structure like this one:

```python
def create_some_fn(args):
    @jax.jit
    def some_fn(other_args):
        def scan_fn(carry, x):
            ...
            return next_carry, to_stack

        _, array_of_interest = jax.lax.scan(scan_fn, ...)
        return array_of_interest
        ...
        return output

    return some_fn
```

> This allowed me to redefine (and force recompilation of) `some_fn` when needed, which was the intended behaviour. But the cache of the previous `some_fn` was not cleared when doing so.

Related issues on GitHub:

- https://github.com/google/jax/issues/7930
- https://github.com/google/jax/issues/2072
- https://github.com/google/jax/discussions/10838#discussioncomment-2827887
- https://github.com/google/jax/issues/10828

Thanks and see you!

## How to find cycles of length >= 3 in a graph?

- ??

## Using Submitit to run SLURM jobs on the cluster

Q: What are some of the best practices for running tons of jobs / using resources / etc.?

Answer:

- Use sbatch array jobs through submitit.
- Use Orion for sweeps (so that different jobs contribute to the same HPO sweep in parallel).
- Maybe use the Python API of Orion, in combination with submitit! That would probably be the coolest setup.

Q: Does submitit implement the good practices for moving things to SLURM_TMPDIR, etc.?

## Dataloader-related question

The dataset is a TensorDataset of CUDA tensors. Should I set the `pin_memory` or `num_workers` arguments of the `DataLoader` class?

Answer: No, I don't suggest using dataloader workers if the dataset already fits in memory. Workers are useful when reading from disk, or when you'd benefit from doing transforms in parallel.

Replicated the setup: https://gist.github.com/lebrice/47ea38111a00cca13b5b71c30c70c7eb

---------------

# June 7th, 2022

IDT Attendees: @normandf, satya.ortiz-gagne@mila.quebec, Soline Blanc, Guillaume Alain

## How to write unit tests for ML code

GREAT QUESTION! Will organize a tutorial on this soon.

- Doctests: a great way to document + test your little utility functions
  - Pytest can also run doctests!
- Coverage (e.g. pytest-cov) is also great
- Testing a neural network (see the sketch below):
  - Check the output shapes / dtypes / etc.
  - Check that `loss.backward()` produces a grad on all the weights that it should
- Grouping tests into classes, inheriting from other test classes
- Write tests before the components (ideally)
- When writing tests for existing code, don't just test that it does what it already does (if that makes sense?)
  - Instead, come up with the simplest input/output that ISN'T in the code, and that you 100% understand, and check that the code indeed produces that output for that given input.
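A minimal pytest sketch of the two "testing a neural network" checks above, assuming a hypothetical `MyModel` that maps `(batch, 3, 32, 32)` images to 10 logits:

```python
import pytest
import torch

from my_project.models import MyModel  # hypothetical import


@pytest.fixture
def model() -> MyModel:
    return MyModel()


def test_output_shape_and_dtype(model: MyModel):
    x = torch.randn(8, 3, 32, 32)
    y = model(x)
    assert y.shape == (8, 10)
    assert y.dtype == torch.float32


def test_backward_produces_grads_on_all_weights(model: MyModel):
    x = torch.randn(8, 3, 32, 32)
    loss = model(x).sum()
    loss.backward()
    for name, parameter in model.named_parameters():
        assert parameter.grad is not None, f"No gradient for {name}!"
```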
## How to see / track GPU utilization from outside a running job

A: SSH into the node, and then run `nvidia-smi -l 5` to re-run nvidia-smi every 5 seconds.

## Loading a dataset with `pickle.load` takes a while. Does the pickle protocol make it faster?

Yes, using the latest protocols (when saving) makes loading of numpy arrays much faster. (Guillaume Alain)

## Should I switch from PyTorch to JAX?

Maybe? Not sure. JAX code can be really, really ugly if you don't know what you're doing.

JAX and PyTorch can interop! See for example:
https://github.com/google/brax/blob/fc775ecf4470b725cf5f0cd47cc360252c49544c/brax/io/torch.py

## Untarring a big dataset on the Mila cluster

- Shared a link to [this Slack thread about untarring ImageNet]().

Solution:

- Untar the dataset directly to `SLURM_TMPDIR`.

## Request/Question: Getting the largest language models from OPT to work

Could we set up an internal "API" to run these models? https://github.com/facebookresearch/metaseq/tree/main/projects/OPT

- Needs 8 GPUs.

IDEA:

- Use an sbatch job to spin up a simple API endpoint, e.g. using https://fastapi.tiangolo.com/#example
- Use the URL of the compute node to send prompts to the language model.

## AIA team: Where to store datasets / models so they are publicly accessible?

Answer: Google Drive.

However, currently they would need one API token per person in the AIA team in order to be able to upload the models / datasets through the Python Google Drive API.

IDEA: Maybe a Google group could be set up for the members of that team, and a shared Google Drive could be created (same as in IDT) so they can upload files to it and selectively make those public. Follow up with @stephane gazaille.

----------------

# May 31st, 2022

Organizers: @normandf, satya.ortiz-gagne@mila.quebec

## Torch-Typing?

- Haven't used it; looks interesting! (Uses variadic generics, a new typing feature of Python.)

## Remote debugging with PyCharm on the Mila cluster?

https://mila-umontreal.slack.com/archives/CFAS8455H/p1654024801663419

Answer: Possible with PyCharm Pro, but the user experience is reportedly not great. VSCode remains the best option for remote debugging on the Mila / CC clusters.

There's also a tutorial/webinar on how to use VSCode on the Compute Canada clusters: https://www.youtube.com/watch?v=u9k6HikDyqk

## Flower (again): Trouble getting the different nodes in a job array to communicate

-> First, try using `srun` in two terminals side by side: in the left terminal, srun a job with `SLURM_ARRAY_TASK_ID` set to 0, and in the right one with `SLURM_ARRAY_TASK_ID` set to 1.

If that doesn't work, then the problem is probably that Flower doesn't know where the server is. One potential solution (see the sketch below):

- In the first job, dump the result of `socket.gethostname()` to a file like `server_address.txt`.
- In the other jobs, wait until the file exists, read it, and then use the value in the call to `fl.client.start_numpy_client`.
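A minimal sketch of that file-based rendezvous. The file path, port, and the `flwr` calls in the comments are illustrative; adapt them to your setup:

```python
import os
import socket
import time

SERVER_ADDRESS_FILE = "server_address.txt"  # must be on a filesystem shared by all jobs


def publish_server_address(port: int = 8080) -> str:
    """Called by the job that runs the server: writes `hostname:port` to the shared file."""
    address = f"{socket.gethostname()}:{port}"
    with open(SERVER_ADDRESS_FILE, "w") as f:
        f.write(address)
    return address


def wait_for_server_address(poll_seconds: int = 10) -> str:
    """Called by the client jobs: waits until the server has written its address."""
    while not os.path.exists(SERVER_ADDRESS_FILE):
        time.sleep(poll_seconds)
    with open(SERVER_ADDRESS_FILE) as f:
        return f.read().strip()


# Server job:  publish_server_address(), then e.g. fl.server.start_server(...)
# Client jobs: address = wait_for_server_address()
#              then e.g. fl.client.start_numpy_client(server_address=address, client=...)
```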
---------------

# May 24th, 2022

Organizer: @normandf

## Flower follow-up: multiprocessing error, "can't pickle _thread.Lock object"

Follow-up to [last week's question](https://hackmd.io/OsKKGG3QSTawhaWMMMRaeA#Using-FLWR-How-do-I-run-LOTS-of-clients). The solution from last week (using `multiprocessing.Pool`) doesn't quite work.

Solution: Use `Process`es instead:

<details>
<summary>Solution: Use the --array option of sbatch to run multiple jobs, and then use multiprocessing (with Processes rather than a Pool) to run multiple processes within each job:</summary>

```python
import multiprocessing as mp
import os

from client import run_client
from server import run_server as server_fn

total_clients = 10


def main(run_server: bool, num_clients: int):
    """Called on every node. Runs a server if `run_server` is True, and runs `num_clients` clients.

    The server and each client are separate processes.
    """
    if run_server:
        server_process = mp.Process(target=server_fn)
        server_process.start()
    else:
        server_process = None

    # List containing the client processes.
    client_processes: list[mp.Process] = []
    # Create all the clients (without starting them).
    for client_id in range(num_clients):
        client_process = mp.Process(target=run_client, args=(client_id,))
        client_processes.append(client_process)
    # Start all the clients.
    for client_process in client_processes:
        client_process.start()
    # Wait for all the clients to finish.
    for client_process in client_processes:
        client_process.join()
    # Wait for the server to finish (if we created it).
    if server_process:
        server_process.join()


if __name__ == "__main__":
    # This file gets run on multiple different nodes, as part of an sbatch job with an --array value.
    # NOTE: The code below is equivalent to this:
    # running_on_slurm = "SLURM_ARRAY_TASK_ID" in os.environ
    # if running_on_slurm:
    #     index_of_this_job = int(os.environ["SLURM_ARRAY_TASK_ID"])
    #     total_number_of_jobs = int(os.environ["SLURM_ARRAY_TASK_COUNT"])
    # else:
    #     index_of_this_job = 0
    #     total_number_of_jobs = 1

    # This is the index of this job in the job array.
    index_of_this_job = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    total_number_of_jobs = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))

    clients_per_node = total_clients // total_number_of_jobs

    if index_of_this_job == 0:
        # If we're the first array job, then we start a server + clients.
        main(run_server=True, num_clients=clients_per_node)
    elif index_of_this_job == total_number_of_jobs - 1:
        # If we're the last array job, we get a different number of clients, so the total is indeed
        # `total_clients`.
        clients_for_last_job = clients_per_node + (total_clients % total_number_of_jobs)
        main(run_server=False, num_clients=clients_for_last_job)
    else:
        # Middle "regular" job: only start `clients_per_node` clients.
        main(run_server=False, num_clients=clients_per_node)
```

</details>
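For completeness, a hypothetical way of launching the script above as a job array (the script name and resource flags are placeholders):

```console
sbatch --array=0-9 --gres=gpu:1 --cpus-per-task=4 --mem=16G --wrap "python main_flwr.py"
```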
## Any advice on using Motion Capture data in MuJoCo?

Hmm, not really. Maybe modify one of the humanoid models to add some motion capture sensor nodes? --> Referred to Florian Golemo, I think he might know something about this.

## Error when pip installing some package (d4rl_py) on the login node

--> Installing it on the login node doesn't work! You need to install it on a compute node first.

## Installing mujoco-py for a different MuJoCo version than 2.0 (e.g. 2.2)?

For MuJoCo 2.0, the `module load mujoco/2.0` command works best. However, for newer versions, you have to:

- Download the archive from https://github.com/deepmind/mujoco/releases (now that it's open source, yay!)
- Extract it somewhere
- Add the path to that extracted directory to your environment

## How best to get started with Wandb, basically?

----------------

# May 17th, 2022

Organizer: @normandf

## How to launch many jobs that each do slightly different things (e.g. different argument values)?

<details>
<summary>Solution: Use the --array option of sbatch, and choose the parameters based on the job index.</summary>

```python
import os
from functools import partial
from typing import NamedTuple


def some_complicated_function(music=True, artist=False, album=False, track=False):
    pass


# Approach #1: List of partially configured functions:
combinations_of_args = [
    partial(some_complicated_function, music=True, artist=True, album=True, track=True),
    partial(some_complicated_function, music=False, artist=True, album=True, track=True),
    partial(some_complicated_function, music=True, artist=True, album=True, track=True),
    partial(some_complicated_function, music=False, artist=True, album=True, track=True),
    partial(some_complicated_function, music=True, artist=True, album=True, track=True),
]


# Approach #2: List of named tuples:
class ArgsForFunction(NamedTuple):
    music: bool
    artist: bool
    album: bool
    track: bool


combinations_of_args = [
    ArgsForFunction(music=True, artist=True, album=True, track=True),
    ArgsForFunction(music=True, artist=True, album=False, track=True),
    ArgsForFunction(music=True, artist=True, album=False, track=True),
    ArgsForFunction(music=True, artist=True, album=False, track=True),
    ArgsForFunction(music=True, artist=True, album=True, track=True),
]


def main():
    # Get the job index from the environment variable (0 when not running in a job array).
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    args_for_function = combinations_of_args[task_id]
    some_complicated_function(**args_for_function._asdict())


if __name__ == "__main__":
    main()
```

</details>

## Using [FLWR](https://github.com/adap/flower): How do I run LOTS of clients?

<details>
<summary>Solution: Use the --array option of sbatch to run multiple jobs, and then use either multiprocessing or the --ntasks sbatch option to run multiple processes within each job.</summary>

```python
"""
FLWR: How to run lots of clients?

1.  Create multiple jobs with the --array option of sbatch, e.g. --array=0-10
2.1 Within each job, run multiple clients in parallel using multiprocessing.
2.2 Alternatively, run multiple tasks within each job using the slurm option --ntasks, and have
    each task be either a client or a server.

This assumes that the server and clients can effectively communicate with each other across nodes.
This also assumes that it's OK to not have all the clients alive at the same exact time.
"""
import multiprocessing as mp
import os


def client_fn(client_id: int):
    print(f"Hi, I'm client #{client_id}.")


def server_fn():
    print("Hi, I'm the server.")


def main(run_server: bool, num_clients: int):
    """Option 1: Do everything in Python."""
    server_process = None
    if run_server:
        server_process = mp.Process(target=server_fn)
        server_process.start()

    with mp.Pool(mp.cpu_count()) as pool:
        pool.map(client_fn, range(num_clients))

    if server_process:
        server_process.join()


total_clients = 1000

if __name__ == "__main__":
    job_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    job_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))
    clients_per_node = total_clients // job_count
    if job_index == 0:
        main(run_server=True, num_clients=clients_per_node)
    elif job_index == job_count - 1:
        clients_for_last_job = clients_per_node + (total_clients % job_count)
        main(run_server=False, num_clients=clients_for_last_job)
    else:
        # Middle "regular" job:
        main(run_server=False, num_clients=clients_per_node)
```

</details>

## If I know the batch size, how much GPU memory should I need? How can I ask for only what I need, in terms of GPU memory?

First: tough question. As far as I know, memory usage generally scales about linearly with the batch size, so maybe try a few small jobs, measure, and extrapolate (plus some headroom, just to be safe)?
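A small sketch of the "measure and extrapolate" idea, assuming a CUDA-capable PyTorch setup, with `model` and `make_batch` as placeholders for your own network and batch-construction code (the model is assumed to already be on the GPU):

```python
import torch


def peak_memory_mb(model, make_batch, batch_size: int) -> float:
    """Runs one forward/backward pass at `batch_size` and returns the peak GPU memory in MB."""
    torch.cuda.reset_peak_memory_stats()
    batch = make_batch(batch_size).cuda()
    loss = model(batch).sum()
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1e6


# e.g. compare peak_memory_mb(model, make_batch, 32) with peak_memory_mb(model, make_batch, 64),
# then extrapolate roughly linearly (plus some headroom) to the batch size you actually want.
```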
Second: Great idea. Asking for only the GPU memory you need is a great thing to do:

- It makes your job easier to schedule (as far as I (@normandf) know).
- It makes it possible for others to use the big GPUs if they need them.

I don't know how to ask for a specific amount of GPU memory with `--gres`, but you can ask for different types of GPUs, for example here on ComputeCanada: https://docs.alliancecan.ca/wiki/Using_GPUs_with_Slurm#Requesting_a_P100-16G_GPU_node_on_Cedar

## I copy my code to SLURM_TMPDIR, but it uses the modules in $HOME! Why?

Copying a virtual environment doesn't necessarily copy all the packages it uses. (I, @normandf, am not an expert on virtual environments, though.)

## Better support for torch geometric? It's very difficult to install.

Sure! I suggest making a ticket, and I will also bring this up at the dev meeting.

## General procedure for copying datasets from slower to faster directories?

The [shutil package](https://docs.python.org/3/library/shutil.html) can be quite useful for this!

-----------------