# Cosmos

-------

### Install

Cosmos runs only on Linux systems. NVIDIA has tested the installation with Ubuntu 24.04, 22.04, and 20.04. Cosmos requires Python 3.12.x.

**Step 1. Clone the cosmos-transfer1 source code**

```
git clone git@github.com:nvidia-cosmos/cosmos-transfer1.git
cd cosmos-transfer1
git submodule update --init --recursive
```

**libnvrtc check**

Check that libnvrtc.so exists:

```
find /usr -name "libnvrtc.so*" 2>/dev/null | head -n 10
```

**Step 2. Inference using conda**

This example runs on Conda, so please make sure Conda is installed.

```
# Create the cosmos-transfer1 conda environment.
conda env create --file cosmos-transfer1.yaml
# Activate the cosmos-transfer1 conda environment.
conda activate cosmos-transfer1
# Install the dependencies.
pip install -r requirements.txt

# Install vllm
pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl
export VLLM_ATTENTION_BACKEND=FLASHINFER
pip install vllm==0.9.2

# Install decord
pip install decord==0.6.0

pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/apex-0.1+cu128.torch271-cp312-cp312-linux_x86_64.whl
pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/flash_attn-2.6.3+cu128.torch271-cp312-cp312-linux_x86_64.whl
pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/natten-0.21.0+cu128.torch271-cp312-cp312-linux_x86_64.whl
pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/transformer_engine-1.13.0+cu128.torch271-cp312-cp312-linux_x86_64.whl
pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/torch-2.7.1+cu128-cp312-cp312-manylinux_2_28_x86_64.whl
pip install https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/v1.1.0/torchvision-0.22.1+cu128-cp312-cp312-manylinux_2_28_x86_64.whl

# Patch Transformer Engine linking issues in conda environments.
ln -sf $CONDA_PREFIX/lib/python3.12/site-packages/nvidia/*/include/* $CONDA_PREFIX/include/
ln -sf $CONDA_PREFIX/lib/python3.12/site-packages/nvidia/*/include/* $CONDA_PREFIX/include/python3.12

apt-get install -y libmagic1
```

**To test the environment setup for inference, run:**

```
PYTHONPATH=$(pwd) python scripts/test_environment.py
```

**Step 3. Download the checkpoints**

Generate a Hugging Face access token and set its permission to 'Read' (the default is 'Fine-grained'). Then log in to Hugging Face with the access token:

```
huggingface-cli login
```

Copy the access token shown below; you will need it when running `huggingface-cli login`.

![accesstoken](https://hackmd.io/_uploads/rygpJ8sZWe.jpg)

**Accept the [Llama-Guard-3-8B terms](https://huggingface.co/meta-llama/Llama-Guard-3-8B)**

You need to agree to share your contact information to access this model.

![llama](https://hackmd.io/_uploads/rJm0kUo-bl.jpg)

**After completing the above steps, download the Cosmos model weights.** The checkpoints require about 300 GB of storage space.
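Before starting the download, it is worth a quick sanity check that the Hugging Face login succeeded and that the target disk really has around 300 GB free. These are generic commands, not specific to Cosmos:

```
# Confirm you are logged in to Hugging Face.
huggingface-cli whoami
# Show free space on the filesystem holding the current directory.
df -h .
```

If both look good, run the download script below.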
```
PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
```

**The above are the NVIDIA Cosmos installation steps.**

---------

### Inference_cosmos_transfer1_7b

Before using NVIDIA Cosmos, let's first take a look at its architecture.

![transfer1_diagram](https://hackmd.io/_uploads/ByQex8sbZe.png)

The Cosmos model is built on a Diffusion Transformer–based architecture. As the diagram above shows, video generation is driven by supplying an input ***video*** and a ***text*** prompt, then choosing the appropriate control branch.

**Therefore, the input methods can be summarized as follows:**

![ttt](https://hackmd.io/_uploads/HJJzxLoW-g.jpg)

--------------------------------------------

**Arguments**

| Parameter | Description | Default |
|---|---|---|
| --controlnet_specs | A JSON describing the Multi-ControlNet config | JSON |
| --checkpoint_dir | Directory containing model weights | "checkpoints" |
| --tokenizer_dir | Directory containing tokenizer weights | "Cosmos-Tokenize1-CV8x8x8-720p" |
| --input_video_path | The path to the input video | None |
| --video_save_name | Output video filename for single video generation | "output" |
| --video_save_folder | Output directory for batch video generation | "outputs/" |
| --prompt | Text prompt for video generation | "The video captures a stunning, photorealistic scene with remarkable attention to detail, giving it a lifelike appearance that is almost indistinguishable from reality. It appears to be from a high-budget 4K movie, showcasing ultra-high-definition quality with impeccable resolution." |
| --negative_prompt | Negative prompt for improved quality | |
| --num_steps | Number of diffusion sampling steps | 35 |
| --guidance | CFG guidance scale | 7.0 |
| --sigma_max | The level of partial noise added to the input video, in the range [0, 80.0]. Any value of 80.0 or higher means the input video is not used and the model receives pure noise. | 70.0 |
| --blur_strength | The strength of blurring when preparing the control input for the vis controlnet. Valid values are 'very_low', 'low', 'medium', 'high', and 'very_high'. | 'medium' |
| --canny_threshold | The threshold for Canny edge detection when preparing the control input for the edge controlnet. A lower threshold means more edges are detected. Valid values are 'very_low', 'low', 'medium', 'high', and 'very_high'. | 'medium' |
| --fps | Output frames per second | 24 |
| --seed | Random seed | 1 |
| --offload_text_encoder_model | Offload the text encoder after inference; used for low-memory GPUs | False |
| --offload_guardrail_models | Offload the guardrail models after inference; used for low-memory GPUs | False |
| --upsample_prompt | Upsample the prompt using the prompt upsampler model | False |
| --offload_prompt_upsampler | Offload the prompt upsampler model after inference; used for low-memory GPUs | False |
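For reference, the file passed to --controlnet_specs is a small JSON document keyed by controlnet name (edge, vis, seg, depth, ...). The exact schema is defined by the example files under `assets/`, so treat the following as an illustrative sketch rather than the authoritative format; it only assumes an `edge` branch with the `control_weight` field discussed later:

```
# Write a minimal edge-only spec (illustrative; see assets/inference_cosmos_transfer1_*.json for the real schema).
cat > my_edge_spec.json <<'EOF'
{
    "edge": {
        "control_weight": 1.0
    }
}
EOF
```

You would then point --controlnet_specs at `my_edge_spec.json` when invoking `transfer.py`.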
**!! Notice that to run Cosmos on GPUs with low memory capacity, it is necessary to offload portions of the model when required; otherwise, the system may encounter out-of-memory issues.**

------------------

**Next, we'll run Cosmos-Transfer for inference.**

These are the sample commands provided by NVIDIA Cosmos.

```
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir checkpoints \
    --input_video_path path/to/input_video.mp4 \
    --video_save_name output_video \
    --controlnet_specs spec.json \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU
```

You may define your desired configuration in a ***JSON file*** and supply it through the --controlnet_specs parameter. Example:

```
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU
```

------------------------

# Using G5 camera video as input

**1. Generating driving scenarios**

With the basic tutorial above, we can now feed video from the G5 camera into Cosmos to generate output videos.

![edge](https://hackmd.io/_uploads/ryfQgUoWbg.png)

The figure above illustrates that when edge control is enabled, the input video is automatically processed through a Canny edge detector to produce an edge-based control video; users only need to configure the threshold parameter in the arguments.

```
export CUDA_VISIBLE_DEVICES=0
export NUM_GPU=1
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 \
    cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir ./checkpoints \
    --input_video_path /home/user/Downloads/output_video.mp4 \
    --video_save_name rgb_output \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_guardrail_models \
    --offload_diffusion_transformer \
    --offload_text_encoder_model \
    --offload_prompt_upsampler \
    --num_gpus $NUM_GPU \
    --prompt "let the car driving in heavy rainy and must have waterlogged on the road, and must have raindrops falling ,the camera shot should have damp feel"
```

Since the test was performed on a single L40S GPU, which has only 48 GB of VRAM, you can see that I offloaded many of the model components. You can adjust these settings based on the capabilities of your own GPU.
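If you are unsure whether a given combination of offload flags will fit in your GPU's memory, you can watch VRAM usage from a second terminal while `transfer.py` runs. This is plain `nvidia-smi` polling, nothing Cosmos-specific:

```
# Poll GPU memory usage every 5 seconds; stop with Ctrl+C.
nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total --format=csv -l 5
```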
Under this setup, generating a single video takes approximately 25 minutes.

![rain](https://hackmd.io/_uploads/HJn7eIjWWx.jpg)

---------

**2. Generate objects**

Because edge control preserves the overall structure of the scene, modifying or generating new objects often requires altering that structure. By adjusting the ***control_weight***, you can increase the model's flexibility and allow greater creativity in the generated output.

The control_weight parameter is a number in the range [0, 1] that controls how strongly the controlnet branch affects the output of the model. The larger the value (closer to 1.0), the more strongly the generated video adheres to the controlnet input. The control_weight can also be modified directly in the **JSON** configuration file.

![car](https://hackmd.io/_uploads/HkPEg8jWZx.png)

The figure above demonstrates that lower control weight values provide greater generative freedom, at the cost of potentially altering or disrupting the original scene structure.

-----------------------

**3. Improve quality**

The generated videos are 720p, and NVIDIA Cosmos provides a 4K upscaler for improving video quality. This allows you to further enhance the resolution and overall visual fidelity of the output.

```
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --input_video_path /home/user/Downloads/output_video.mp4 \
    --video_save_folder outputs/inference_upscaler \
    --controlnet_specs assets/inference_upscaler.json \
    --num_steps 10 \
    --offload_text_encoder_model \
    --num_gpus $NUM_GPU
```

![4k](https://hackmd.io/_uploads/S1hzkAHfZl.jpg)

-------

**Batch generation**

Videos can also be generated in batches by supplying a JSONL file through the --batch_input_path argument. Each line in the JSONL must contain a visual_input field, equivalent to the --input_video_path argument in the single-video case. You must first prepare your JSONL file, for example:

```
{"visual_input": "path/to/video0.mp4", "prompt": "heavy rain"}
{"visual_input": "path/to/video1.mp4", "prompt": "leftside forest get fire", "control_overrides": {"seg": {"input_control": "path/to/video1_seg.mp4"}, "depth": {"input_control": null}}}
{"visual_input": "path/to/video2.mp4", "prompt": "snowy weather", "control_overrides": {"seg": {"input_control": "path/to/video2_seg.mp4"}, "depth": {"input_control": "path/to/video2_depth.mp4"}}}
```

```
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example2_uniform_weights \
    --controlnet_specs assets/inference_cosmos_transfer1_uniform_weights.json \
    --offload_text_encoder_model \
    --batch_input_path path/to/batch_input_path.json \
    --num_gpus $NUM_GPU
```
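If your clips sit in a single folder, you do not have to write the JSONL by hand. The loop below is a minimal sketch that assumes a hypothetical `./clips` directory and one shared prompt; adjust paths, prompts, and any control_overrides to your needs:

```
# Build batch_input.jsonl with one entry per .mp4 under ./clips (hypothetical folder).
rm -f batch_input.jsonl
for f in ./clips/*.mp4; do
    printf '{"visual_input": "%s", "prompt": "heavy rain"}\n' "$(realpath "$f")" >> batch_input.jsonl
done
```

Then pass the generated file via --batch_input_path batch_input.jsonl.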