The Cosmos-1.0-Diffusion-7B and 14B models, in both Text2World and Video2World variants, are NVIDIA's diffusion models for video generation. Text2World creates video from a text prompt, while Video2World generates new frames conditioned on an input image or video. This hack runs inference across two Jetson AGX Orin 64GB kits at once: because the inference code has no MPI support and the batch size is fixed at 1, the batch itself cannot be split, so the work is divided by hand along the classifier-free guidance axis, computing the conditional and unconditional predictions on separate devices. Here's how it works and how you can try it yourself.
The Idea Behind the Hack
AI video generation often relies on diffusion models that iteratively refine random noise into visuals. For the Cosmos model, inference starts with a noisy latent representation of all 121 video frames. This latent is refined over 35 denoising steps, each removing more noise to reveal the video content.
Classifier-free guidance uses two predictions:
- Conditional Prediction: Uses the text prompt to guide generation.
- Unconditional Prediction: Ignores the prompt and serves as a baseline; the guidance step extrapolates away from it toward the conditional prediction to strengthen prompt adherence.
These two predictions are split across the two Jetson Orin devices, with Pyro5 handling the remote calls and msgpack-numpy serializing the tensors. Each denoising step then runs the conditional and unconditional passes in parallel, processing all 121 frames in one pass, and works around both the fixed batch size of 1 and the lack of MPI support.
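For reference, here is a minimal sketch of the classifier-free guidance combination applied at each denoising step; it mirrors the raw_x0 = cond_x0 + guidance * (cond_x0 - uncond_x0) line in the source code shown later. The tensor shapes and the guidance value are illustrative only:
import torch

def combine_cfg(cond_x0: torch.Tensor, uncond_x0: torch.Tensor, guidance: float) -> torch.Tensor:
    # Extrapolate from the unconditional prediction toward the conditional one.
    # guidance = 0 reproduces the conditional prediction; larger values push the
    # result further toward what the prompt asked for.
    return cond_x0 + guidance * (cond_x0 - uncond_x0)

# Illustrative shape only: (batch=1, channels, latent_frames, height, width)
cond_x0 = torch.randn(1, 16, 16, 88, 160)
uncond_x0 = torch.randn(1, 16, 16, 88, 160)
guided = combine_cfg(cond_x0, uncond_x0, guidance=7.0)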
Why Jetson AGX Orin?
The Jetson AGX Orin is well-suited for this project due to its:
- 64GB of RAM: Sufficient for managing memory-intensive diffusion models smoothly.
- Ampere GPU architecture: Offers up to 275 TOPS of AI performance for demanding tasks.
- Energy efficiency: Suitable for edge computing with lower power consumption.
Using two Orin devices simplifies task distribution without requiring complex setups.
Hardware Setup
- Get two Jetson AGX Orin 64GB boards.
- Install M.2 SSDs on both Orins for faster storage.
- Connect the boards with a Cat7 cable or similar high-speed network setup.
- Power the devices with the included USB Type-C power bricks.
Prepare the Jetson AGX Orins:
- Install NVIDIA JetPack 6.2 SDK on both devices to set up the required libraries and drivers.
- Configure the Ethernet network so that both devices are on the same subnet. I used 192.168.1.1 for the main device and 192.168.1.2 for the helper device (one way to set the static addresses is sketched after this list).
- A direct board-to-board connection may require turning MDI-X on, or using a hub/switch/crossover cable instead:
sudo ethtool -s eno1 mdix on
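For reference, one way to assign the static addresses persistently with NetworkManager. This is a sketch that assumes the wired interface is named eno1 and uses an invented connection name cosmos-link; run the mirror of it with 192.168.1.2 on the helper:
sudo nmcli con add type ethernet ifname eno1 con-name cosmos-link ipv4.method manual ipv4.addresses 192.168.1.1/24
sudo nmcli con up cosmos-link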
Clone the Repository:
git clone https://github.com/andrei-ace/Cosmos-2xJetson.git
Download Models:
pip3 install 'huggingface_hub[cli,torch]' transformers
pip3 install --upgrade scipy
huggingface-cli login
PYTHONPATH=$(pwd) python3 cosmos1/scripts/download_diffusion.py --model_sizes 7B --model_types Text2World
PYTHONPATH=$(pwd) python3 cosmos1/scripts/download_t5.py
Build the custom Docker image:
sudo docker build . -t t2w_2xorin -f Dockerfile.t2w_2xorin
Make sure the correct CUDA compute capability is set in the Dockerfile (8.7 for the AGX Orin):
ENV TORCH_CUDA_ARCH_LIST="8.7"
ENV NVTE_CUDA_ARCHS="87"
If these settings are not configured, you may encounter errors like the one below. This happened to me on JetPack 6.2 when using older Jetson containers with the cross-attention operations from transformer_engine_torch, which this model relies on:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Test the Docker Container:
- Start two docker containers on the main device:
sudo docker run --runtime nvidia -it --rm \
--network=host \
--volume ./checkpoints:/workspace/checkpoints \
--volume ./outputs:/workspace/outputs t2w_2xorin
- On the first container, run the remote helper:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/remote_helper.py
- On the second container, test the remote helper:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/test_remote_helper.py
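If you want to poke the helper by hand, the call boils down to packing tensors with msgpack-numpy and invoking the exposed method through a Pyro5 proxy. The sketch below is illustrative rather than the actual test script: the tensor shapes and the empty condition payload are placeholders, and the serializer line is an assumption so that raw bytes pass through unchanged (match whatever the repo actually configures):
import msgpack_numpy as m
import numpy as np
import Pyro5.api

Pyro5.config.SERIALIZER = "msgpack"  # assumption: keeps the packed bytes intact over the wire

# Placeholder inputs; the real latent shape comes from the Cosmos tokenizer.
noise_x = np.random.randn(1, 16, 16, 88, 160).astype(np.float32)
sigma = np.array([80.0], dtype=np.float32)
condition = {}  # the real call sends the fields of a BaseVideoCondition

with Pyro5.api.Proxy("PYRO:remote_denoiser@192.168.1.2:9090") as denoiser:
    x0_bytes = denoiser.remote_denoise(m.packb(noise_x), m.packb(sigma),
                                       "BaseVideoCondition", m.packb(condition))
    print("received x0 with shape", np.asarray(m.unpackb(x0_bytes)).shape)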
Check Results:
- Using the 50W power mode: ~160 seconds per iteration.
- Using MAXN power mode, each denoising iteration takes roughly 105 seconds, which matches what other users have reported, though MAXN may also produce over-current warnings. Since a single-device run executes the two predictions (conditional and unconditional) back to back for each of the 35 denoising steps, one video works out to about 35 × 2 × 105 s ≈ 7,350 s, or roughly 2 hours in total.
Copy the Cosmos-1.0-Diffusion-7B-Text2World weights from the checkpoints directory to the second Jetson.
Copy the t2w_2xorin Docker image to the second Jetson using docker save and docker load, as sketched below.
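A sketch of both copies over the network, assuming the helper is reachable at 192.168.1.2 as user nvidia and the repo lives in ~/Cosmos-2xJetson on both boards (adjust paths and user to your setup):
# On the main Orin: copy the diffusion weights to the helper
rsync -avh checkpoints/Cosmos-1.0-Diffusion-7B-Text2World nvidia@192.168.1.2:~/Cosmos-2xJetson/checkpoints/
# On the main Orin: export and copy the Docker image
sudo docker save -o t2w_2xorin.tar t2w_2xorin
scp t2w_2xorin.tar nvidia@192.168.1.2:~/
# On the helper: import the image
sudo docker load -i ~/t2w_2xorin.tar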
Source code
- This function runs a single denoising prediction and returns the predicted clean latent (x0); it is the work that the first Jetson Orin offloads to the second.
@torch.no_grad()
def remote_denoise(self,
                   noise_x_bytes: bytes,
                   sigma_bytes: bytes,
                   condition_type: str,
                   condition_bytes: bytes
                   ) -> bytes:
    try:
        start_time = time.time()  # Start timing
        noise_x = torch.tensor(m.unpackb(noise_x_bytes), device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
        sigma = torch.tensor(m.unpackb(sigma_bytes), device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
        if condition_type == "BaseVideoCondition":
            condition_dict = m.unpackb(condition_bytes)
            # Convert numpy arrays to tensors in condition_dict
            condition_dict = {
                key: torch.tensor(val, device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
                if isinstance(val, np.ndarray) else val
                for key, val in condition_dict.items()
            }
            condition = BaseVideoCondition(**condition_dict)
        elif condition_type == "VideoExtendCondition":
            condition_dict = m.unpackb(condition_bytes)
            # Convert numpy arrays to tensors in condition_dict
            condition_dict = {
                key: torch.tensor(val, device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
                if isinstance(val, np.ndarray) else val
                for key, val in condition_dict.items()
            }
            condition = VideoExtendCondition(**condition_dict)
        else:
            raise ValueError(f"Unknown condition type: {condition_type}")
        x0 = self.model.denoise(noise_x, sigma, condition=condition).x0
        x0_bytes = x0.float().cpu().numpy()
        end_time = time.time()  # End timing
        print(f"Denoising took {end_time - start_time:.2f} seconds")
        return m.packb(x0_bytes)
    except Exception as e:
        import traceback
        print(f"Error in remote_denoise: {e}")
        print("Full stack trace:")
        print(traceback.format_exc())
        return None
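For context, this method is served over the network by remote_helper.py. The exact wiring lives in the repo; the sketch below only illustrates how a Pyro5 daemon could expose such a method (the class name is invented and the denoising is stubbed out as an echo), with the object id and port matching the PYRO:remote_denoiser@...:9090 URI used later:
import msgpack_numpy as m
import Pyro5.api

@Pyro5.api.expose
class RemoteDenoiser:
    """Illustrative stand-in for the helper class in remote_helper.py."""
    def remote_denoise(self, noise_x_bytes, sigma_bytes, condition_type, condition_bytes):
        noise_x = m.unpackb(noise_x_bytes)  # numpy array received from the main Orin
        # The real helper rebuilds the condition object and calls self.model.denoise(...)
        return m.packb(noise_x)             # echo back, for illustration only

daemon = Pyro5.api.Daemon(host="192.168.1.2", port=9090)        # helper's address
daemon.register(RemoteDenoiser(), objectId="remote_denoiser")   # id must match the PYRO: URI
print("remote_denoiser ready")
daemon.requestLoop()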
- This is the main denoising step inside the Cosmos-1.0 model. When a remote denoiser is configured, the unconditional prediction is dispatched to the second Jetson Orin via Pyro5 (with msgpack-numpy handling tensor serialization) while the conditional prediction runs locally, so the two execute in parallel.
def x0_fn(noise_x: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    log.info(f"Running for noise level: {sigma.item()}")
    torch.cuda.synchronize()  # Ensure all CUDA operations are complete
    start_time = time.time()  # Start timing
    if remote_denoiser_uri is not None:
        with ThreadPoolExecutor(max_workers=2) as executor:
            torch.cuda.synchronize()  # Ensure all CUDA operations are complete
            start_time_cond = time.time()
            cond_x0_future = executor.submit(local_thread, noise_x, sigma, condition)
            cond_x0_future.add_done_callback(lambda fut: log_time(fut, start_time_cond, "Conditioned"))
            torch.cuda.synchronize()  # Ensure all CUDA operations are complete
            start_time_uncond = time.time()
            uncond_x0_future = executor.submit(remote_thread, noise_x, sigma, uncondition)
            uncond_x0_future.add_done_callback(lambda fut: log_time(fut, start_time_uncond, "Unconditioned"))
            cond_x0 = cond_x0_future.result()
            uncond_x0 = uncond_x0_future.result()
    else:
        cond_x0 = self.denoise(noise_x, sigma, condition).x0
        uncond_x0 = self.denoise(noise_x, sigma, uncondition).x0
    raw_x0 = cond_x0 + guidance * (cond_x0 - uncond_x0)
    if "guided_image" in data_batch:
        # replacement trick that enables inpainting with base model
        assert "guided_mask" in data_batch, "guided_mask should be in data_batch if guided_image is present"
        guide_image = data_batch["guided_image"]
        guide_mask = data_batch["guided_mask"]
        raw_x0 = guide_mask * guide_image + (1 - guide_mask) * raw_x0
    torch.cuda.synchronize()  # Ensure all CUDA operations are complete
    end_time = time.time()  # End timing
    log.info(f"Denoising operation total time: {end_time - start_time:.2f} seconds.")
    return raw_x0
The tensors being transferred are relatively small, around 20MB in size when serialized as float32 with numpy. Given the computational intensity of the denoising steps, the overhead introduced by serialization and remote invocation through Pyro5 and msgpack-numpy is minimal. This allows the offloading process to remain efficient without significantly impacting performance.
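If you want to sanity-check that figure, the serialized payload size is easy to measure. The shape below is purely illustrative (the real latent shape depends on the tokenizer and resolution, so the number printed here will not match the ~20MB figure exactly):
import msgpack_numpy as m
import numpy as np

x = np.random.randn(1, 16, 16, 88, 160).astype(np.float32)  # illustrative latent shape
payload = m.packb(x)
print(f"serialized payload: {len(payload) / 1e6:.1f} MB")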
Running the Project
- Start the container on each Jetson Orin:
sudo docker run --runtime nvidia -it --rm \
--network=host \
--volume ./checkpoints:/workspace/checkpoints \
--volume ./outputs:/workspace/outputs t2w_2xorin
- On the helper (the second Jetson Orin), start the remote denoiser:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/remote_helper.py
- On the main Orin, run text2world.py with --remote_denoiser_uri pointing at the helper (192.168.1.2 is the IP of the second Jetson Orin):
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/text2world.py --checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World --prompt "$PROMPT" \
--video_save_name Cosmos-1.0-Diffusion-7B-Text2World_remote \
--disable_prompt_upsampler \
--offload_guardrail_models \
--offload_prompt_upsampler \
--remote_denoiser_uri "PYRO:remote_denoiser@192.168.1.2:9090"
- Optionally, run the model on a single board for comparison:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/text2world.py --checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World --prompt "$PROMPT" \
--video_save_name Cosmos-1.0-Diffusion-7B-Text2World \
--disable_prompt_upsampler \
--offload_guardrail_models \
--offload_prompt_upsampler
Comparing the Results
Running the model on a single Jetson Orin board:
[01-18 00:06:11|INFO|cosmos1/models/diffusion/model/model_t2w.py:222:x0_fn] Running for noise level: 0.0020000000949949026
[01-18 00:09:39|INFO|cosmos1/models/diffusion/model/model_t2w.py:254:x0_fn] Denoising operation total time: 207.94 seconds.
[01-18 00:10:23|INFO|cosmos1/models/diffusion/inference/world_generation_pipeline.py:367:generate] Finish generation
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[01-18 00:10:25|INFO|cosmos1/models/diffusion/inference/text2world.py:162:demo] Saved video to outputs/Cosmos-1.0-Diffusion-7B-Text2World.mp4
[01-18 00:10:25|INFO|cosmos1/models/diffusion/inference/text2world.py:163:demo] Saved prompt to outputs/Cosmos-1.0-Diffusion-7B-Text2World.txt
real 129m56.890s
user 18m8.579s
sys 47m53.788s
Running the model with the helper enabled roughly halves the wall-clock time (about 65 minutes versus 130 minutes), since the two predictions run in parallel:
[01-18 01:23:48|INFO|cosmos1/models/diffusion/model/model_t2w.py:222:x0_fn] Running for noise level: 0.0020000000949949026
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:191:log_time] Conditioned execution time: 103.60 seconds
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:191:log_time] Unconditioned execution time: 103.97 seconds
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:254:x0_fn] Denoising operation total time: 103.98 seconds.
[01-18 01:26:16|INFO|cosmos1/models/diffusion/inference/world_generation_pipeline.py:367:generate] Finish generation
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[01-18 01:26:18|INFO|cosmos1/models/diffusion/inference/text2world.py:162:demo] Saved video to outputs/Cosmos-1.0-Diffusion-7B-Text2World_remote.mp4
[01-18 01:26:18|INFO|cosmos1/models/diffusion/inference/text2world.py:163:demo] Saved prompt to outputs/Cosmos-1.0-Diffusion-7B-Text2World_remote.txt
real 65m23.528s
user 9m52.950s
sys 22m8.084s
Consistency of Results
Videos generated using a single Jetson Orin, as originally intended, are identical to those produced by splitting tasks across two devices, provided the same seed is used. All stochastic elements of the model remain on the main device, ensuring consistent outputs across different configurations.
Conclusion
This project demonstrates a straightforward technique for improving the performance of Cosmos-1.0-Diffusion models by using two Jetson AGX Orin devices. The method exploits the parallelism inherent in classifier-free guidance, which makes it effective for diffusion-based models while remaining simple to implement, though by construction it scales to exactly two devices. It will not apply to every AI model, but it is a practical way to roughly halve generation time for diffusion-based models like Cosmos-1.0.