The Cosmos-1.0-Diffusion-7B and 14B models, in both Text2World and Video2World variants, are NVIDIA's diffusion models for video generation. Text2World creates video from a text prompt, while Video2World generates new frames conditioned on an input image or video. This hack runs inference across two Jetson AGX Orin 64GB kits at once: because the inference code has no MPI support and the batch size is fixed at 1, the batch itself cannot be split, so the work is divided by hand along the classifier-free guidance axis, computing the conditional and unconditional predictions on separate devices. Here's how it works and how you can try it yourself.
The Idea Behind the Hack
AI video generation often relies on diffusion models that iteratively refine random noise into visuals. For the Cosmos model, inference starts with a noisy latent representation of all 121 video frames. This latent is refined over 35 denoising steps, each removing more noise to reveal the video content.
Classifier-free guidance uses two predictions:
- Conditional Prediction: Uses the text prompt to guide generation.
- Unconditional Prediction: Ignores the prompt and serves as a baseline; the guidance step extrapolates away from it toward the conditional prediction to strengthen prompt adherence.
These two predictions are split across the two Jetson Orin devices, with Pyro5 handling the remote calls and msgpack-numpy serializing the tensors. Each denoising step then runs the conditional and unconditional passes in parallel, processing all 121 frames in one pass, and works around both the fixed batch size of 1 and the lack of MPI support.
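For reference, here is a minimal sketch of the classifier-free guidance combination applied at each denoising step; it mirrors the raw_x0 = cond_x0 + guidance * (cond_x0 - uncond_x0) line in the source code shown later. The tensor shapes and the guidance value are illustrative only:
import torch

def combine_cfg(cond_x0: torch.Tensor, uncond_x0: torch.Tensor, guidance: float) -> torch.Tensor:
    # Extrapolate from the unconditional prediction toward the conditional one.
    # guidance = 0 reproduces the conditional prediction; larger values push the
    # result further toward what the prompt asked for.
    return cond_x0 + guidance * (cond_x0 - uncond_x0)

# Illustrative shape only: (batch=1, channels, latent_frames, height, width)
cond_x0 = torch.randn(1, 16, 16, 88, 160)
uncond_x0 = torch.randn(1, 16, 16, 88, 160)
guided = combine_cfg(cond_x0, uncond_x0, guidance=7.0)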
Why Jetson AGX Orin?
The Jetson AGX Orin is well-suited for this project due to its:
- 64GB of RAM: Sufficient for managing memory-intensive diffusion models smoothly.
- Ampere GPU architecture: Offers up to 275 TOPS of AI performance for demanding tasks.
- Energy efficiency: Suitable for edge computing with lower power consumption.
Using two Orin devices simplifies task distribution without requiring complex setups.
Hardware Setup
- Get two Jetson AGX Orin 64GB boards.
- Install M.2 SSDs on both Orins for faster storage.
- Connect the boards with a Cat7 cable or similar high-speed network setup.
- Power the devices with the included USB Type-C power bricks.
Prepare the Jetson AGX Orins:
- Install NVIDIA JetPack 6.2 SDK on both devices to set up the required libraries and drivers.
- Configure the Ethernet network so that both devices are on the same subnet. I used 192.168.1.1 for the main device and 192.168.1.2 for the helper device (one way to set the static addresses is sketched after this list).
- A direct board-to-board connection may require turning MDI-X on, or using a hub/switch/crossover cable instead:
sudo ethtool -s eno1 mdix on
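For reference, one way to assign the static addresses persistently with NetworkManager. This is a sketch that assumes the wired interface is named eno1 and uses an invented connection name cosmos-link; run the mirror of it with 192.168.1.2 on the helper:
sudo nmcli con add type ethernet ifname eno1 con-name cosmos-link ipv4.method manual ipv4.addresses 192.168.1.1/24
sudo nmcli con up cosmos-link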
Clone the Repository:
git clone https://github.com/andrei-ace/Cosmos-2xJetson.git
Download Models:
pip3 install 'huggingface_hub[cli,torch]' transformers
pip3 install --upgrade scipy
huggingface-cli login
PYTHONPATH=$(pwd) python3 cosmos1/scripts/download_diffusion.py --model_sizes 7B --model_types Text2World
PYTHONPATH=$(pwd) python3 cosmos1/scripts/download_t5.py
Build the custom Docker image:
sudo docker build . -t t2w_2xorin -f Dockerfile.t2w_2xorin
Make sure the correct CUDA compute capability is set in the Dockerfile (8.7 for the AGX Orin):
ENV TORCH_CUDA_ARCH_LIST="8.7"
ENV NVTE_CUDA_ARCHS="87"
If these settings are not configured, you may encounter errors like the one below. This happened to me on JetPack 6.2 when using older Jetson containers with the cross-attention operations from transformer_engine_torch, which this model relies on:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Test the Docker Container:
- Start two docker containers on the main device:
sudo docker run --runtime nvidia -it --rm \
--network=host \
--volume ./checkpoints:/workspace/checkpoints \
--volume ./outputs:/workspace/outputs t2w_2xorin
- On the first container, run the remote helper:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/remote_helper.py
- On the second container, test the remote helper:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/test_remote_helper.py
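If you want to poke the helper by hand, the call boils down to packing tensors with msgpack-numpy and invoking the exposed method through a Pyro5 proxy. The sketch below is illustrative rather than the actual test script: the tensor shapes and the empty condition payload are placeholders, and the serializer line is an assumption so that raw bytes pass through unchanged (match whatever the repo actually configures):
import msgpack_numpy as m
import numpy as np
import Pyro5.api

Pyro5.config.SERIALIZER = "msgpack"  # assumption: keeps the packed bytes intact over the wire

# Placeholder inputs; the real latent shape comes from the Cosmos tokenizer.
noise_x = np.random.randn(1, 16, 16, 88, 160).astype(np.float32)
sigma = np.array([80.0], dtype=np.float32)
condition = {}  # the real call sends the fields of a BaseVideoCondition

with Pyro5.api.Proxy("PYRO:remote_denoiser@192.168.1.2:9090") as denoiser:
    x0_bytes = denoiser.remote_denoise(m.packb(noise_x), m.packb(sigma),
                                       "BaseVideoCondition", m.packb(condition))
    print("received x0 with shape", np.asarray(m.unpackb(x0_bytes)).shape)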
Check Results:
- Using the 50W power mode: ~160 seconds per iteration.
- Using MAXN power mode, each denoising iteration takes roughly 105 seconds, which matches what other users have reported, though MAXN may also produce over-current warnings. Since a single-device run executes the two predictions (conditional and unconditional) back to back for each of the 35 denoising steps, one video works out to about 35 × 2 × 105 s ≈ 7,350 s, or roughly 2 hours in total.
Copy the Cosmos-1.0-Diffusion-7B-Text2World weights from the checkpoints directory to the second Jetson.
Copy the t2w_2xorin Docker image to the second Jetson using docker save and docker load, as sketched below.
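A sketch of both copies over the network, assuming the helper is reachable at 192.168.1.2 as user nvidia and the repo lives in ~/Cosmos-2xJetson on both boards (adjust paths and user to your setup):
# On the main Orin: copy the diffusion weights to the helper
rsync -avh checkpoints/Cosmos-1.0-Diffusion-7B-Text2World nvidia@192.168.1.2:~/Cosmos-2xJetson/checkpoints/
# On the main Orin: export and copy the Docker image
sudo docker save -o t2w_2xorin.tar t2w_2xorin
scp t2w_2xorin.tar nvidia@192.168.1.2:~/
# On the helper: import the image
sudo docker load -i ~/t2w_2xorin.tar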
Source code
- This function runs a single denoising prediction and returns the predicted clean latent (x0); it is the work that the first Jetson Orin offloads to the second.
@torch.no_grad()
def remote_denoise(self,
                   noise_x_bytes: bytes,
                   sigma_bytes: bytes,
                   condition_type: str,
                   condition_bytes: bytes
                   ) -> bytes:
    try:
        start_time = time.time()  # Start timing
        noise_x = torch.tensor(m.unpackb(noise_x_bytes), device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
        sigma = torch.tensor(m.unpackb(sigma_bytes), device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
        if condition_type == "BaseVideoCondition":
            condition_dict = m.unpackb(condition_bytes)
            # Convert numpy arrays to tensors in condition_dict
            condition_dict = {
                key: torch.tensor(val, device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
                if isinstance(val, np.ndarray) else val
                for key, val in condition_dict.items()
            }
            condition = BaseVideoCondition(**condition_dict)
        elif condition_type == "VideoExtendCondition":
            condition_dict = m.unpackb(condition_bytes)
            # Convert numpy arrays to tensors in condition_dict
            condition_dict = {
                key: torch.tensor(val, device=self.model.tensor_kwargs["device"], dtype=self.model.tensor_kwargs["dtype"])
                if isinstance(val, np.ndarray) else val
                for key, val in condition_dict.items()
            }
            condition = VideoExtendCondition(**condition_dict)
        else:
            raise ValueError(f"Unknown condition type: {condition_type}")
        x0 = self.model.denoise(noise_x, sigma, condition=condition).x0
        x0_bytes = x0.float().cpu().numpy()
        end_time = time.time()  # End timing
        print(f"Denoising took {end_time - start_time:.2f} seconds")
        return m.packb(x0_bytes)
    except Exception as e:
        import traceback
        print(f"Error in remote_denoise: {e}")
        print("Full stack trace:")
        print(traceback.format_exc())
        return None
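For context, this method is served over the network by remote_helper.py. The exact wiring lives in the repo; the sketch below only illustrates how a Pyro5 daemon could expose such a method (the class name is invented and the denoising is stubbed out as an echo), with the object id and port matching the PYRO:remote_denoiser@...:9090 URI used later:
import msgpack_numpy as m
import Pyro5.api

@Pyro5.api.expose
class RemoteDenoiser:
    """Illustrative stand-in for the helper class in remote_helper.py."""
    def remote_denoise(self, noise_x_bytes, sigma_bytes, condition_type, condition_bytes):
        noise_x = m.unpackb(noise_x_bytes)  # numpy array received from the main Orin
        # The real helper rebuilds the condition object and calls self.model.denoise(...)
        return m.packb(noise_x)             # echo back, for illustration only

daemon = Pyro5.api.Daemon(host="192.168.1.2", port=9090)        # helper's address
daemon.register(RemoteDenoiser(), objectId="remote_denoiser")   # id must match the PYRO: URI
print("remote_denoiser ready")
daemon.requestLoop()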
- This is the main denoising step inside the Cosmos-1.0 model. When a remote denoiser is configured, the unconditional prediction is dispatched to the second Jetson Orin via Pyro5 (with msgpack-numpy handling tensor serialization) while the conditional prediction runs locally, so the two execute in parallel.
def x0_fn(noise_x: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    log.info(f"Running for noise level: {sigma.item()}")
    torch.cuda.synchronize()  # Ensure all CUDA operations are complete
    start_time = time.time()  # Start timing
    if remote_denoiser_uri is not None:
        with ThreadPoolExecutor(max_workers=2) as executor:
            torch.cuda.synchronize()  # Ensure all CUDA operations are complete
            start_time_cond = time.time()
            cond_x0_future = executor.submit(local_thread, noise_x, sigma, condition)
            cond_x0_future.add_done_callback(lambda fut: log_time(fut, start_time_cond, "Conditioned"))
            torch.cuda.synchronize()  # Ensure all CUDA operations are complete
            start_time_uncond = time.time()
            uncond_x0_future = executor.submit(remote_thread, noise_x, sigma, uncondition)
            uncond_x0_future.add_done_callback(lambda fut: log_time(fut, start_time_uncond, "Unconditioned"))
            cond_x0 = cond_x0_future.result()
            uncond_x0 = uncond_x0_future.result()
    else:
        cond_x0 = self.denoise(noise_x, sigma, condition).x0
        uncond_x0 = self.denoise(noise_x, sigma, uncondition).x0
    raw_x0 = cond_x0 + guidance * (cond_x0 - uncond_x0)
    if "guided_image" in data_batch:
        # replacement trick that enables inpainting with base model
        assert "guided_mask" in data_batch, "guided_mask should be in data_batch if guided_image is present"
        guide_image = data_batch["guided_image"]
        guide_mask = data_batch["guided_mask"]
        raw_x0 = guide_mask * guide_image + (1 - guide_mask) * raw_x0
    torch.cuda.synchronize()  # Ensure all CUDA operations are complete
    end_time = time.time()  # End timing
    log.info(f"Denoising operation total time: {end_time - start_time:.2f} seconds.")
    return raw_x0
The tensors being transferred are relatively small, around 20MB in size when serialized as float32 with numpy. Given the computational intensity of the denoising steps, the overhead introduced by serialization and remote invocation through Pyro5 and msgpack-numpy is minimal. This allows the offloading process to remain efficient without significantly impacting performance.
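If you want to sanity-check that figure, the serialized payload size is easy to measure. The shape below is purely illustrative (the real latent shape depends on the tokenizer and resolution, so the number printed here will not match the ~20MB figure exactly):
import msgpack_numpy as m
import numpy as np

x = np.random.randn(1, 16, 16, 88, 160).astype(np.float32)  # illustrative latent shape
payload = m.packb(x)
print(f"serialized payload: {len(payload) / 1e6:.1f} MB")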
Running the Project
- Start the container on each Jetson Orin:
sudo docker run --runtime nvidia -it --rm \
--network=host \
--volume ./checkpoints:/workspace/checkpoints \
--volume ./outputs:/workspace/outputs t2w_2xorin
- On the helper (the second Jetson Orin), start the remote denoiser:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/remote_helper.py
- On the main Orin, run text2world.py with --remote_denoiser_uri pointing at the helper (192.168.1.2 is the IP of the second Jetson Orin):
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. \
The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. \
A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, \
suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. \
The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of \
field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/text2world.py --checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World --prompt "$PROMPT" \
--video_save_name Cosmos-1.0-Diffusion-7B-Text2World_remote \
--disable_prompt_upsampler \
--offload_guardrail_models \
--offload_prompt_upsampler \
--remote_denoiser_uri "PYRO:remote_denoiser@192.168.1.2:9090"
- Optionally, run the model on a single board for comparison:
PYTHONPATH=$(pwd) python3 cosmos1/models/diffusion/inference/text2world.py --checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-1.0-Diffusion-7B-Text2World --prompt "$PROMPT" \
--video_save_name Cosmos-1.0-Diffusion-7B-Text2World \
--disable_prompt_upsampler \
--offload_guardrail_models \
--offload_prompt_upsampler
Comparing the Results
Running the model on a single Jetson Orin board:
[01-18 00:06:11|INFO|cosmos1/models/diffusion/model/model_t2w.py:222:x0_fn] Running for noise level: 0.0020000000949949026
[01-18 00:09:39|INFO|cosmos1/models/diffusion/model/model_t2w.py:254:x0_fn] Denoising operation total time: 207.94 seconds.
[01-18 00:10:23|INFO|cosmos1/models/diffusion/inference/world_generation_pipeline.py:367:generate] Finish generation
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[01-18 00:10:25|INFO|cosmos1/models/diffusion/inference/text2world.py:162:demo] Saved video to outputs/Cosmos-1.0-Diffusion-7B-Text2World.mp4
[01-18 00:10:25|INFO|cosmos1/models/diffusion/inference/text2world.py:163:demo] Saved prompt to outputs/Cosmos-1.0-Diffusion-7B-Text2World.txt
real 129m56.890s
user 18m8.579s
sys 47m53.788s
Running the model with the helper enabled roughly halves the wall-clock time (about 65 minutes versus 130 minutes), since the two predictions run in parallel:
[01-18 01:23:48|INFO|cosmos1/models/diffusion/model/model_t2w.py:222:x0_fn] Running for noise level: 0.0020000000949949026
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:191:log_time] Conditioned execution time: 103.60 seconds
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:191:log_time] Unconditioned execution time: 103.97 seconds
[01-18 01:25:31|INFO|cosmos1/models/diffusion/model/model_t2w.py:254:x0_fn] Denoising operation total time: 103.98 seconds.
[01-18 01:26:16|INFO|cosmos1/models/diffusion/inference/world_generation_pipeline.py:367:generate] Finish generation
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[01-18 01:26:18|INFO|cosmos1/models/diffusion/inference/text2world.py:162:demo] Saved video to outputs/Cosmos-1.0-Diffusion-7B-Text2World_remote.mp4
[01-18 01:26:18|INFO|cosmos1/models/diffusion/inference/text2world.py:163:demo] Saved prompt to outputs/Cosmos-1.0-Diffusion-7B-Text2World_remote.txt
real 65m23.528s
user 9m52.950s
sys 22m8.084s
Consistency of Results
Videos generated using a single Jetson Orin, as originally intended, are identical to those produced by splitting tasks across two devices, provided the same seed is used. All stochastic elements of the model remain on the main device, ensuring consistent outputs across different configurations.
Conclusion
This project demonstrates a straightforward technique for improving the performance of Cosmos-1.0-Diffusion models by using two Jetson AGX Orin devices. The method exploits the parallelism inherent in classifier-free guidance, which makes it effective for diffusion-based models while remaining simple to implement, though by construction it scales to exactly two devices. It will not apply to every AI model, but it is a practical way to roughly halve generation time for diffusion-based models like Cosmos-1.0.