LTX 2.3: Lightricks Releases Open-Source 4K Video Model with Native Audio
Lightricks released LTX 2.3 on March 5, 2026, the latest update to its open-source audio-video foundation model. The release stands out in a field still dominated by closed, proprietary models: LTX 2.3 is the highest-performing open-weight video generation model available as of this writing, and the only open-source model that generates native 4K video with synchronized audio.
What LTX 2.3 Is
LTX 2.3 is a 22-billion-parameter Diffusion Transformer (DiT) model that generates video and audio simultaneously from a single architecture. It is built on the same foundation as LTX 2, released in October 2025, but with targeted improvements to visual quality, prompt adherence, and audio output.
The model ships in two variants:
- LTX 2.3-22B-dev — the full model in bf16 precision, suitable for fine-tuning, LoRA training, and research workflows
- LTX 2.3-22B-distilled — 8-step distilled version for faster inference with significantly lower memory overhead
Both variants are available on Hugging Face under the Apache 2.0 license, which permits commercial use without restriction for companies under $10 million in annual revenue. Larger commercial deployments require a license from Lightricks directly.
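The practical difference between the two variants can be sketched with some back-of-the-envelope math. The 8-step count for the distilled model is from the release; the ~40-step sampling budget assumed for the dev model is our assumption, not a published figure, so treat the speedup as illustrative:

```python
# Rough cost comparison of the two LTX 2.3 variants.
# ASSUMPTION: the dev model samples with ~40 denoising steps and
# per-step cost is roughly equal between variants. Only the
# distilled model's 8-step count comes from the release notes.

dev_steps = 40        # assumed sampling budget for LTX 2.3-22B-dev
distilled_steps = 8   # published step count for the distilled variant

speedup = dev_steps / distilled_steps

print(f"distilled variant: ~{speedup:.0f}x fewer denoising steps")
```

Under that assumption, the distilled variant does roughly a fifth of the denoising work per clip, which is where most of its latency and memory advantage comes from.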
Key Improvements Over LTX 2
The jump from LTX 2 to 2.3 is not incremental. Three core components were rebuilt:
New VAE (variational autoencoder). The rebuilt encoder produces noticeably sharper output — textures, facial features, and small objects retain detail across the full frame. The improvement is most visible at higher resolutions where LTX 2 produced soft output.
4x larger text connector. Better understanding of complex prompts — multi-element scenes, specific camera movements, abstract concepts. Prompt drift (where the output diverges from the prompt in later frames) is significantly reduced.
Improved HiFi-GAN vocoder. Cleaner audio generation with stereo output at 24 kHz. The previous version's audio occasionally produced artifacts and silence gaps; the new vocoder produces more consistent ambient sound, environmental audio, and speech synchronization.
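The audio spec is easy to sanity-check with buffer math. The 20-second maximum, stereo channel count, and 24 kHz rate are from the spec table below; the 16-bit PCM sample width is our assumption, since the release does not specify an on-disk format:

```python
# Buffer math for a maximum-length LTX 2.3 audio track:
# 20 s of stereo audio at 24 kHz.
# ASSUMPTION: 16-bit PCM samples (sample format is not published).

SAMPLE_RATE_HZ = 24_000
CHANNELS = 2           # stereo output
DURATION_S = 20        # maximum clip length
BYTES_PER_SAMPLE = 2   # 16-bit PCM (assumption)

samples_per_channel = SAMPLE_RATE_HZ * DURATION_S   # 480,000
total_samples = samples_per_channel * CHANNELS      # 960,000
raw_bytes = total_samples * BYTES_PER_SAMPLE        # ~1.9 MB uncompressed

print(samples_per_channel, total_samples, raw_bytes)
```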
Technical Specifications
| Specification | LTX 2.3 |
|---|---|
| Parameters | 22 billion (dual-stream: ~14B video, ~5B audio) |
| Maximum duration | 20 seconds per clip |
| Resolution | Up to 4K native |
| Frame rate options | 24 FPS / 48 FPS |
| Aspect ratios | 16:9 (landscape), 9:16 (portrait — native vertical) |
| Audio | Synchronized generation, stereo 24 kHz |
| License | Apache 2.0 |
The native portrait (9:16) support is worth highlighting specifically: most video generation models produce landscape output and require cropping for vertical formats, losing resolution and compositional control. LTX 2.3 generates 1080×1920 natively, meaning the model actually composes for vertical framing instead of cropping a horizontal frame.
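The resolution cost of cropping, which native portrait support avoids, is straightforward to quantify. A 9:16 crop out of a 1080p landscape frame keeps the full 1080-pixel height but only a 607-pixel-wide slice:

```python
# Why native 9:16 matters: the largest vertical crop that fits
# inside a 1920x1080 landscape frame keeps only a fraction of the
# pixels that a natively composed 1080x1920 frame carries.

def crop_to_portrait(width: int, height: int,
                     aspect_w: int = 9, aspect_h: int = 16) -> tuple[int, int]:
    """Largest centered 9:16 crop that fits in a landscape frame."""
    crop_w = height * aspect_w // aspect_h  # keep full height, trim width
    return crop_w, height

cropped = crop_to_portrait(1920, 1080)   # (607, 1080)
native = (1080, 1920)                    # LTX 2.3's native portrait output

cropped_px = cropped[0] * cropped[1]
native_px = native[0] * native[1]
print(f"crop from 1080p: {cropped}, "
      f"{cropped_px / native_px:.0%} of native portrait pixels")
```

The cropped frame carries under a third of the pixels of the native 1080×1920 output, before even considering that the model never composed for the vertical frame in the first place.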
Generation Modes
LTX 2.3 supports seven generation modes through separate endpoints (the standard and fast variants count separately):
- Text-to-video (standard and fast variant)
- Image-to-video (standard and fast variant)
- Audio-to-video (provide audio, generate matching visuals)
- Extend-video (continue an existing clip)
- Retake-video (regenerate specific sections of an existing clip)
The extend-video and retake-video modes are particularly useful for production workflows — they allow clip-by-clip construction of longer sequences and section-level regeneration without discarding an entire generation.
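Planning a longer sequence against the 20-second cap reduces to simple arithmetic. The cap is from the spec table; the assumption that each extend-video call can add up to a full 20-second segment is ours, and the endpoint names here are placeholders, not documented API paths:

```python
# Sketch: how many generations a target duration needs when built
# clip-by-clip with extend-video.
# ASSUMPTION: each extend call adds up to a full 20 s segment.

import math

MAX_CLIP_S = 20  # published per-clip maximum

def plan_sequence(target_s: float) -> dict:
    """One initial generation plus extend-video calls for the rest."""
    segments = math.ceil(target_s / MAX_CLIP_S)
    return {"initial": 1, "extends": max(segments - 1, 0)}

print(plan_sequence(60))  # → {'initial': 1, 'extends': 2}
print(plan_sequence(15))  # → {'initial': 1, 'extends': 0}
```

Combined with retake-video, this means a bad 5-second stretch in a 60-second sequence costs one regeneration call rather than three full generations.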
Performance Benchmarks
LTX 2.3 is ranked as the top open-source video model on the Artificial Analysis benchmark as of its release date. On the broader leaderboard, which includes closed models, it ranks behind Kling 3.0 (Elo 1,244) and Runway Gen-4.5 (Elo 1,225) on perceptual quality. But the resolution gap (native 4K versus the 1080p standard of most models) and the cost differential (local or API access versus subscription-only) change the comparison for many workflows.
On speed: benchmarks show LTX 2.3 is approximately 18x faster than Wan 2.2 at equivalent quality settings.
Hardware Requirements
Running LTX 2.3 locally requires an NVIDIA GPU. Full 4K generation with the bf16 base model requires approximately 44GB of VRAM. Quantized variants (FP8) reduce this to around 24GB, making the model viable on consumer hardware like the RTX 4090 or RTX 5090.
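The quoted VRAM figures follow almost directly from the parameter count, since the weights dominate memory use. Activations, the VAE, and the text connector add overhead on top, so these are lower bounds rather than exact requirements:

```python
# Reproducing the article's VRAM figures from parameter count alone.
# Weights dominate, but real inference adds activations and other
# components, so treat these as lower bounds.

PARAMS = 22e9  # 22-billion-parameter model

def weights_gb(bytes_per_param: float) -> float:
    """Raw weight memory in GB for a given precision."""
    return PARAMS * bytes_per_param / 1e9

bf16 = weights_gb(2)  # 44.0 GB -- matches the ~44GB full-precision figure
fp8 = weights_gb(1)   # 22.0 GB -- plus overhead lands near the quoted ~24GB

print(f"bf16 weights: {bf16:.0f} GB, fp8 weights: {fp8:.0f} GB")
```

This also explains why FP8 quantization is the threshold that brings the model onto 24GB consumer cards: the weights alone just fit, with a couple of gigabytes left for everything else.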
For cloud-based generation without local hardware, LTX 2.3 is available on fal.ai and other API providers. Desktop use is also supported through LTX Desktop Beta, a free, open-source desktop application released alongside the model for Windows (NVIDIA GPU required) and Mac (Apple Silicon, API mode only).
What This Means for AI Video Production
The arrival of a 4K open-source video model with native audio changes the cost structure for independent creators and studios who need customization options. The Apache 2.0 license allows fine-tuning on custom datasets, integration into commercial pipelines, and self-hosting — capabilities that proprietary models cannot offer.
LTX 2.3 does not replace high-end proprietary models like Kling 3.0 or Sora 2 for raw perceptual quality at standard resolutions. But for workflows where resolution, duration (20 seconds), cost efficiency, or model customization are the primary requirements, it is now the strongest available option in the open-weight category.
Related on Cliprise
If you are working with AI video generation on Cliprise, the following comparisons and guides are relevant context:
- Sora 2 vs Kling 3.0 vs Veo 3.1: Which Model for Your Use Case?
- Best AI Video Generator 2026: Real Tests, Real Costs, Real Verdict
- Seedance 2.0 Complete Guide: Audio-Video Joint Generation
- AI Video Generation 2026: 22+ Models, Workflows, and What Actually Works
Cliprise currently provides access to Seedance 2.0, Kling 3.0, Veo 3.1, Sora 2, and Runway Gen-4 Turbo for cloud-based video generation across image, video, and audio workflows.
Workflow tested on Cliprise with Seedance 2.0, Kling 3.0, and 47+ AI models. Sources: Lightricks official release, fal.ai LTX-2.3 documentation, Artificial Analysis benchmark data.