When we covered the Wan 2.7 Image launch on April 1, we noted that the release was image-first and that the Wan video stack remained at 2.6. That was true on April 1. It was not true for long.
By April 6, Alibaba's Tongyi Lab had released the full Wan 2.7 suite — including text-to-video, image-to-video, reference-to-video, and instruction-based video editing. The four models became available on Together AI starting April 3, with broader API availability following through the week. All of it under Apache 2.0. All of it commercially usable without a platform subscription. All of it at pricing that makes high-volume production workflows economically viable in ways they were not before.
This is not an update to what we covered previously. It is a different product announcement that happens to share a version number. The image capabilities were a significant improvement. The video suite, taken as a whole, represents something more fundamental: Alibaba has shipped the most complete open-source video production stack available anywhere, in a single release.
The Four Models and What Each Does
The Wan 2.7 video suite is four distinct models under one roof. Each addresses a different step in the video production pipeline, and shipping all four together from day one is the decision that makes this release structurally different from what came before.
Wan 2.7 T2V (Text-to-Video) is the generation entry point. It accepts a text prompt and produces a clip at 720p or 1080p, with durations selectable from 2 to 15 seconds. Multi-shot narrative control is built in — prompt language that implies scene structure produces appropriately sequenced output rather than a single locked shot. An optional audio input lets the model synchronize motion to a provided audio track during generation rather than as a post-processing step. This is architecturally different from bolting on audio afterward: the motion and the audio are generated in the same pass, with the model treating the audio timeline as a conditioning signal. At the Together AI endpoint Wan-AI/wan2.7-t2v, pricing starts at $0.10 per second for serverless inference.
Wan 2.7 I2V (Image-to-Video) takes a still image and generates motion from it. The model produces physically plausible movement — materials behave as they should, lighting responds correctly to the motion, object interactions follow physics — rather than the jittery surface-level animation that characterized earlier image-to-video approaches. Resolution up to 1080p, clip length up to 15 seconds. The key capability addition over previous Wan I2V models: first-and-last-frame control. You provide the opening frame and the closing frame, and the model generates the motion between them. This sounds like a small feature. In production it is a significant workflow change.
Without first-and-last-frame control, generating a video that ends in a specific visual state requires either accepting whatever the model produces and hoping it works, or running many iterations until you get the result you want. With it, you define the narrative arc — start here, end there — and the model fills in the movement. For product videos that need to land on a specific hero shot, for character sequences that need to connect to the next scene, for any content where the ending frame matters as much as the opening, this control removes the trial-and-error loop that makes AI video generation expensive at scale.
Wan 2.7 R2V (Reference-to-Video) is the model that does not have a direct equivalent in the current Cliprise lineup and represents the most genuinely novel capability in the suite. It accepts up to five reference inputs simultaneously — any combination of images, video clips, and audio files — and uses them to generate a new video that maintains the identity, style, and audio characteristics specified in those references.
The specific implementation uses explicit character binding: you label each reference as image1, video1, audio1, and so on, then reference those labels in the generation prompt — "character from image1 performing the action shown in video1, speaking in the voice from audio1." The model processes these references as simultaneous conditioning signals rather than sequential instructions, which means the generated output reflects all of them coherently rather than defaulting to whichever reference was processed last.
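The labeling scheme can be sketched as a small payload builder. The label convention (image1, video1, audio1) and the five-reference cap are from the release notes; the surrounding field names are assumptions.

```python
# Sketch of the explicit character-binding scheme: each reference gets a
# typed label, and the prompt refers to those labels. Field names are
# illustrative assumptions; the labeling convention is from the release.

def build_r2v_request(references: dict[str, str], prompt: str) -> dict:
    if len(references) > 5:
        raise ValueError("Wan 2.7 R2V accepts at most five references")
    return {
        "model": "Wan-AI/wan2.7-r2v",
        "references": references,  # label -> asset URL
        "prompt": prompt,
    }

req = build_r2v_request(
    {"image1": "spokesperson.png",
     "video1": "walk_cycle.mp4",
     "audio1": "brand_voice.wav"},
    "character from image1 performing the action shown in video1, "
    "speaking in the voice from audio1",
)
```

Because all labeled references are submitted in one request, the model can condition on them simultaneously, which is the coherence property described above.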
The voice cloning capability within R2V is notable. Supply an audio reference, and the model generates character voice in that register — not post-processing the voice onto a silent generation, but generating video content where the character's motion and speech are synchronized to the reference voice from the beginning. For branded content with a consistent spokesperson, for training videos where audio consistency across multiple clips matters, for any workflow that requires a recognizable voice alongside recognizable appearance, R2V addresses a production challenge that most AI video tools have not touched.
Wan 2.7 VideoEdit is instruction-based editing on existing video. Supply a clip and a text description of what you want changed — the background, the lighting, a character's clothing, the time of day, the color treatment — and the model applies those changes while preserving the underlying motion structure of the original. This is the same category of capability that Runway Aleph introduced — in-context editing rather than regeneration. Where Aleph uses 3D spatial reconstruction to achieve this, Wan 2.7 VideoEdit uses a different approach, and the results across the two tools vary by use case in ways that independent testing is still mapping.
One billing detail worth noting upfront: VideoEdit charges for input duration plus output duration. An edit applied to a 5-second clip produces a 10-second billing event. This is consistent with how the processing actually works, but it means the cost structure for high-volume editing pipelines needs to account for this rather than assuming standard per-second pricing.
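The input-plus-output rule is easy to misbudget, so it is worth making concrete. The per-second rate below is a placeholder assumption; only the billing rule itself comes from the article.

```python
# VideoEdit bills input duration plus output duration. The rate here is a
# placeholder assumption, not a published VideoEdit price.

def videoedit_cost(input_s: float, output_s: float,
                   rate_per_s: float) -> float:
    """Billable seconds are input + output, not output alone."""
    return (input_s + output_s) * rate_per_s

# An edit of a 5-second clip that returns a 5-second result is a
# 10-second billing event, as noted above.
cost = videoedit_cost(5, 5, 0.10)
```

For a high-volume pipeline, this roughly doubles the naive per-second estimate whenever input and output durations match.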
The Architecture: Why 14 Billion Active Parameters Matters
Wan 2.7 is a 27-billion-parameter model. But during inference, only 14 billion of those parameters are active at any given moment. This is a Mixture-of-Experts (MoE) architecture — a design pattern that has appeared repeatedly in the best-performing recent models because it offers a specific trade-off: the representational capacity of a large model at the inference cost of a much smaller one.
The 128 expert networks in Wan 2.7 each specialize in different aspects of video generation. For any given generation, the model routes computation to the 8 most relevant experts for that specific input. The routing happens dynamically based on the content — a generation involving human facial expression routes to different experts than a generation involving complex fluid physics, which routes to different experts than a generation involving rapid camera motion. Each expert has been trained to handle its specialized domain well. The model as a whole handles everything well because the right combination of experts handles each specific case.
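The routing step described above is the standard top-k MoE pattern. A minimal sketch, using the 128-expert, top-8 figures cited for Wan 2.7 but otherwise generic (this is not Alibaba's router code):

```python
# Minimal top-k MoE routing sketch: a router scores all experts for a
# given input and only the k highest-scoring experts run, with their
# outputs mixed by softmax-normalized weights.
import math
import random

NUM_EXPERTS, TOP_K = 128, 8  # figures cited for Wan 2.7

def route(logits: list[float]) -> tuple[list[int], list[float]]:
    """Select the TOP_K highest-scoring experts; softmax their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-TOP_K:]
    m = max(logits[i] for i in top)                   # for numerical stability
    exp = [math.exp(logits[i] - m) for i in top]
    s = sum(exp)
    return top, [e / s for e in exp]                  # weights sum to 1

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in router logits
experts, weights = route(scores)
```

Only the 8 selected expert networks execute for this input, which is why inference cost tracks the active parameter count rather than the total.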
The practical consequence: Wan 2.7 at 14 billion active parameters runs at inference speeds comparable to a 14B dense model while achieving quality that reflects 27B of total learned representation. For the open-source deployment case — running Wan 2.7 on local hardware — this means the model fits in less GPU memory than you might expect from a 27B model. The Wan 2.5 complete guide covers the architectural lineage in more detail for those interested in how the Wan series arrived at this design.
How This Changes the Wan 2.7 vs Wan 2.6 Question
The question "should I use Wan 2.6 or Wan 2.7" is now answerable with some specificity.
Wan 2.6 was primarily a multi-shot narrative release. It introduced shot marker prompt syntax, improved audio-visual synchronization, and extended clip duration to 15 seconds. The 2.6 architecture excelled at the specific use case of narrative video — telling a story across multiple shots within a single generation, with appropriate camera positioning and scene transitions. If multi-shot narrative structure is your primary use case and you are currently satisfied with 2.6's quality, there is no urgent reason to migrate.
The cases where 2.7 is clearly better:
Any workflow requiring first-and-last-frame control. Wan 2.6 does not have this. If you need to specify where a clip ends, you are using 2.7.
Any workflow using reference-based character or voice consistency. Wan 2.6 has no R2V equivalent. The five-reference R2V capability is new with 2.7.
Any workflow that involves editing existing video rather than generating new footage. Wan 2.6 has no video editing capability. VideoEdit is new with 2.7.
1080p output as standard. Wan 2.6 generates at 720p natively. Wan 2.7 generates at 1080p. For deliverables where resolution matters — broadcast, large-format display, any context where the video will be shown on a large screen — this is a relevant difference.
Native audio sync from the model rather than post-processing. Wan 2.6 supported audio in a limited way. Wan 2.7's audio conditioning at the generation level produces better temporal synchronization.
For users currently running Wan 2.6 workflows, the practical migration path is: evaluate R2V for any character-consistency requirements you have been working around with other tools, test VideoEdit against Runway Aleph for your specific editing use cases, and switch T2V and I2V to 2.7 endpoints where the 1080p resolution and first-and-last-frame control are useful.
The Open-Source Positioning and Why It Matters
Every major AI video generation model in the current Cliprise lineup — Kling 3.0, Veo 3.1, Hailuo 2.3, Runway Aleph — is a closed, hosted model. You access it through an API or a platform. You do not have the weights. You cannot modify it, fine-tune it for your specific use case, or deploy it in an environment where data cannot leave your infrastructure.
Wan 2.7 is different. Apache 2.0 means commercial use, fine-tuning, modification, and redistribution are all permitted without restriction. The weights are downloadable. You can run it locally. You can fine-tune it on your own data. You can deploy it inside infrastructure that never sends data to a third party. For organizations with data governance requirements — healthcare, legal, enterprise content teams with strict IP policies — this is not a minor feature. It is the difference between the tool being usable at all and not being usable.
The community ecosystem that open-source licensing enables is also a compounding advantage. Within weeks of the Wan 2.5 and 2.6 releases, the community had produced fine-tuned variants for specific aesthetics, integration guides for ComfyUI and other workflow tools, and optimization techniques that improved generation speed on consumer hardware. Wan 2.7 enters a more mature ecosystem with more people capable of extending it immediately.
Wan 3.0 has already been pre-announced. Alibaba has confirmed 60 billion parameters, targeting 4K resolution and 30-second generation, expected mid-2026 under Apache 2.0. If those specifications hold, it would be the first open-source model at the quality and resolution tier currently occupied only by the premium closed models. The prompting techniques and workflow patterns built on Wan 2.7 carry forward to 3.0.
Wan 2.7 on Cliprise
The Wan series has been central to the Cliprise lineup through the 2.5 and 2.6 generations. Wan 2.5 remains available for workflows where its specific characteristics are the right fit. Wan 2.6 is the primary recommendation for multi-shot narrative video.
Wan 2.7 integration is being evaluated and tracked. The combination of 1080p output, first-and-last-frame control, reference-based character consistency, and instruction editing directly addresses the workflow gaps that Cliprise users have been navigating across multiple tools. An update on timeline and availability will follow as integration work progresses.
For the complete picture of where Wan 2.7 fits in the current competitive landscape, the AI video generation guide for 2026 covers all major models in their current positions. The best AI video models comparison is maintained to reflect current availability.
For developers who want to work with Wan 2.7 directly, Together AI currently hosts the full four-model suite at the endpoints Wan-AI/wan2.7-t2v, Wan-AI/wan2.7-i2v, Wan-AI/wan2.7-r2v, and Wan-AI/wan2.7-videoedit. The official Alibaba documentation and model weights are available through Alibaba Model Studio and Hugging Face.
The Wan series has been, since Wan 2.1, the benchmark for what open-source AI video is capable of. Wan 2.7 extends that position into territory — instruction editing, voice-cloned character video, first-and-last-frame control — that was previously the exclusive domain of proprietary tools. The case that the closed models maintain a quality gap large enough to justify their restrictions gets harder to make with each successive Wan release.
