
Wan Animate Complete Guide: Character Animation and Replacement From a Single Image on Cliprise

Wan Animate is Alibaba's open-source character animation model. Feed it a character image and a reference video: it animates the character with the reference's movement and expressions, or replaces the reference character entirely while matching the scene's lighting. This guide covers how it works, how to use it, and when to choose it over Runway Aleph, ByteDance Omni Human, or dedicated generation models on Cliprise.

11 min read

Wan Animate addresses a specific production problem that most AI video models do not solve.

You have a character. You want that character to perform a specific sequence of movements, expressions, and actions. You do not want to describe every detail of the motion in a text prompt and hope the model interprets it correctly. You want to show the model what you want the character to do, in concrete video form, and have the model apply that performance to your character.

This is what Wan Animate does. You provide a character image and a reference video. The model uses the reference video's skeleton motion and facial expressions to drive the character image, producing a new video where your character performs the reference's actions. Alternatively, it replaces the reference character with your character entirely while preserving the scene's lighting and environmental context.

This is a different category of capability from text-to-video generation or general video editing. It is character animation through reference performance, and it is worth understanding on its own terms.


What Wan Animate Actually Is

Wan Animate is part of Alibaba's Wan 2.2 series, released under open-source licensing. It is a 14-billion parameter model built for character animation and character replacement. Wan Animate separates body motion, facial expression, and scene integration instead of treating animation as one undifferentiated task. That design helps explain why it preserves identity during motion transfer more reliably than many prompt-led video models.

The two operational modes are distinct and worth understanding separately:

Animation Mode takes a character image as input and a reference video as a performance guide. The model animates the character image, making it perform the actions and expressions shown in the reference video. The output is a new video of your character doing what the reference character did.

Replacement Mode takes a character image and an existing video that contains a different character. The model replaces the character in the video with the one from your image, preserving the scene's lighting, camera movement, and environmental context. The output is the original scene with a different character in it.

Both modes are built on the same underlying framework: spatially-aligned skeleton signals for body motion, implicit facial feature extraction for expression, and a Relighting LoRA module for environmental integration in replacement scenarios.


How the Architecture Works

At a high level: the reference video supplies motion and expression cues; the model maps them onto your still character while trying to keep identity stable. Replacement Mode adds relighting so the inserted character picks up the plate's lighting instead of looking pasted in.
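
To make the skeleton-driven idea concrete, here is a toy retargeting sketch, not Wan Animate's actual implementation: reference joint positions are converted to bone offsets, rescaled to the target character's proportions, and reassembled. That rescale-and-reassemble move is the basic mechanic behind applying one performer's motion to a differently proportioned character.

```python
import numpy as np

# Toy illustration only: real skeleton transfer runs per-frame over a full
# joint hierarchy, but the core operation is the same rescale-and-reassemble.
ref_pose = np.array([[0.0, 0.0],    # hip (root)
                     [0.0, 1.0],    # shoulder
                     [0.5, 1.8]])   # hand, mid-gesture
char_bone_scale = 1.25              # target character's limbs are 25% longer

def retarget(pose, scale):
    """Rescale joint-to-joint offsets while keeping the root position fixed."""
    offsets = np.diff(pose, axis=0) * scale          # bone vectors, rescaled
    return np.vstack([pose[:1], pose[0] + np.cumsum(offsets, axis=0)])

print(retarget(ref_pose, char_bone_scale))
```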

For Replacement Mode specifically, the Relighting LoRA module handles environmental integration. When you replace a character in a scene, the new character needs to pick up the original environment's lighting and color tone; otherwise the composite looks pasted rather than shot. The Relighting module applies the scene's lighting characteristics to the new character while preserving their visual identity, which is what makes Replacement Mode read as composited rather than obviously swapped.

The output resolution for Wan Animate is 720p at 24fps. Generation time varies depending on input length and complexity, typically 3 to 8 minutes for a short clip.

Input specifications (a pre-flight check sketch follows the list):

  • Video files: less than 200MB, minimum side resolution greater than 200 pixels, maximum side less than 2048 pixels
  • Video duration: 2 to 30 seconds
  • Aspect ratio: between 1:3 and 3:1
  • Image files: less than 5MB, supporting JPG, PNG, JPEG, WebP, BMP formats
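
If you want to catch spec violations before spending credits, a local pre-flight check is straightforward. A minimal sketch in Python, assuming ffprobe (part of FFmpeg) is installed and on PATH; the limit constants mirror the list above:

```python
import json
import os
import subprocess

MAX_VIDEO_BYTES = 200 * 1024 * 1024   # video must be < 200MB
MAX_IMAGE_BYTES = 5 * 1024 * 1024     # image must be < 5MB
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

def probe_video(path):
    """Read width, height, and duration via ffprobe's JSON output."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height:format=duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)
    stream = data["streams"][0]
    return stream["width"], stream["height"], float(data["format"]["duration"])

def preflight(video_path, image_path):
    """Return a list of spec violations; an empty list means inputs look valid."""
    errors = []
    if os.path.getsize(video_path) >= MAX_VIDEO_BYTES:
        errors.append("reference video is not under 200MB")
    w, h, duration = probe_video(video_path)
    if min(w, h) <= 200:
        errors.append(f"minimum side {min(w, h)}px must exceed 200px")
    if max(w, h) >= 2048:
        errors.append(f"maximum side {max(w, h)}px must be under 2048px")
    if not 2 <= duration <= 30:
        errors.append(f"duration {duration:.1f}s is outside the 2-30s range")
    if not 1 / 3 <= w / h <= 3:
        errors.append(f"aspect ratio {w / h:.2f} is outside the 1:3-3:1 range")
    if os.path.getsize(image_path) >= MAX_IMAGE_BYTES:
        errors.append("character image is not under 5MB")
    if os.path.splitext(image_path)[1].lower() not in IMAGE_EXTS:
        errors.append("character image is not JPG/PNG/JPEG/WebP/BMP")
    return errors

print(preflight("reference.mp4", "character.png") or "inputs look valid")
```

Running this on each image-video pairing before queueing a batch is cheaper than discovering a rejected input mid-session.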

Best Input Setup for Wan Animate

  • Clean reference video. Stable exposure, minimal motion blur on the performer, and a clear view of the body for skeleton-driven moves.
  • Similar framing between character still and reference (full body vs full body, portrait vs portrait).
  • Visible face when you care about expression or dialogue; mouth occlusion breaks lip-driven output fast.
  • Calm background on the reference. Busy crowds or heavy occlusion make tracking noisier.
  • Well-lit character image with the subject readable; extreme crop or silhouette shots are higher risk.

What Wan Animate Is Particularly Good At

Character performance from reference video. The core use case. You have a character design or a photo of someone (only with permission and appropriate rights), and you want that character to perform specific movements or expressions. Record or source a reference video of someone performing what you want, and Wan Animate applies that performance to your character.

Character replacement in existing footage. Wan Animate is built for swapping characters in complex scenes with moving cameras and uneven lighting, with relighting aimed at integrating the new subject into the plate.

Lip sync driven by reference speech. When the reference video includes speech, Wan Animate transfers mouth movement and facial articulation toward the character image. Quality varies with reference clarity and framing; compare a short test clip against your delivery standard before you batch.

Dance and choreography transfer. Complex body movement (dance, athletic choreography, stylized movement) transfers through the skeleton alignment system. The body motion often reads as more coherent than prompt-only animation, especially when the reference performance is clean.

Longer sequences with character identity preservation. The model's design targets appearance consistency across the clip, which is a common failure mode for prompt-only video generation.


Policy and Ethics in Reference-Driven Animation

Use performances you have rights to. The reference video should be footage you filmed, licensed, or otherwise have permission to use for derivative generation. Do not pull strangers' clips from the open web and drive a commercial character without clearing performance rights.

Do not use Replacement Mode to impersonate real people without consent. Swapping a face or body into existing footage can create misleading depictions. Match your use to platform rules, client contracts, and local regulation.

Disclose synthetic or heavily assisted content when your channel or client requires it. Wan Animate output is a composite pipeline, not raw camera negative.

If you are unsure, treat Wan Animate like VFX: clear the performance chain before you ship.


Where Wan Animate Is Not the Right Choice

Text-to-video generation. Wan Animate is not a generation model. It animates existing characters using reference video. If you need to create new video content from a text prompt, Wan 2.6, Kling 3.0, or Veo 3.1 are the relevant options inside the AI video generator.

Single-image-to-video without reference. If you have a character image but no reference video, and you want motion implied by a text description, traditional image-to-video models like Hailuo 2.3 or Wan 2.7 I2V are better suited.

In-context editing of existing footage. If your goal is to modify existing video content by adding objects, changing lighting, or generating new camera angles, Runway Aleph is the dedicated tool for that use case. Wan Animate handles character substitution and performance transfer; Aleph handles broader scene edits.

4K output delivery. Wan Animate's output is 720p. For content requiring higher resolution, plan upscaling in post or a different pipeline.
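
If 720p delivery is the blocker rather than the workflow, a plain FFmpeg upscale is one baseline option; dedicated AI upscalers generally hold detail better on generated footage. A sketch with placeholder filenames that doubles a 1280x720 render to 1440p with Lanczos resampling:

```python
import subprocess

# Doubles a 1280x720 render to 2560x1440; audio (if any) is passed through.
subprocess.run([
    "ffmpeg", "-i", "wan_animate_720p.mp4",
    "-vf", "scale=2560:1440:flags=lanczos",
    "-c:a", "copy",
    "wan_animate_1440p.mp4",
], check=True)
```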

Very short or very long clips. The supported range is 2 to 30 seconds. Within that range, the model matches its design assumptions. Outside of it, expect friction.


Choosing: Wan Animate, Runway Aleph, or ByteDance Omni Human

Wan Animate fits when you already have a reference performance on video and need that exact timing, pose, or dialogue motion applied to a still character, or when you must replace a performer in a plate while keeping camera and lighting.

Runway Aleph fits when the job is scene-level editing (objects, relight, reframing) rather than "make this still character copy this take." Aleph preserves the original scene grammar; Wan Animate preserves the scene but swaps or drives the character.

ByteDance Omni Human fits when you start from a single reference image plus script or audio and want a talking or presenting human without filming a full reference performance. Omni Human is audio- and script-forward; Wan Animate is reference-video-forward.

Rule of thumb: if you have a filmed reference performance you want copied onto a design, use Wan Animate. For shot rebuilds or magic-edit style changes, use Aleph. For a single-image presenter driven by speech, use Omni Human.


Production Workflows That Use Wan Animate

Virtual performer from concept art. A game studio has concept art for a character but no animation pipeline built for them. They film an actor performing the character's intended movements, use Wan Animate to transfer those movements to the concept character, and generate animation footage for pre-visualization or marketing without building a full rigging and animation pipeline.

Brand mascot in dynamic scenarios. A brand has a static mascot design used in print and digital marketing. They film a human performing a dance, a sports sequence, or a dramatic scene, then use Wan Animate to make the mascot perform the same actions. The mascot gets dynamic content without requiring a full animation production.

Character replacement for pre-visualization. A production team is planning a scene and wants to see how a specific character would look in existing reference footage. They use Wan Animate in Replacement Mode to drop their character into the reference scene, preserving the lighting and environmental context while showing the production how the character would fit.

Content localization with different performers. An ad campaign features a performer delivering a message in one market. For a different market, a different performer records the localized version. Wan Animate can transfer the original performance's energy and timing to the new performer's face, maintaining campaign consistency across localizations.

AI influencer content production. For creators building virtual characters or AI-generated personas, Wan Animate provides a direct path from character design to dynamic video content. Film the intended action with a permitted performer as reference, then apply it to the virtual character. This is often faster than generating each piece of content from scratch with text alone.


How Wan Animate Compares to Alternatives

Wan Animate vs Runway Aleph. Both touch video modification, but differently. Aleph does in-context editing while preserving the original scene and characters. Wan Animate does character-specific work: animate a new character using reference performance, or replace an existing character while keeping the scene intact. For character workflows specifically, Wan Animate is the direct fit. For broader video editing, Aleph.

Wan Animate vs Hailuo 2.3 image-to-video. Hailuo 2.3 I2V takes a starting image and generates motion from a text prompt. Wan Animate takes a character image and drives it with motion from a reference video. If you have a reference video of the exact performance you want, Wan Animate. If you only have a descriptive prompt of what the character should do, Hailuo 2.3 I2V.

Wan Animate vs Wan 2.6 and Wan 2.7. Different parts of the Wan family optimized for different tasks. Wan 2.6 and Wan 2.7 are for generation from scratch. Wan Animate is for character performance transfer. They complement each other within the same Alibaba model family.

The AI video generation complete guide for 2026 covers how all current video models fit into the broader competitive landscape with specific use case recommendations.


Prompting and Input Considerations

Wan Animate is not a traditional prompt-based model. The primary control is the inputs you provide: the character image and the reference video. Prompt engineering is secondary to reference quality.

Character image quality matters. Use well-lit images where the character is fully visible, ideally facing forward or at a three-quarter angle. Full-body or half-body compositions work better than extreme close-ups. Portraits, illustrations, and cartoon characters all work: the model handles a range of input styles.
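
A rough brightness check can flag risky character images before a render. A heuristic sketch, assuming Pillow is installed; the thresholds are arbitrary cutoffs, not anything Wan Animate or Cliprise enforces:

```python
from PIL import Image, ImageStat

# Mean luminance of the grayscale image, on a 0-255 scale.
img = Image.open("character.png").convert("L")
mean_luma = ImageStat.Stat(img).mean[0]

if mean_luma < 40:
    print(f"warning: very dark image (mean luma {mean_luma:.0f}/255)")
elif mean_luma > 220:
    print(f"warning: very bright image (mean luma {mean_luma:.0f}/255)")
else:
    print("brightness looks workable")
```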

Reference video quality matters. The cleaner the reference video, the better the output. Clear view of the performer, good lighting on the face, minimal background clutter that could confuse skeleton extraction. For dialogue shots, a clear view of the mouth and natural speech rhythm usually help.

Match input and reference framing. If your character image is a half-body shot, use a half-body reference video. If your character is stylized or illustrated, a reference video in a similar framing and scale produces better alignment than dramatically different compositions.

Consider input aspect ratios. The model supports 1:3 to 3:1 aspect ratios. Match inputs to your delivery format: if you want a vertical reel, provide vertical inputs; if you want horizontal widescreen, provide horizontal inputs.
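
One way to enforce the framing and aspect-ratio advice above is a quick orientation check on both inputs. A sketch, assuming Pillow for the image and the probe_video helper from the pre-flight sketch (or known dimensions) for the video; the 1.05/0.95 cutoffs are arbitrary:

```python
from PIL import Image

def orientation(width, height):
    """Classify framing so character image and reference video can be matched."""
    ratio = width / height
    if ratio > 1.05:
        return "horizontal"
    if ratio < 0.95:
        return "vertical"
    return "square"

img_w, img_h = Image.open("character.png").size
vid_w, vid_h = 1080, 1920   # e.g. a vertical reel reference, from probe_video

if orientation(img_w, img_h) != orientation(vid_w, vid_h):
    print("mismatch: align character image and reference video orientation")
```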

For Replacement Mode specifically, the scene you are replacing a character into should have clear shots of the character being replaced. The cleaner the original character is visible, the more reliably the model handles the substitution.


Getting Started With Wan Animate on Cliprise

Wan Animate is available on Cliprise through the AI video generator. Your existing credits apply. No separate Alibaba or Tongyi Lab account required.

For the complete Wan family context, the Wan 2.6 complete guide covers the multi-shot narrative generation model, and the Wan 2.7 video release coverage covers the current generation model lineup. The Wan 2.5 complete guide covers the preceding generation. For scene edits that are not character-transfer jobs, keep Runway Aleph in the same tab group.


FAQ

What does Wan Animate do? Wan Animate animates a character image using motion and expressions from a reference video. It has two modes: Animation Mode applies reference motion to your character image, and Replacement Mode swaps a character in an existing video with one from your image while preserving the scene's lighting.

Does Wan Animate generate video from text prompts? No. Wan Animate is specifically for character animation using reference video. For text-to-video generation, Wan 2.6, Kling 3.0, or Veo 3.1 are the relevant models.

What resolution does Wan Animate output? 720p at 24fps. For higher resolution delivery, upscaling in post-production or using a different pipeline for the final render is the standard approach.

How long can the reference video be? 2 to 30 seconds. The model works within this range; shorter clips process faster and longer clips require more compute.

What formats does Wan Animate accept? Video reference: MP4 and MOV formats, less than 200MB. Character image: JPG, PNG, JPEG, WebP, BMP formats, less than 5MB.

Is Wan Animate open source? Yes. Alibaba released Wan Animate under open-source licensing as part of the Wan 2.2 family. The model weights are available on Hugging Face, GitHub, and ModelScope for research and development.

Is Wan Animate available on Cliprise? Yes. Wan Animate is in the Cliprise video model lineup. Access it through your standard Cliprise account and credits.

Ready to Create?

Put your new knowledge into practice with Wan Animate.

Open AI Video Generator