The Kling model family from Kuaishou has iterated rapidly — from 1.0 through 1.6, 2.0, 2.1, 2.5, 2.6, and up to the current Kling 3.0. Each version addressed specific limitations. Kling 2.1, released in May 2025, introduced one capability that the earlier versions lacked: start-and-end frame conditioning.
Rather than generating a clip that begins from a single reference image and lets the model decide where to end, Kling 2.1 lets you specify both frames. The model generates the motion sequence connecting them. This particular type of structural control remains useful even as newer Kling versions have surpassed it on raw visual quality.

What Kling 2.1 Is
Kling 2.1 is Kuaishou's image-to-video model, released in May 2025 as part of the Kling AI series. It builds on the Kling 2.0 architecture with enhancements focused on:
- Start-and-end frame conditioning — control over both the opening and closing frame of the generated clip
- Dynamic facial expressions — improved life-like facial animation for character content
- Realistic motion and physics simulation — via 3D spatiotemporal joint attention mechanism
- Multiple video generation from the same prompt — producing variants for comparison
Architecture context: Kling uses a diffusion-based transformer (DiT) architecture with Kuaishou's 3D Variational Autoencoder (VAE) that enables synchronous spatiotemporal compression. The full-attention mechanism captures complex motion and details across the clip duration, which is why the Kling family maintains character stability across generated clips better than many alternatives.
Start-and-End Frame Conditioning
This is Kling 2.1's most distinctive contribution, relative both to its predecessors and to most other video models.
Standard image-to-video (one frame): Upload a starting image. Describe the motion. The model generates a clip that begins from your image and ends wherever the generation takes it. You control the start; the model controls the end.
Start-and-end frame conditioning (two frames): Upload both a starting image and an ending image. Describe the motion connecting them. The model generates a clip that begins from the first image and ends at the second image, with motion that creates a coherent sequence between the two states.
Why this matters:
For content with a specific visual transformation — a before-and-after, a camera move that must end in a defined position, a character that needs to arrive at a specific pose — two-frame control gives you compositional precision that standard single-start-frame generation cannot provide.
Example use cases:
- A product reveal: start frame shows the closed packaging; end frame shows the product open and displayed. The model generates the opening sequence.
- A camera move with defined endpoints: start frame is a wide shot; end frame is a tight close-up on the subject. The model generates the push-in motion.
- A transformation: start frame shows one state; end frame shows another (before/after styling, seasonal change in a scene). The model generates the transition.
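To make the two-frame workflow concrete, here is a minimal sketch of how such a request might be assembled. Every field name (`start_image`, `end_image`, `duration`) and the payload shape are hypothetical placeholders for illustration — this is not Kling's actual API schema, so check your provider's documentation for the real parameter names.

```python
import base64


def build_two_frame_request(start_image: bytes, end_image: bytes,
                            prompt: str, duration_s: int = 5) -> dict:
    """Assemble a payload for start-and-end frame conditioning.

    All field names here are illustrative placeholders, not Kling's
    real API schema -- consult your provider's docs for actual names.
    """
    encode = lambda img: base64.b64encode(img).decode("ascii")
    return {
        "model": "kling-2.1",
        "start_image": encode(start_image),  # anchors the opening frame
        "end_image": encode(end_image),      # anchors the closing frame
        "prompt": prompt,  # describes the motion BETWEEN the two frames
        "duration": duration_s,
    }
```

The key design point is that both images ship with the request: the model treats them as fixed anchors, and the text prompt only has to describe the connecting motion.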
Motion and Character Quality
Kling 2.1 inherits the Kling family's characteristic strength in character stability and motion coherence. The 3D spatiotemporal attention mechanism means that characters in generated clips maintain consistent appearance and move naturally — a person walking looks like a person walking, not a series of inconsistent frames.
Dynamic facial expressions were a specific focus in 2.1. Earlier Kling versions produced faces with limited expression range during motion. Kling 2.1 addressed this, producing more life-like facial animation during speech, emotion, and natural movement.
For content centered on human subjects — presenters, characters, performers — the facial animation quality in Kling 2.1 is noticeably better than Kling 1.x versions.
Where Kling 2.1 Fits in the Kling Family
| Model | Key strength | Resolution | Best for |
|---|---|---|---|
| Kling 2.1 | Start+end frame conditioning | 720p/1080p | Controlled two-frame transitions |
| Kling 2.5 Turbo | Speed | 1080p | Fast social content, iteration |
| Kling 2.6 | Character motion + native audio | 1080p | Quality social content with audio |
| Kling 3.0 | Max quality | 4K, 60fps | Commercial, hero shots |
Kling 2.1 is not the right model when maximum visual quality is the priority — Kling 3.0 significantly surpasses it on that dimension. It is the right model when start-and-end frame compositional control is the specific requirement.
Prompting for Kling 2.1
For standard image-to-video (one start frame):
[Subject] [action],
[camera movement],
[environment], [lighting],
cinematic quality, smooth motion
For start-and-end frame conditioning: The prompt describes the motion sequence connecting the two frames. Focus on what happens between start and end — not just the states, but the transition:
[Subject] transitions from [start state] to [end state],
[how the motion feels — smooth, deliberate, dynamic],
[camera behavior during the transition],
cinematic quality
The model uses both reference images as anchors and generates motion that is coherent with both.
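Filling in the template is mechanical. As a sketch, a small helper can keep the slot order consistent — the helper itself is hypothetical (the template comes from this guide, not from Kling's tooling), and the product-reveal values are just an example:

```python
def two_frame_prompt(subject: str, start_state: str, end_state: str,
                     motion_feel: str, camera: str) -> str:
    """Fill the start-and-end frame prompt template from this guide.

    The slot order mirrors the template above; the helper is only a
    convenience and is not part of any Kling API.
    """
    return (
        f"{subject} transitions from {start_state} to {end_state}, "
        f"{motion_feel}, "
        f"{camera}, "
        "cinematic quality"
    )


# Example: the product-reveal use case described earlier
prompt = two_frame_prompt(
    subject="a perfume bottle",
    start_state="sealed inside its closed box",
    end_state="standing open on a lit display pedestal",
    motion_feel="smooth, deliberate motion",
    camera="slow push-in that settles as the bottle is revealed",
)
```

Note that the filled prompt describes the transition, not just the two states — the states are already carried by the reference images.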
Note
Kling 2.1 is on Cliprise alongside Kling 3.0, Kling 2.6, Kling 2.5 Turbo, and 40+ other video models. Try Cliprise Free →
Related Articles
Kling model family:
- Kling 3.0 Complete Guide →
- Kling 3.0 vs Kling 2.6: Upgrade Comparison →
- Kling 2.6 Motion Control vs Kling 3.0 →
