AI video generation is different from image generation in one important way: motion has physics, and not everything you can describe in text translates cleanly to video. Understanding this from the start saves a lot of frustration.
This guide walks through the complete workflow — from opening the video generator to downloading a finished clip — with specific prompts and the reasoning behind each decision.

The Two Generation Modes
Text-to-Video (T2V)
Write a prompt describing a scene. The model generates a video clip from scratch. You have no control over the starting frame — the model decides what the first frame looks like based on your description.
Use T2V when: you want to create a scene that doesn't exist as a photo, when you want the model to make compositional decisions, or when you are generating multiple different shots.
Image-to-Video (I2V)
Upload an image (photo or AI-generated) as the starting frame. Write a prompt describing what happens after that frame — what moves, how the camera behaves. The video begins from your exact image.
Use I2V when: you have a specific visual composition you want to animate, when you generated a strong image and want to bring it to life, or when visual consistency from the first frame matters.
For beginners, start with I2V. Generate an image first with Flux 2 or Midjourney, then animate it. This gives you control over both the starting composition and the motion — and produces more predictable results than T2V alone.
Step 1: Choose Your Model
| Model | Best for | Resolution | Duration |
|---|---|---|---|
| Kling 3.0 | Best overall quality, commercial video | 4K | 5–15s |
| Veo 3.1 | Physics, weather, environmental realism | 1080p | 5–15s |
| Seedance 2.0 | Music-synchronized video | 1080p | 5–15s |
| Wan 2.6 | Multi-shot narrative sequences | 1080p | 5–15s |
| Kling 2.5 Turbo | Fast generation, social content | 1080p | 5–10s |
| Hailuo 02 | Stylized, painterly aesthetic | 1080p | 5–10s |
For your first generation, use Kling 3.0. It is the most forgiving model, producing high-quality results across the widest range of subjects and prompt styles.
Step 2: Write Your Video Prompt
Video prompts work differently from image prompts. You are not just describing what is in the frame — you are describing motion, camera behaviour, and temporal change.
The video prompt structure:
[Scene description] + [what moves and how] + [camera behaviour] + [quality/mood]
Working examples by scene type:
Product shot:
A luxury perfume bottle on a dark marble surface,
slow camera orbit clockwise around the bottle,
soft dramatic side lighting catching the glass,
premium commercial aesthetic, smooth motion
Person in environment:
A woman walking through a flower market, slow motion,
golden afternoon light, slight shallow depth of field,
handheld camera feel, cinematic, warm tones
Establishing shot:
Wide aerial view of a coastal town at sunrise,
slow push forward over the water toward the shore,
warm golden light, cinematic quality
Abstract / atmospheric:
Smoke wisps rising slowly in a dark studio,
soft coloured light from the left — deep teal,
very slow camera push in, meditative pace
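The four-part structure above can be sketched as a small helper that assembles the pieces into one prompt string. This is just an illustration of the pattern (the subject and wording in the example are invented, not taken from the guide):

```python
def build_video_prompt(scene, motion, camera, mood):
    """Assemble a prompt using the four-part structure:
    [scene description] + [what moves and how] + [camera behaviour] + [quality/mood].
    """
    parts = [scene, motion, camera, mood]
    # Join the non-empty parts, trimming stray whitespace and trailing commas
    return ", ".join(p.strip().rstrip(",") for p in parts if p)

# Hypothetical example following the same pattern as the prompts above
prompt = build_video_prompt(
    scene="A ceramic mug of coffee on a wooden table",
    motion="steam rising gently from the cup",
    camera="very slow push in",
    mood="soft morning light, cinematic",
)
```

Keeping the four slots separate makes it easy to iterate on one element (say, the camera move) while holding the rest of the prompt constant.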
What Video Models Respond To
Camera movement language works well:
- Slow push in / push forward
- Pull back / zoom out
- Orbit clockwise / counter-clockwise
- Slow pan left / right
- Tilt up / down
- Static locked-off shot
Motion quality descriptors:
- Smooth, fluid motion
- Slow motion
- Subtle, minimal movement
- Dynamic, energetic
- Cinematic camera work
What to avoid in early prompts:
- Multiple fast-moving subjects
- Specific sequenced character actions (models struggle with "she waves and then turns around")
- Text overlays (add in post)
- Very long shot descriptions with many simultaneous elements
Step 3: Set Duration and Aspect Ratio
Duration: 5 seconds is the best starting point. It is enough to see the motion quality and evaluate whether the generation is working. Longer clips take more time to generate and cost more credits — generate at 5 seconds first, then extend if the result is strong.
Aspect ratio:
- 16:9 — YouTube, horizontal social content, website use
- 9:16 — Reels, TikTok, Shorts, Stories
- 1:1 — Some Instagram formats, square display contexts
Set aspect ratio before generating — as with images, the proportions are locked after generation.
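Because the ratio cannot be changed after generation, it helps to decide it from the target platform up front. A minimal lookup sketch, using the platform-to-ratio mapping from the list above (the dictionary keys are illustrative):

```python
# Platform -> aspect ratio, per the guidance above
ASPECT_RATIOS = {
    "youtube": "16:9",
    "website": "16:9",
    "reels": "9:16",
    "tiktok": "9:16",
    "shorts": "9:16",
    "stories": "9:16",
    "instagram_square": "1:1",
}

def aspect_ratio_for(platform):
    """Return the aspect ratio for a target platform.

    Raises KeyError for unknown platforms so the mistake surfaces
    before generation, not after the proportions are locked.
    """
    return ASPECT_RATIOS[platform.lower()]
```

Failing loudly on an unknown platform is deliberate: a wrong ratio costs a full regeneration.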
Step 4: The Image-to-Video Workflow (Recommended for Beginners)
This workflow produces more consistent results than pure T2V for most use cases:
1. Generate a base image with Flux 2 or Midjourney at the same aspect ratio you want for the video. Make sure the composition, lighting, and subject are exactly what you want for the first frame.
2. Open image-to-video mode in Cliprise. Upload your generated image as the starting frame.
3. Write the motion prompt — describe only what changes from the starting frame. The model already knows what the image looks like. Your prompt describes what happens next.
4. Generate at 5 seconds first. Review the clip — is the motion natural? Does the camera move as expected? Is the subject behaving the way you intended?
5. If good: download, or extend to a longer duration.
6. If not: adjust the motion description. The most common issue is too much complexity; simplify to one primary motion and regenerate.
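The I2V settings from this workflow can be gathered into a single request payload. The field names below are hypothetical (Cliprise's actual API, if it exposes one, may differ); the sketch only shows how the workflow's parameters fit together:

```python
def build_i2v_request(image_path, motion_prompt, duration_s=5, aspect_ratio="16:9"):
    """Assemble a hypothetical image-to-video request payload.

    Field names are illustrative, not a real Cliprise API.
    Defaults follow the workflow above: generate at 5 seconds first,
    extend only once the result is strong.
    """
    if not 5 <= duration_s <= 15:
        raise ValueError("duration_s should be 5-15 seconds")
    return {
        "mode": "image_to_video",
        "start_frame": image_path,        # the uploaded starting image
        "prompt": motion_prompt,          # describe only what changes from the frame
        "duration": duration_s,
        "aspect_ratio": aspect_ratio,     # must match the base image's ratio
    }
```

Keeping the motion prompt as a separate field mirrors step 6: when a generation fails, it is usually the only value you need to change before regenerating.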
See Image-to-Video Workflow: Complete Guide →
Step 5: Reviewing and Iterating
Watch the generated clip at full quality before downloading. Check:
- Motion naturalness — does movement look physically plausible?
- Subject consistency — does the subject maintain its appearance through the clip?
- Camera movement — does it behave as you described?
- End frame — AI video can degrade in quality in the last 1–2 seconds. Check the full clip, not just the first frame.
Common issues and fixes:
| Issue | Fix |
|---|---|
| Subject morphs or changes appearance | Simplify the motion — less movement usually means more stability |
| Unnatural physics (objects float or defy gravity) | Switch to Veo 3.1 for physics-heavy scenes |
| Camera movement ignores the prompt | Be more specific: "camera orbits clockwise" vs "camera moves" |
| Clip degrades in final seconds | Trim the last 0.5–1 second in CapCut |
| Motion is too fast or too slow | Add "slow and deliberate" or "dynamic" to the prompt |
Step 6: Post-Production in CapCut
Raw AI video clips are ingredients, not finished content. In CapCut:
- Trim any weak frames at the start or end
- Add narration with ElevenLabs TTS audio
- Add background music at low volume
- Add text overlays or subtitles
- Colour grade with LUTs for consistency across clips
- Combine multiple clips into a complete sequence
Most professional AI video content on social media is 3–8 clips assembled in CapCut, not a single generated video.
Note
22+ video models on Cliprise. From Kling 3.0 at 4K to Seedance 2.0 with audio sync — all from one subscription. Try Cliprise Free →
Related Articles
Go deeper:
- AI Video Generation 2026: 22+ Models, Workflows →
- Image-to-Video Workflow: Complete Guide →
- Motion Control Mastery: Camera Angles & Movement →
- Best AI Video Models on Cliprise 2026 →
Model-specific guides:
- Kling 3.0 Complete Guide →
- Veo 3.1 Complete Tutorial →
- Seedance 2.0 Guide: Audio Sync →
- Wan 2.6 Complete Guide →
Comparisons: