============================================================
nat.io // BLOG POST
============================================================
TITLE: Storyboarding with ControlNet: Escaping the Portrait Trap
DATE: February 2, 2026
AUTHOR: Nat Currier
TAGS: AI, Image Generation, Tutorial
------------------------------------------------------------

> If you can't sketch it, you can't prompt it.

We all know the "AI Stare": that default medium-closeup portrait of a beautiful face looking directly into the camera. It's technically perfect and completely boring. The reason AI models default to it is simple: their training data is full of portraits. If you want to create a real scene (a fight, a chase, a complex interaction), you have to fight the model's lazy tendencies. You need **ControlNet**.

If you are unfamiliar with the stack, start here:

- [ControlNet paper](https://arxiv.org/abs/2302.05543)
- [ControlNet docs in Diffusers](https://huggingface.co/docs/diffusers/main/en/using-diffusers/controlnet)
- [OpenPose project](https://github.com/CMU-Perceptual-Computing-Lab/openpose)

The short version is not glamorous, but it works. Text prompting alone is weak for precise action geometry. Treat pose, edge, and depth controls as shot-planning tools, not optional extras. Start with rough storyboards, lock readability first, then escalate quality. Better models in 2026 still need structural constraints if you want dynamic composition to be repeatable.

[ The Problem with Text Prompting for Action ]
------------------------------------------------------------

Try a text-only prompt for a specific action:

*"A man jumping over a gap between two buildings, legs extended, arms back, low angle."*

You might get lucky. But more likely you'll get a man floating weirdly in the air, or a close-up of a face with a building in the background. Text is terrible at describing *geometry* and *physics*.
[ Plain-English Glossary ]
------------------------------------------------------------

Quick definitions for common action-composition terms:

- **Pose conditioning:** giving the model a body-skeleton target.
- **Canny edge map:** a simplified outline of the major edges in an image.
- **Depth conditioning:** near/far spatial guidance that preserves volume.
- **Silhouette readability:** whether the action is understandable at a glance.
- **Blocking:** where characters and objects are placed in relation to the camera.

These are not advanced tricks. They are basic directing tools translated into AI workflows.

[ The Solution: ControlNet OpenPose ]
------------------------------------------------------------

ControlNet is an extension for Stable Diffusion that lets you condition generation on an input image. The most powerful module for action is **OpenPose**. It extracts a "skeleton" from a reference image and forces the AI to map the character onto that exact skeleton.

> Step 1: Find or Create a Reference

You don't need a high-quality image. You just need the *pose*.

* **Google Images:** Search for "parkour jump reference," "martial arts kick," "climbing pose."
* **3D Posers:** Use a tool like [MagicPoser](https://magicposer.com/) or [Daz3D](https://www.daz3d.com/) to pose a dummy exactly how you want.
* **Your Own Body:** Set a timer on your phone and act it out yourself. (No one has to see the original photo!)

> Step 2: The "Canny" Sketch Trick

For even more control than the skeleton alone, use the **Canny** or **Lineart** preprocessor. Sketch the composition on a napkin. Draw stick figures. Draw squares for buildings. Scan it. Feed this crude sketch into ControlNet Canny. The AI will use your lines as the "edges" of the final image, turning your scribble into a finished frame while keeping your exact composition.
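To build intuition for what an edge map gives the model, here is a toy sketch in pure Python. This is *not* the real Canny algorithm (which adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding; in practice you would use a preprocessor such as `cv2.Canny`): it simply flags pixels where local intensity change exceeds a threshold.

```python
# Toy edge map: flags pixels whose horizontal or vertical intensity
# gradient exceeds a threshold. A real Canny preprocessor layers
# smoothing, non-maximum suppression, and hysteresis on top of this.

def edge_map(img, threshold=50):
    """img: 2D list of grayscale values 0-255. Returns a 0/1 edge mask."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]   # horizontal gradient
            gy = img[y + 1][x] - img[y][x]   # vertical gradient
            if abs(gx) > threshold or abs(gy) > threshold:
                edges[y][x] = 1
    return edges

# A crude "napkin sketch": a dark square (a building) on a light page.
sketch = [[0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        sketch[y][x] = 255

mask = edge_map(sketch)  # only the square's outline survives
```

The point of the exercise: ControlNet Canny never sees your shading or rendering quality, only this outline, which is why a crude sketch is enough to lock composition.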
If you need implementation details, the [Diffusers ControlNet guide](https://huggingface.co/docs/diffusers/main/en/using-diffusers/controlnet) includes practical examples.

> Step 3: Add Depth for Better Spatial Readability

OpenPose handles limbs; depth handles volume. If your action shot has jumps, climbing, or multi-plane movement, adding depth guidance prevents flat "paper cutout" results. It also helps preserve believable distances between foreground, subject, and background.

> Step 4: Sequence Before Polish

Do not perfect frame one before knowing frames two and three. Build a micro-sequence:

1. setup (anticipation)
2. action peak
3. consequence/recovery

This is how storyboards create momentum instead of isolated cool images.

[ Gallery of Dynamic Action ]
------------------------------------------------------------

Here are three examples of high-octane action shots that would be nearly impossible to get with text prompting alone. These were controlled with OpenPose and Canny edge detection to force the specific perspective and limb placement.

[Image gallery: 3 related images are displayed with captions.]

[ Prompt Pattern For Action Storyboards ]
------------------------------------------------------------

Use this structure:

`[shot size], [camera angle], [pose intent], [movement direction], [environment constraints], [lighting], [impact cue]`

Example:

`wide shot, low angle, parkour leap with extended trailing leg, left-to-right motion across rooftop gap, foreground railing and distant skyline, overcast dusk with practical city lights, rain spray kicked from shoe sole`

This keeps prompts short while preserving geometry-critical information.

[ 30-Minute Action Storyboard Drill ]
------------------------------------------------------------

If you are a beginner, run this exercise:

1. Pick one action (jump, kick, climb, fall, dodge).
2. Create a three-frame sequence: setup, peak action, aftermath.
3. Build simple stick-figure pose references for each frame.
4. Generate each frame with matching lens language.
5. Compare all three and fix silhouette/readability issues first.

This single drill teaches more than hours of random prompt experimentation.

[ Why Action Frames Fail Even With Good Pose Data ]
------------------------------------------------------------

A correct skeleton is necessary but not sufficient. Action can still look fake when:

- the center of gravity is impossible
- contact points (feet, hands, walls) do not align with surfaces
- motion direction conflicts with camera perspective
- scene scale cues are missing

If this happens, adjust environment lines and camera language before changing the pose.

[ Action Prompt Skeleton ]
------------------------------------------------------------

For repeatable results, use this baseline:

`[shot size], [camera angle], [pose reference], [movement direction], [contact points], [environment lines], [lighting], [impact detail]`

Example:

`medium-wide shot, low angle, openpose reference of flying side kick, right-to-left motion, support foot planted on wet concrete, diagonal railing lines guiding eye toward target, overhead sodium-vapor streetlight, water spray at point of impact`

[ Directing, Not Just Generating ]
------------------------------------------------------------

When you use ControlNet, you stop rolling the dice and start directing. You decide where the hand goes. You decide the angle of the leg. You decide the horizon line. It adds steps, but those steps are the difference between "making an image" and "telling a story."

[ Common Failure Modes ]
------------------------------------------------------------

The most common failure patterns are:

- **Pose-only dependence:** correct limbs, broken environment perspective.
- **No silhouette check:** action looks complex but reads as visual clutter.
- **Over-detail too early:** style polishing before staging is stable.
- **Inconsistent camera language:** every frame feels like a different film.
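The prompt skeleton above is easy to keep consistent across a sequence with a small helper that enforces one fixed field order per frame. This is just an illustration (the field names mirror the skeleton; nothing here is part of any tool's API):

```python
# Assemble an action-storyboard prompt from the skeleton fields above.
# Keeping one fixed field order across all frames makes a sequence
# easier to compare and debug than free-form prompts.

FIELDS = [
    "shot_size", "camera_angle", "pose_reference", "movement_direction",
    "contact_points", "environment_lines", "lighting", "impact_detail",
]

def build_prompt(**kwargs):
    """Join the skeleton fields in canonical order; fail on gaps."""
    missing = [f for f in FIELDS if f not in kwargs]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ", ".join(kwargs[f] for f in FIELDS)

prompt = build_prompt(
    shot_size="medium-wide shot",
    camera_angle="low angle",
    pose_reference="openpose reference of flying side kick",
    movement_direction="right-to-left motion",
    contact_points="support foot planted on wet concrete",
    environment_lines="diagonal railing lines guiding eye toward target",
    lighting="overhead sodium-vapor streetlight",
    impact_detail="water spray at point of impact",
)
```

Raising on missing fields is deliberate: a frame that silently drops its contact points or environment lines is exactly the frame that comes back with broken physics.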
[ A Practical Production Loop ]
------------------------------------------------------------

Use this loop for action sequences:

1. rough stick storyboard (3-6 frames)
2. OpenPose pass for body geometry
3. Canny/Lineart pass for composition and world lines
4. depth-assisted refinement for volume
5. final styling only after staging is locked

This order keeps you from wasting time polishing frames that fail as storytelling.

[ Tooling Notes For 2026 Workflows ]
------------------------------------------------------------

You can run this process in several ecosystems:

- Stable Diffusion + ControlNet for explicit structural control
- Midjourney for fast visual ideation and style iteration ([docs](https://docs.midjourney.com/hc/en-us))
- OpenAI GPT Image workflows for iterative conversation-driven edits ([docs](https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1))

Each tool has different strengths. For action staging, structural controls usually matter more than style controls.

[ Project Links For Practical Use ]
------------------------------------------------------------

Use these links if you want to go deeper on implementation details:

- [ControlNet GitHub](https://github.com/lllyasviel/ControlNet)
- [ControlNet paper](https://arxiv.org/abs/2302.05543)
- [Diffusers ControlNet docs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/controlnet)
- [OpenPose GitHub](https://github.com/CMU-Perceptual-Computing-Lab/openpose)
- [Midjourney docs](https://docs.midjourney.com/hc/en-us)
- [OpenAI image guide](https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1)

[ Keep A Lens Language Across The Sequence ]
------------------------------------------------------------

Pick a narrow lens kit for one action scene (for example: 24mm for wide motion plus 50mm for medium coverage). If each frame uses a random implied lens, your sequence feels visually disconnected even when the poses are good.
Lens continuity is one of the easiest upgrades for perceived production quality.

[ Practical Quality Gate ]
------------------------------------------------------------

Before finalizing an action frame, check:

1. Can someone understand the action in one second?
2. Is the center of gravity believable?
3. Does the camera angle amplify the action rather than flatten it?
4. Do environment lines support the motion direction?
5. Does the next frame feel implied?

If any answer is "no," iterate on staging first, not texture. That discipline keeps action readable and saves significant production time.
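The five checks above can be tracked as a simple gate before a frame moves on to style polish. A minimal sketch (a hypothetical helper, not part of any tool):

```python
# Quality gate for an action frame: every check must pass before style
# polishing begins. Returns the failed check names so you know exactly
# what to fix in staging.

GATE_CHECKS = [
    "reads_in_one_second",
    "center_of_gravity_believable",
    "camera_angle_amplifies_action",
    "environment_lines_support_motion",
    "next_frame_implied",
]

def failed_checks(answers):
    """answers: dict mapping check name -> bool. Missing = failed."""
    return [c for c in GATE_CHECKS if not answers.get(c, False)]

# Reviewing one frame of a sequence:
frame_review = {
    "reads_in_one_second": True,
    "center_of_gravity_believable": True,
    "camera_angle_amplifies_action": False,  # eye-level shot flattened it
    "environment_lines_support_motion": True,
    "next_frame_implied": True,
}

todo = failed_checks(frame_review)  # iterate on staging for these first
```

Treating an unanswered check as a failure keeps the gate honest: a frame nobody evaluated is not a frame that passed.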