What I learned about AI character consistency from making a cartoon say things I think
I run a YouTube channel called Dev Life with Uche. The host is a cartoon character. He wears a purple hoodie and a black cap, has a goatee and a slightly raised eyebrow, and breaks down the parts of startup engineering that nobody talks about honestly: why your agile process is really just waterfall, what your PM means by "quick fix", why tech debt never gets paid down. The character is an editorialised version of me, more or less. Same opinions, same decade of shipping things, different wardrobe.
The whole channel is automated. Scripts come from Claude. Voice comes from ElevenLabs. Images come from a generative model. Video assembly is FFmpeg. I publish three long-form videos a week with shorts drip-scheduled between them, and the human-in-the-loop time per video is somewhere between five and ten minutes of skimming the generated images for quality. The pipeline works. But getting it to work took two weeks I didn't expect to spend on a single problem.
The problem was character consistency. The cartoon Uche has to look like the same person across twenty-five images per video, watched in sequence. If his hoodie shifts from purple to magenta between cuts, if his face changes shape, if his goatee disappears, the whole illusion of authorship collapses. The character isn't just decoration. He's the brand. Brands don't drift between frames.
This post is about the two weeks I spent fighting that drift, the three approaches I tried, and the architectural distinction that finally fixed it. The short version: the community recommendation for character consistency is to train a LoRA, and that recommendation is wrong for the use case I'm describing. LoRAs are the wrong tool for character consistency in multi-character scenes; reference-image generation is the right one. The failure mode is specific enough to be worth documenting.
What the channel actually needs
Before I talk about what didn't work, let me describe what consistency means here, because the requirements are stricter than people assume.
Each video has 20 to 25 generated images. They're scene-changes, not slow pans across one image. A video about agile dysfunction might have Uche at his desk, then in a planning meeting with three other engineers, then at a whiteboard, then on a call with a PM, then walking out of the office at 9pm. The character has to be recognisably the same person in all of those scenes. The same skin tone, the same hoodie colour, the same face shape, the same expression vocabulary. Other people in the scenes need to look like not him, with different ethnicities, different clothes, different builds. Especially that last part.
That's the bar. It sounds reasonable. It is, in fact, very hard.
Topic from calendar
      │
      ▼
┌───────────┐   ┌───────────┐   ┌────────────┐
│  Claude   │──→│ ElevenLabs│──→│ Nano Banana│
│  Script   │   │ Erion 1.2x│   │ 25 imgs+ref│
└───────────┘   └───────────┘   └─────┬──────┘
                                      │
      ┌───────────────────────────────┘
      ▼
┌───────────┐   ┌───────────┐   ┌────────────┐
│  FFmpeg   │──→│  Pillow   │──→│ YouTube API│
│ Ken Burns │   │ Thumbnail │   │ Tue/Thu/Sat│
└───────────┘   └───────────┘   └─────┬──────┘
                                      │
                                      ▼
                                ┌────────────┐
                                │   Shorts   │
                                │ 5-7 per vid│
                                └────────────┘

Approach 1: Imagen with a reference sheet
The first thing I tried was the same approach my other automated channel (The Narrow Path) uses: Gemini Imagen 4, with character_sheet.png provided as a reference. The character sheet is a single image showing front and back views of the character in a neutral pose, with all the design choices locked: face shape, hair, hoodie colour (#7B3FE4 to #9B5CFF, neon purple), cap, sneakers.
For The Narrow Path, this works fine. The Narrow Path's videos are about Christian wisdom, the imagery is historical or symbolic, and the human figures are different in every video by design. There's no recurring character to keep consistent.
Dev Life is the opposite. There is a recurring character, and Imagen treated character_sheet.png as a hint rather than as an identity anchor. The hoodie shifted between scenes, sometimes magenta, sometimes pink, sometimes a colour I'd describe as "violet that has given up." The face drifted across frames. The goatee came and went. In a 25-image video, you'd see five or six images that looked like the same person and twenty that looked like cousins.
This is the failure mode that drove me to look at LoRA training in the first place. If reference images aren't strong enough, the standard recommendation is to fine-tune the model on the character. So I did.
Approach 2: Flux LoRA on fal.ai
A LoRA, Low-Rank Adaptation, is a small fine-tune that you train on top of a base model to teach it a specific concept. You feed it a few dozen images of the thing you want it to remember, train for a thousand steps, and the resulting LoRA can be activated at inference time with a trigger word like "uchedev". The community wisdom is that LoRAs are how you get reliable character consistency.
I trained one on fal.ai for $2. The training data was 25 images I'd generated through Google Whisk, which was producing the consistency I wanted manually but didn't have an API. One thousand steps, trigger word "uchedev", flux-lora-fast-training endpoint. Deployment took twenty minutes.
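For the record, the training run itself was a single fal_client call, roughly like the sketch below. The parameter names and the result key are what I recall the flux-lora-fast-training endpoint using, and the zip URL is a placeholder, so treat this as an approximation and check fal.ai's current schema before reusing it.

import fal_client

# Roughly how the $2 training run was submitted. The zip URL is a placeholder
# pointing at the 25 Whisk-generated training images; parameter names may have
# drifted since I ran this, so verify against the current endpoint schema.
result = fal_client.subscribe(
    "fal-ai/flux-lora-fast-training",
    arguments={
        "images_data_url": "https://storage.example.com/uche_training_set.zip",
        "trigger_word": "uchedev",
        "steps": 1000,
    },
)

# The returned weights URL is a temporary CDN link (more on that shortly).
print(result["diffusers_lora_file"]["url"])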
The first single-character test image was a revelation. The LoRA produced a near-perfect rendering of the character, same face, same hoodie, same expression vocabulary. Better than anything Imagen had managed. I generated a second test, then a third. All clean. I started rebuilding the pipeline to use the LoRA.
Then I generated a meeting scene.
The meeting room had Uche, a senior PM, two other engineers, and a designer. Five people, supposed to be a multi-ethnic group with different clothes, different hair, different builds. What I got was five identical Uches sitting around a table.
The LoRA had done exactly what I trained it to do, which was to map "uchedev" to a specific visual identity. But the scope of the mapping wasn't "the character named Uche should look like this." It was "this is what humans look like in this domain." The LoRA had no concept of "uchedev only applies to one person in the scene." Every face it generated, it generated as Uche, because that's the identity it had learned.
I tried reducing the LoRA scale. Scale 1.0 produced full cloning. Scale 0.65 fixed the cloning but lost the character entirely, the face changed, the skin tone lightened, the hoodie became some indeterminate purple-adjacent colour. Scale 0.9 was the best compromise but still fragile, particularly in scenes with women. The LoRA had a tendency to render Uche as female when there were other women in the frame. I'm not entirely sure why. Maybe the training set didn't have enough examples of him alongside women, and the model decided gender was something that flowed locally rather than being an attribute of the trained character.
I added explicit gender enforcement to every prompt: "uchedev must always be depicted as male, with visible facial hair, masculine build." This reduced the gender drift but didn't eliminate it. The LoRA was fighting the prompt rather than cooperating with it.
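Mechanically that enforcement was nothing cleverer than a fixed prefix stitched onto every scene prompt; a trivial sketch (the function name is illustrative, not from the pipeline):

# Fixed prefix prepended to every scene prompt during the LoRA experiments.
GENDER_GUARD = (
    "uchedev must always be depicted as male, with visible facial hair, "
    "masculine build. "
)

def build_lora_prompt(scene_description: str) -> str:
    return GENDER_GUARD + scene_description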
The other practical issue: fal.ai's CDN URLs for the trained LoRA weights are temporary. They expire. I had to upload the weights to persistent storage and refer to them through a stable URL, which worked but added a small layer of operational fragility I didn't love.
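The workaround was mundane: download the weights once and re-upload them somewhere permanent. A sketch of what that looks like, assuming a Google Cloud Storage bucket; the bucket name and object path are made up, and any object store that serves stable public URLs would do the same job:

from pathlib import Path

import requests
from google.cloud import storage  # assumes google-cloud-storage is installed and authenticated

def persist_lora_weights(temporary_cdn_url: str, bucket_name: str = "devlife-assets") -> str:
    """Copy the trained LoRA weights from fal's expiring CDN URL to a stable one."""
    local = Path("uchedev_lora.safetensors")
    local.write_bytes(requests.get(temporary_cdn_url, timeout=60).content)

    blob = storage.Client().bucket(bucket_name).blob("loras/uchedev_lora.safetensors")
    blob.upload_from_filename(str(local))

    # Assumes the object is publicly readable (or swap in a signed URL); this stable
    # address is what config.FAL_LORA_URL pointed at in the retired pipeline.
    return f"https://storage.googleapis.com/{bucket_name}/{blob.name}"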
I spent about a week trying to make the LoRA approach work. Most of that time was spent in the prompt-engineering branch of the failure, trying to find the magic incantation that would make the LoRA preserve Uche's identity without applying it to everyone else. I never found it. I'm now reasonably confident no such incantation exists, because the architectural problem is real: a LoRA learns a concept, and "this is what a person looks like in this context" is a concept the model can't easily distinguish from "this specific person looks like this."
Approach 3: Nano Banana Pro
The pivotal moment was a manual test in Google Whisk. Whisk's interface lets you provide a reference image and a prompt, and the underlying model treats the reference image as a subject to preserve rather than as a style hint. I gave it character_sheet.png and asked for a meeting room scene. The output was clean: Uche on the left, three other people who looked like not Uche, all distinct, all matching the prompt's diversity requirements.
Whisk was great. Whisk was also shutting down on April 30, 2026, and didn't have an API anyway. The question became whether there was a programmatic way to get the same behaviour. The answer turned out to be Google's Nano Banana Pro (gemini-3-pro-image-preview), which is a version of the same generative stack with API access. It accepts a reference image alongside the prompt and treats the reference as a subject to preserve.
I rebuilt the pipeline against Nano Banana Pro. The first meeting scene came out looking exactly like the Whisk test: Uche in his purple hoodie on the left, a blonde woman in a navy blazer next to him, an East Asian man with glasses across the table, a South Asian woman in a white blouse at the head. Every character distinct. No cloning. No purple hoodie bleed onto extras. Cost: $0.134 per image.
I shipped it. The LoRA and its associated infrastructure went into a legacy/ directory and the Flux pipeline was retired.
Why this distinction matters
The architectural difference between a LoRA and a reference-image approach sounds subtle but it's load-bearing.
A LoRA modifies the model's weights to learn a concept. When you ask it to generate a person, the modified weights bias the output toward the trained concept. The model has no notion of which person in the generated image should match the trained concept. It just knows that "person, in this domain, tends to look like this." That's why the cloning happens. The trained concept is global to the generation, not scoped to a particular subject.
A reference-image approach, in contrast, gives the model a specific input it should preserve and a separate prompt it should follow for everything else. The reference image and the prompt occupy different conceptual slots in the generation. Subject identity is tied to the reference; everything else flows from the prompt. The model can generate diverse extras because the diversity is in the prompt, and it can preserve the subject because the subject is in a different channel of the input.
If you read this and think "well, obviously, that's how it should work", yes. In retrospect it's obvious. But LoRA-as-character-consistency-tool is the dominant pattern in community guides and in stock-photo and character-art workflows, where it works great because those are single-character cases. The multi-character case is where it falls down, and most guides don't mention it because they assume the use case is "I want this character to be the focus of every image."
For an automated channel where the host is the focus of some images and one of several characters in others, you need an approach that can scope identity to a single subject. LoRA can't. Reference-image can.
# pipeline/image_gen.py — Flux LoRA approach (retired)
from pathlib import Path

import fal_client
import requests

import config  # project settings; FAL_LORA_URL points at the persisted LoRA weights

LORA_SCALE_SOLO = 1.0
LORA_SCALE_MULTI = 0.9

# heuristic to detect multi-character scenes so we can soften the LoRA's grip
MULTI_CHARACTER_KEYWORDS = [
    "colleague", "team", "meeting", "group", "boss", "pm", "junior",
]

def _lora_scale_for_prompt(prompt: str) -> float:
    # prompts are lowercased before matching, so keywords are lowercase too
    if any(kw in prompt.lower() for kw in MULTI_CHARACTER_KEYWORDS):
        return LORA_SCALE_MULTI
    return LORA_SCALE_SOLO

def generate_image(prompt: str, output_path: Path) -> Path:
    scale = _lora_scale_for_prompt(prompt)
    result = fal_client.subscribe("fal-ai/flux-lora", arguments={
        "prompt": prompt,
        "loras": [{"path": config.FAL_LORA_URL, "scale": scale}],
        "image_size": "landscape_16_9",
        "num_images": 1,
    })
    output_path.write_bytes(requests.get(result["images"][0]["url"]).content)
    return output_path
# Example prompt for a meeting scene:
#
# uchedev, male character with visible goatee, plain neon purple hoodie,
# matte black cap, standing outside a glass meeting room watching a
# retrospective. Through the glass: a blonde woman in a beige blouse, an
# East Asian man in a grey polo, a redhead in a black cardigan, a bald
# man in a dark suit. Only ONE person in this scene wears a purple hoodie
# and black cap. All other people have completely different appearances.
# pipeline/image_gen.py — Nano Banana Pro approach (current)
from pathlib import Path

from google import genai
from google.genai import types

import config  # project settings; CHARACTER_REF_PATH points at character_sheet.png

MODEL = "gemini-3-pro-image-preview"

_client = genai.Client()  # reads the Gemini API key from the environment

def generate_image(prompt: str, output_path: Path) -> Path:
    char_ref = config.CHARACTER_REF_PATH.read_bytes()
    response = _client.models.generate_content(
        model=MODEL,
        contents=[
            types.Content(parts=[
                # reference image and scene prompt occupy separate input slots
                types.Part.from_bytes(data=char_ref, mime_type="image/png"),
                types.Part.from_text(text=prompt),
            ])
        ],
        config=types.GenerateContentConfig(response_modalities=["IMAGE", "TEXT"]),
    )
    # the response interleaves text and image parts; keep the first image
    for part in response.candidates[0].content.parts:
        if part.inline_data and part.inline_data.mime_type.startswith("image/"):
            output_path.write_bytes(part.inline_data.data)
            return output_path
    raise RuntimeError("no image part returned")
# Example prompt for the same scene:
#
# Generate an image based on the character in the reference image. The main
# character stands outside a glass meeting room watching a retrospective.
# Through the glass: a blonde woman in a beige blouse, an East Asian man
# in a grey polo, a redhead in a black cardigan, a bald man in a dark suit.
# Cel-shaded cartoon style, 16:9 cinematic.

The visible difference between the two functions tells the story. The LoRA approach has a heuristic for detecting multi-character scenes, two scale constants, a keyword list of meeting-related terms, and a prompt that ends with capitalised pleading about only one person wearing the hoodie. The Nano Banana Pro approach has none of that. The character reference goes into one input slot, the scene prompt goes into another, and the model handles the rest.
The numbers, before and after
The full Dev Life with Uche pipeline currently costs about $50-60/month at twelve long-form videos plus around fifty shorts, of which the image generation alone is around $40-50. Per-video, the image generation is roughly $4 to $7 depending on shorts count.
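The arithmetic behind that is simple enough to show; the extra dollar or two per video in the $4 to $7 range comes from additional images generated for shorts, which I haven't broken out here:

COST_PER_IMAGE = 0.134            # Nano Banana Pro, per generated image
IMAGES_PER_LONG_FORM = 25
LONG_FORM_PER_MONTH = 12

per_video = IMAGES_PER_LONG_FORM * COST_PER_IMAGE
per_month = LONG_FORM_PER_MONTH * per_video

print(f"Base images per long-form video: ${per_video:.2f}")   # ≈ $3.35 before shorts extras
print(f"Long-form images per month:      ${per_month:.2f}")   # ≈ $40.20, most of the bill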
The week I spent on the LoRA approach cost about $12-15 in training and test images, which is an underestimate of the real cost because it doesn't capture the time. The two weeks I spent fighting consistency end-to-end, across all three approaches, is the more honest number. Two weeks of debugging on a side project to discover that the architectural assumption was wrong.
The post-fix pipeline produces about 25 character-consistent images per video at $0.134 each, with a 5 to 7 minute human review window per video where I scroll through and flag any obvious issues. The flag rate is low. Most videos go through without manual intervention. The character looks like the same person across every frame. Other people in scenes look like not him.
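The review step is literally scrolling a folder of images, which goes faster as a single contact sheet. This isn't part of the pipeline proper, but a small Pillow helper along these lines is what the five-minute skim looks like in practice (the grid size and the scene_*.png naming are illustrative):

from pathlib import Path

from PIL import Image

def contact_sheet(image_dir: Path, out_path: Path, cols: int = 5, thumb_w: int = 320) -> Path:
    """Paste every scene image into one grid for a quick consistency skim."""
    images = sorted(image_dir.glob("scene_*.png"))
    thumb_h = thumb_w * 9 // 16                          # scenes are 16:9
    rows = (len(images) + cols - 1) // cols
    sheet = Image.new("RGB", (cols * thumb_w, rows * thumb_h), "white")
    for i, path in enumerate(images):
        img = Image.open(path).convert("RGB").resize((thumb_w, thumb_h))
        sheet.paste(img, ((i % cols) * thumb_w, (i // cols) * thumb_h))
    sheet.save(out_path)
    return out_path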
What I'd take from this
Three things, the third of which is the one I'd most want another engineer to know.
The first is that I should have tested Nano Banana Pro before training the LoRA. The Whisk results were available the whole time, the underlying capability was visible, and I went down the LoRA path because LoRA training was the recommended approach in the community guides I read. The lesson is that "industry standard" workflows are calibrated to the median use case, and your use case might not be the median. Test the cheap option first.
The second is that the art style prompt matters more than the model. Switching from "minimalist flat cartoon, clean vector art, minimal shading" to "detailed cel-shaded cartoon illustration with realistic lighting, volumetric shading, ambient occlusion in creases and folds" produced a bigger quality jump than switching from Imagen to Nano Banana Pro did. If your generated images look generic, the answer is more often a better art-direction prompt than a different model.
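One way to keep that art direction from drifting per image is a shared suffix appended to every scene prompt, so the style decision lives in one place; a minimal sketch (the constant and function names are mine, the wording is the prompt quoted above):

# Shared art direction appended to every scene prompt so the style stays fixed.
ART_DIRECTION = (
    "Detailed cel-shaded cartoon illustration with realistic lighting, "
    "volumetric shading, ambient occlusion in creases and folds. 16:9 cinematic."
)

def styled_prompt(scene_description: str) -> str:
    return f"{scene_description} {ART_DIRECTION}"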
The third is that LoRAs and reference-image approaches solve different problems and the difference is architectural, not just operational. A LoRA learns a concept; a reference image preserves a subject. If your use case has multiple humans in a single scene and you need one specific human to be consistent while others are diverse, you need the second approach, not the first. The community wisdom hasn't caught up to this yet because the dominant LoRA use cases (anime characters, single-portrait styles, branded mascot art) don't have the multi-character requirement. Mine does. Yours might too.
The cartoon Uche is now stable across every frame of every video. He shows up, says things I think, and looks like the same person. The pipeline takes me five to ten minutes per video. I get to ship three videos a week. The most expensive lesson in the project was that the standard answer was the wrong answer for my problem, and the right answer was sitting in a manual tool the whole time.