Enterprise AI

Agentic Coding in Practice: 10 Steps with Failsafes that Get AI Agents to Actually Ship Code

Agentic coding without the hallucination drama: session-start, 10 disciplined steps, failsafes per step — the XALT workflow for reproducible AI coding results.

Jürgen von Hirschheydt·15.05.2026·11 min

Isometric illustration of an agentic-coding workflow: a glowing mint-teal path winding from bottom-left to top-right through multiple validation checkpoints, ending at a stylized AI-agent cube.

You know the scene: you're sitting with an AI coding agent on a bug, the agent confidently writes an elegant solution — and on push you notice it hallucinated the state of the repo. The branch doesn't exist, the referenced PR was merged days ago, the assumption about the database has been wrong for two weeks. Welcome to the default failure mode of naive agentic-coding workflows.

At XALT we've been building productively with coding agents for months. A working process has crystallized that's robust enough to actually produce code-in-production — not just prototypes. I'm deliberately writing "produce" because it best describes how code actually comes into being this way: controlled assembly-line work with clear validation gates, not improvisation jazz.

In this article I'll show you the complete workflow: an upstream session-start and ten steps, each with a documented failsafe. The pattern behind it: validate, don't assume — at every point where the agent is inclined to guess.

The problem: why naive agentic-coding workflows fail

Most teams use AI coding tools today like a faster autocomplete: prompt in, diff out, merge. That works as long as the problem is small and no state is in play. As soon as branches, PRs, databases, or external services are involved, the agent starts guessing — and guessing means hallucinating.

Typical anti-patterns we see regularly in code reviews:

State is assumed, not verified. The agent says "PR is open" even though it was merged hours ago. Nobody ran gh pr view.
Identifiers get paraphrased. customer_id becomes customerId; the branch name feature/billing-v2 becomes feat/billing-2. Doesn't work.
Build green = done. Pre-merge verification is limited to npm run build, with no tests, smoketests, or real DB and server checks.
Reply text as ground truth. "Done. The service is now responding correctly." — written by the agent without it having looked at a single server log.
Memory leakage between sessions. The agent starts every session from zero instead of maintaining daily logs, ADRs, and principles.
Stack PRs without discipline. Three open pull requests, all built on top of each other, all "almost done" — and one review comment rips the stack apart.

This isn't anecdotal. In the Stack Overflow Developer Survey 2024, 76% of developers said they use or want to use AI tools — but only 43% trust the correctness of the outputs. That trust gap is exactly the problem: anyone who wants to use AI code in production needs a workflow that structurally catches hallucinations instead of relying on gut feeling.

The solution: validate-don't-assume as a structural principle

Hallucination is not a bug that will eventually get fixed. It's a systemic property of every tool-using LLM agent — the agent generates plausible tokens, not verified facts. The only robust lever is the workflow it's embedded in. If you prioritize context over fiddling with the prompt, you'll recognize the idea from our piece on Agentic AI in the enterprise context (in German).

Our approach has three pillars:

State validation before every step. Before every PR setup, every branch switch, every "this works in prod too" statement, there's a verifying tool call: git fetch, gh pr view, a DB query, a logfile tail.
Failsafes per step, documented. Every workflow step has one or more failsafes that explicitly name where the agent would stumble — and how we prevent it. Failsafes live in principles.md (operational) and lessons.md (anecdotal).
Memory discipline parallel to the work. Daily logs are written during the work, not afterwards. Pre-pause wrap-ups are mandatory. That's the difference between an agent that starts from zero every day and one that works consistently across weeks.

Sounds like overhead? It isn't. Once the steps are in muscle memory, every bugfix cycle runs twenty minutes faster, because you no longer need four iterations to clean up hallucinations.

10 steps to reproducible agentic coding

Session start (before anything else)

Before anything else happens, the agent loads Identity, People, Principles, and Projects in parallel. Each of these four context layers has a clear function:

People: overview of the relevant humans and agents, how the agent interacts with each, and which invariants or special rules apply per person.
Principles: the most important operational rules the agent has learned — especially identifier discipline (verbatim-or-nothing) and hallucination defense.
Projects: which projects are open, each with status, current issues, and what's up next.

Failsafe: Don't skip session start, even when the task seems "obvious". It's exactly on the supposedly obvious tasks that the agent ends up missing the context it needs for clean decisions.

A side note that quickly becomes important: principles.md should be reviewed, consolidated, and compacted every evening. Otherwise the file rapidly grows past the soft cap of around 5 KB, collects prose and historical examples, and loses its function as an operational rulebook. Anything that doesn't belong in the principles moves into daily-logs/, where it persists without bloating the hot context.

Kickoff + context check

You describe the problem. The agent checks scope, looks for existing issues, runs git fetch and gh pr list/view for state verification. Only then does it formulate a problem hypothesis.

This is exactly where the motto validate, don't assume has to be firmly anchored. Not as politeness, but as reflex: no code proposal based on a state assumption that isn't backed by a tool call.

Failsafe: Validate, don't assume — no hypothesis without tool evidence.

Plan proposal + disagree-or-approve

The agent proposes order and scope, with reasoning and alternatives. You say "works", "different, because …" — or propose your own alternatives. The agent is allowed to push back; a real opinion instead of a yes-man echo is explicitly desired.

Failsafe: On reopen-after-tangent (you come back to the original problem after a detour), re-validate the problem shape first — don't start coding immediately. Otherwise the old plan gets squeezed into a new context.

Branch setup

Fresh branch from origin/main — or explicitly from the parent PR if the stack is intentional and documented. One open pull request at a time is the upper bound. For bugfixes, minimal-diff-first applies; refactoring goes into a separate PR, always.

Failsafes: One PR at a time. Minimal diff for bugs. Refactor and fix never in the same branch.

Implementation in small steps

For substantial changes, the ADR or doc update comes first, then the code. During coding, npm run check runs continuously as a fast-feedback loop. We know that principle from the shift-left approach (in German): find errors early, not late.

The verbatim-or-nothing rule is enormously important here — and so central that it sits as a core rule in principles.md. If, for example, a tool call doesn't go through cleanly, a naive agent will happily start hallucinating entire commit hashes or identifiers because the shape is plausible. A disciplined agent, on the other hand, consistently checks every identifier against whether it actually exists, recognizes the hallucination itself — and can even handle it autonomously.

We have our agents persist every such hallucination in a dedicated document. That builds a corpus from which we can later derive systemic measures — new failsafes, sharper prompts, additional validation gates exactly at the points where the agent statistically stumbles.

Failsafe: Verbatim-or-nothing for identifiers. No paraphrasing of variable names, branch names, ticket IDs, API endpoints. When the agent is unsure, it asks back instead of reconstructing.

Pre-merge verification, scaled with the change

Mandatory:

Build OK? (e.g. npm run check)
Tests OK? (e.g. npm run test)
Real DB, container, or SSH check when server state is involved.

Desirable depending on the change:

Smoketests (e.g. npm run test:smoke)
Manual curl or WS-bench run
Browser-based UI check for UI touches

Again the principle: verify, don't assume. This obviously presupposes that the agent was told from the start to write test cases — which should normally already fall out of the acceptance criteria on the issues.

Failsafes: Curl acceptance isn't enough for browser forms. And server logs are ground truth — not the reply text in which the agent writes "the deployment was successful". If you take that seriously, you start thinking early about clean observability (in German).

Commit + push + PR body

Conventional-commit message, no free-form. The PR body follows the structure What / Why / How verified, with bench output or test path as evidence where relevant. For UI touches, "suggested live-test before merge" is included as an explicit step for the reviewer.

At this point it really pays off to install the branch locally. There's no serious excuse for not doing so anymore — not when I can have the AI generate even the installation and configuration script that handles the last manual steps for me. Thirty minutes of setup friction become two commands.

Failsafe: The PR body contains a reproducible test path, not just a promise.

Review and merge — or back to the table

You review and merge. Or you send it back with precise feedback. During the review, the agent is silent or works on unrelated or docs-only topics.

That "downtime" can absolutely be used productively — but careful: every doc change that lives in the repository requires its own PR from main. Normally that doesn't produce merge conflicts with the running review because the paths are disjoint; still, discipline is required, otherwise the calm review slot turns into a second open stack.

Failsafe: No new feature PRs while a review is running. Finish the review, then set up the next stack.

Post-merge hygiene

git pull --ff-only, delete the branch (if the user didn't already do it on merge), update the daily log. If the ticket produced a lesson, it's noted as a candidate in lessons.md. On the next sweep, the agent checks whether the lesson is substantial enough for principles.md — and promotes it accordingly.

This step is often skipped, but it's what keeps the process stable long-term. This is where learned lessons flow back into future tasks. Skip it and you learn nothing — and two weeks later you make the same mistake again.

Failsafe: Hygiene right after the merge, not "later sometime". That "later" is exactly the point at which memory crumbles.

Memory discipline (parallel to everything)

The daily log is kept during the work, not afterwards. Memory sweep at end-of-day or before every longer break. MEMORY.md (or the corresponding database), principles.md, and projects.md are kept current — this isn't optional, this is the mechanism by which the agent works consistently across days.

It pays to be "polite" to your agent here and actively tell it when you're going to lunch or wrapping up for the day. The agent can then automatically start a memory sweep instead of being cut off mid-token-stream. That way you can pick up seamlessly even after longer breaks or a day change — exactly where you left off.

Failsafe: Files survive. The context window doesn't. Text beats memory, every day.

Stop-signal respect

When you give a pause signal — "continue tomorrow", "pizza", "lunch break", "short interruption" — the agent closes out cleanly. Finalize the daily log, mark open threads as "resume here" markers, no further iteration-just-this-one-thing.

Failsafe: End-of-session wrap-up in the daily log instead of trusting "the next session will still know …". It won't. Never.

Chat screenshot: in response to the pause signal "Let's pause here", the coding agent answers with a clean wrap-up — PR #109 merged, ADR-0024 archived, issue #110 marked as an open discovery thread, daily log including lessons written. — What a clean stop-signal handoff looks like in practice: two lines from the user, a structured wrap-up from the agent — PR status, open discovery, daily log with lessons. The next session now knows where to pick up.

"Almost every step has a validate-don't-assume component. That's not coincidence — it's the structural answer to hallucination as a systemic property of every tool-using LLM agent."

— Jürgen von Hirschheydt

What concretely sets this process apart from a naive LLM workflow

Looking at the workflow as a whole, four structural properties stand out — and these four are exactly what separates a prototype playground from a productive coding setup:

Step 1 (state validation) is non-optional. This one discipline prevented two class-5 hallucinations in the last week alone — cleanly documented. Estimated damage if you skip the step: 30–60 minutes of detective work after the push each time, plus at least one delayed merge.
Steps 3–5 (branch + scope + verification) are the main axis. This is where "I think that's about right" turns into a produced, verified diff. Skip one of those three steps and you pay it back double, two iterations later.
Step 9 (memory in parallel) decides the outcome across weeks. That's the difference between a ROM-like construct that starts from zero every day and an agent that works consistently across multiple days. It's also the lever that only really scales in larger setups with subagents (in German).
Failsafes are documented in principles.md (operational) and lessons.md (anecdotal). Both get read, both get maintained. That's maintenance work, not a one-off setup — and it's exactly that maintenance that makes the difference between a workflow on paper and one that still works five sprints from now.

The most striking pattern: almost every step has a "validate, don't assume" component. Step 1 for state. Step 4 for identifiers. Step 5 for correctness claims. Step 7 for PR status. That's not coincidence — it's the structural answer to hallucination as a systemic property of every tool-using LLM agent. Once you've understood that, you stop tinkering with the model and start building on the workflow.

Practical example: what the discipline prevented

Two examples from last week:

State validation in step 1 caught two class-5 hallucinations. In both cases the agent wanted to write code against a branch state that no longer existed. A git fetch and a gh pr view immediately refuted the assumption. Without the step: 30–60 minutes of detective work after the push each time, plus at least one delayed merge.
Skipping step 3 cost us a bugfix (PR #84 → #85). We went straight into an open stack without a clean branch because it was "just one line". Result: PR #84 was closed, the fix cleanly re-set-up in #85. Time cost: two hours. Lesson captured in lessons.md, failsafe wording sharpened in principles.md.

Small but representative. Extrapolated over a week of disciplined agentic-coding work, we land at one to two prevented hallucinations per day — and that's exactly the lever that justifies the workflow. If you want to lock this down organizationally, the XALT AI safety and innovation framework (in German) gives you the right frame; and if you want to take the step toward vibe coding in the enterprise context (in German), this workflow gives you the necessary foundation.

Takeaway: four core claims

To summarize:

Hallucination is structural, not cosmetic. The only robust counter-lever is a workflow with documented validation and failsafe points — not a "better prompt".
Validate, don't assume is the guiding principle. State, identifiers, correctness claims, PR status — everything is backed by tool calls, not by plausibility.
Memory discipline decides things across weeks. Daily logs, ADRs, a well-maintained principles.md and lessons.md are the difference between an agent that starts from zero every day and one that stays consistent across sprints.
Stop signals get respected. Pause words end the session cleanly. Otherwise that's exactly where the memory holes form that get expensive the next day.

If you want to hear how this mechanic scales in larger enterprise setups, our insights from the Enterprise AI Summit 2026 (in German) give you the strategic framing — and our piece on Jira in the agentic era (in German) gives you the tool-side perspective on agentic workflows along the Atlassian platform.

Your next step

If you or your team want to use agentic coding productively without being slowed down by hallucinations, book our agentic-coding workshop (4 hours, remote). We set up the workflow described here on your repo — including a principles.md template, a lessons.md skeleton, ADR templates, and a first verified pull request that demonstrates the mechanics live.

Agentic-coding workshop for your team

Four hours, remote, on your repo. We set up the workflow — including a principles.md template, a lessons.md skeleton, and a verified pull request that demonstrates the mechanics live.

Request workshop

Agentic Coding in Practice: 10 Steps with Failsafes that Get AI Agents to Actually Ship Code

The problem: why naive agentic-coding workflows fail

The solution: validate-don't-assume as a structural principle

10 steps to reproducible agentic coding

00 Session start (before anything else)

01 Kickoff + context check

02 Plan proposal + disagree-or-approve

03 Branch setup

04 Implementation in small steps

05 Pre-merge verification, scaled with the change

06 Commit + push + PR body

07 Review and merge — or back to the table

08 Post-merge hygiene

09 Memory discipline (parallel to everything)

10 Stop-signal respect