![](assets/kapernikov-logo.png)

# Agentic AI Coding Workshop

**From prompts to autonomous coding agents**

January 2026

---

# Today's Goal

**Be artful with AI agents** — gain productivity while maintaining quality and security

This is for everyone at Kapernikov, regardless of current AI experience.

---

# Agenda (4 hours)

| Block | Duration | What |
|-------|----------|------|
| **Principles & security** | ~45 min | Best practices, feedback loops, sandboxes |
| **Tools & workflows** | ~30 min | Skills, sub-agents, worktrees, Spec-Kit |
| *Break* | 15 min | |
| **Tips & tricks** | ~15 min | Share what works, Q&A |
| **Hands-on** | ~2 hours | Build a reusable workflow |

We'll keep slides short. The real learning happens when you try it.

---

<div style="text-align: center; font-size: 1.8em; padding-top: 200px;">
Agentic AI will make all of us the 10x coder, right?
</div>

---

---

# Discussion

- Why do you think this happened?
- What does this mean for us?

---

---

# Why Most Don't Capture the Gains

We're early. The difference isn't the tools — it's how you use them.

---

# Part 1: Understanding Agentic AI

## From chatbots to autonomous agents

---

# The Evolution of AI Assistants

| Generation | Capability | Example |
|------------|-----------|---------|
| **Chat** | Single-turn Q&A | ChatGPT (basic) |
| **Autocomplete** | Code completion with file context | GitHub Copilot |
| **Agentic** | Autonomous task execution | Claude Code, Cursor |

---

# How a chatbot talks to an LLM

</div>

---

# The Agentic Loop

</div>

---

# AI agents: Strengths and weaknesses

An AI agent writes code at a very high speed in any technology you want. But it has some weaknesses:

* **Limited memory** — doesn't learn from past failures, limited context window
* **Notoriously bad at judging** — is this good enough? Are we optimizing what should not exist?
* **Doesn't identify structural issues** — an agent is always happy and never says "I can't continue working this way"
* **Doesn't see patterns over time** — an agent is really a code factory

> Key to successful agent usage is taking these limits into account

---

# Part 2: Coming to best practices

## What does it take for us to increase productivity with AI agents ?

---

# The problem

We want to find a way of working that:

* Produces high-quality code
* Increases the output of one developer by factor X
* Enables learning (as a group!) and continuous improvement.

---

# Principle 0: Know your role

| Human (long-term memory) | | Agent (fast executor) |
|--------------------------|:-:|----------------------|
| Remembers past failures | → | Implements safeguards |
| Identifies friction | → | Builds tooling |
| Updates instructions and feedback | → | Follows best practices |
| Spots patterns | → | Writes tests to catch them |

**The asymmetry:**
- **Agent:** fast execution, cheap implementation, no memory across sessions
- **Human:** slow execution, but remembers patterns and failures over time

---

# Principle 0: Know your role

What's "good enough"? What's the right pattern? What does "done" look like?

The agent doesn't decide. It infers — from your code, your docs, your tooling, your error messages. The prompt is a small fraction of what shapes its output.

> Human role: be the input architect.
> Agent role: execute against it.

<!--
Evidence (papers):
- Sclar et al. 2024 — prompt format sensitivity: https://arxiv.org/abs/2310.11324
- Yin et al. 2025 — position bias in pairwise LLM judgments: https://arxiv.org/abs/2506.14092
- Vendrow et al. 2025 — benchmark label noise hides reliability gaps: https://arxiv.org/abs/2502.03461

The "input" the agent sees is much more than the prompt: repo layout,
file naming, AGENTS.md, code style, existing patterns, error messages
from your linter, design docs. All of it teaches the agent what
"good" looks like. Curate it on purpose — that's harness engineering.
This sets up the closing Harness Engineering slide.
-->

---

# Principle 1: Tame the iteration loop

![w:900](./assets/ail-a11e9780.svg)

</div>

> If your feedback is just syntax errors or "it doesn't work", **improve** it.

<!--
Goal: Make automated feedback as strong as possible

- Type checks, linters, tests → agent can self-correct
- Human review only for: judgment calls, architecture, edge cases
- Invest in feedback quality = fewer human interruptions

STRONG doesn't always mean 'more tests, more lints'. feedback needs to be quick. An agent has no notion of time, so it will happily run the full e2e test suite on 4 browsers just to reproduce a tiny bug in one of them.
-->

---

# Principle 1: The Feedback Signal

The stronger your feedback signal, the longer the agent can run without you.

- **The feedback signal** is the driver for agent iterations and replaces human intervention if properly designed.

## The feedback signal

* Agents can write tests extremely quickly. But they suck at figuring out what to test.
* Tests are not the only way to make a feedback signal. Mandatory checklists, adversary reviews, ... are other tools.

---

# Principle 1: The Feedback Hierarchy

**Ranking feedback signals**:

| Signal | Why |
|--------|-----|
| **E2E tests** | Capture intent, not implementation details |
| **Checklists** | Cheap to write, force completeness |
| **Runtime signals** | Logs, traces, metrics — what's actually happening as it runs |
| **Adversarial reviews** | Agent reviews its own work with a critical eye |
| **Human feedback** | Fewer but higher-value interventions |

The agent can write tests — but "write tests" is not enough. **You** define what to test, the agent implements the check.

---

# Principle 2: Keep agent context clear

Smart decisions require clear context. Three problems to avoid:

**Context window rot** — performance degrades as context fills; worse with long back-and-forth.

**Tunnel vision** — pre-seeding biases the agent toward one solution; it may miss better paths.

**Dilution** — human language is full of inconsistency. Pile every rule into one big instruction file and the model can't tell which ones matter. Everything important = nothing important.

> **Solutions:** start clean for every task · keep knowledge modular and narrow · only add a rule when the agent broke without it · **progressive disclosure** — layer detail (name → skill body → linked references).

<!--
tunnel vision is a real problem and sometimes requires you to break out of it. especially when debugging agents have a strong tendency to believe their own assumptions

Progressive disclosure is undersold — and it's more than one pointer plus one body. Structure knowledge in *multiple levels*, each loaded only when needed: a skill's name + description is always resident; the full body loads when it triggers; the reference files and scripts the body links to load only when that step reaches them (same shape for AGENTS.md → linked docs, and sub-agents). That's how you get a huge library at a tiny footprint — and why the top level has to earn its place: the description is the only thing always in context.
-->

---

# Principle 3: Engineer for the lack of memory

Don't just prompt - make knowledge **persistent and discoverable**:

- **What the agent can't see doesn't exist** — Slack threads, Confluence pages, things in your head are invisible. Move them into the repo (skills, AGENTS.md, design docs).
- **Tests + CI** - Encode expected behavior in code, not conversation
- **Linter rules** - Enforce patterns automatically, run in CI
- **Type definitions** - Make constraints machine-readable

> What you tell the agent once, it forgets. What you write in code, it follows forever.

---

# Principle 4: Iterate on the process

Principle 1 was about the agent's loop. This is about **yours**.

You are a **process engineer**. Every failure is an opportunity:

1. Fix the immediate bug
2. Ask: *why didn't the agent catch this?*
3. Improve feedback (add test, linter rule, type check)
4. Update AGENTS.md/skill/... with the lesson learned

> The goal isn't just working code - it's a system that produces working code reliably.

---

# Harness Engineering

What you've been learning has a name. OpenAI articulated it best:

> "Building software still demands discipline, but the discipline shows up more in the **scaffolding** rather than the code."

Your **harness** is the feedback loops, skills, tooling, runtime signals, **agent eyes** (Chrome DevTools MCP), and review surface around the agent. Every primitive in this deck is a piece of one.

> Stop optimizing prompts. Start designing harnesses.

<small style="opacity:0.55; font-size:0.6em">
Source: openai.com/index/harness-engineering · Feb 2026
</small>

<!--
This is the synthesis of Part 2 — the principles. Tie back:
- Principle 0 (input architect) = curating what the agent infers from
- Principle 1 (iteration loop, feedback hierarchy with runtime signals) = the central loop
- Principle 2 (clear context, dilution) = keeping the harness signal-to-noise high
- Principle 3 (engineer for lack of memory) = persisting knowledge in the repo
- Principle 4 (iterate on the process) = the meta-loop maintaining the harness

What's NEW relative to what we've covered:
- "Agent eyes": Chrome DevTools MCP — agent drives a browser, takes DOM
  snapshots, screenshots, reads console + network. This closes the visual
  feedback gap that was painful in the chat-agent + sandbox feature. Not a
  slide of its own, but worth mentioning here as the concrete next investment.

The deck now picks up the security primitive (Part 3), then skills,
sub-agents, worktrees, spec-kit, code-review etiquette — each a slice of
the harness already named.
-->

---

# Part 3: Security

## The uncomfortable truth about agent sandboxes

---

# Security: The Uncomfortable Truth

**Current options:**

| Approach | Problem |
|----------|---------|
| Permission prompts | Slow, annoying — you click "yes" anyway |
| Command sandboxes | Not actually secure, easy to escape |
| Disable all guardrails | Fast, but... |

None of these actually work. Let's look at why.

---

# The Sandbox is Leaky

![](assets/sandbox-heredoc-permission.png)

</div>
<div class="col-right">

Would you say "yes"?

<br/>

If so, because you **trust the agent**, not because you understood the command.

<br/>

The sandbox only works if humans actually review. They don't.

</div>
</div>

---

# Time of Check ≠ Time of Use

Approvals are per-action, but security is about the **combination**:

| Step | You approve... | Seems safe? |
|------|---------------|-------------|
| 1 | Run `make` | Yes — just builds |
| 2 | Edit `Makefile` | Yes — just a text file |
| **Combined** | **Agent edits Makefile, then runs make** | **Full arbitrary code execution** |

The checkbox model gives a **false sense of control**.

Each approval looks harmless. The sequence grants full access.

---

# What Are We Actually Worried About?

The risk is **not** "the agent goes rogue." It's: good faith + too much access + mistakes.

| Risk | Example |
|------|---------|
| **Credential exposure** | Agent commits `.env` with customer API keys |
| **Supply chain** | Agent installs a typosquatted package |
| **Destructive mistakes** | Agent runs `rm -rf` on the wrong directory |
| **Prompt injection** | Malicious content in a cloned repo influences agent behavior |

These are mundane operational risks, not science fiction.

> Treat the agent like a junior dev: not malicious, but don't give it root access

---

# What To Do About It

| | Recommendation | Status |
|---|---|---|
| **Minimum** | Don't point agents at repos with production secrets | Doable right now |
| **Minimum** | Don't experiment with new tools on your work laptop | Doable right now |
| **Better** | Use devcontainers / separate user accounts | Works, ergonomics are rough |
| **Best** | Fully containerized agent runtime, no host access | Tooling is coming, not here yet |
| **Always** | Review the **diff** before merging, not the bash prompts | Your real security boundary |

**Key insight:** `git diff` is your security review, not the permission checkbox.

> Make the boundary about the **artifacts** the agent produces, not the side effects

---

# For non-coders: the in-app embedded agent

</div>
<div class="col-right">

For non-developers, an AI agent **embedded inside the app** is the safer option.

The sandbox is enforced by the application's own permissions and data boundaries — not by trust in a CLI prompt.

</div>
</div>

<!--
Agent prompt in screenshot: "hello, can you make an analysis for me on
time logged last month with a chart of who logged most"

Pedagogical point: not everyone at Kapernikov is running CLI agents on
their laptop. For analysts, project managers, etc., the safest path is
to use an agent that lives inside an existing application (e.g. the
fullstack-sota chat agent, or vendor in-app assistants). The app's
authn/authz, data scoping, and audit logs are the security boundary —
much stronger than the "click yes" permission model of a desktop CLI.
-->

---

# It's Already Happening

![](assets/meta-x-post.png)

</div>
<div class="col-right">

</div>
</div>

---

# Teaching Your Agent

## Slash commands, skills, and how to write them

---

# Slash commands

Slash commands are **user-invoked** prompts:

![h:400](./assets/mermaid-67aa101d.svg)

</div>

---

# Skills

Skills are reusable workflows that agents load **on demand**:

![h:400](./assets/mermaid-d982846d.svg)

</div>

---

# Writing Good Skills: Principles

1. **Skills encode the delta** — only what the model doesn't already know about *your* project
2. **Start with nothing** — add skills after observed failures, not upfront
3. **Every skill should have a clear and narrow scope** — one job, done well
4. **Define what "done" looks like** — give the agent criteria to self-check, not just instructions to follow

> Don't use `/init` or ask the AI to write skills for you — it will produce 200 lines of generic advice it already knew without the skill

<!--
The "delta" principle is the most important one.
A skill that says "use descriptive variable names" is worthless — the model already does that.
A skill that says "we use pnpm, schemas live in src/db/schema/, run make check before commit" is worth its weight in gold.

The /init anti-pattern: the AI generates instructions for itself, filled with things it already knows.
The result is 95% noise, 5% signal, and the noise actively degrades performance by filling the context window.
-->

---

# Writing Good Skills: Recommendations

- **Describe *when* to trigger**, not just what it does — this is how the agent picks the right skill. If your agent doesn't find your skill when applicable, improve this!
- **Keep it (as) short as possible** Challenge every line: does the model really need this? *Note: this can change over time as models evolve*
- **Include one concrete example** of the desired outcome
- **Test by observing** actual agent behavior, then refine.

<!--
Examples scale better than guidelines.
One "ideal case" example teaches more than a page of rules.

The iterative approach: watch where the agent fails, encode that specific lesson as a skill.
The skill grows from observed failures, not from imagination.
This connects directly to Principle 4 (iterate on the process).

When models get smarter, skills might need some shortening.
-->

---

# From Skills to Workflows

Some skills are **knowledge** — the delta about *your* project, a few lines the model reads.

Others are **workflows**: an ordered procedure the agent *executes* to produce a result — specswarm, superpowers, a porting skill.

> Write the procedure as prose in the skill body and the agent treats it as **advice, not a contract** — it skips the "obvious" steps, reorders, eyeballs. Output goes high-variance: sometimes great, sometimes bad.

Workflows need their own techniques to stay on the rails.

---

# Writing Good Workflows

1. **Forcing functions — artifacts over memory.** A step is done only when its named artifact exists on disk. *No artifact = not done.* Don't trust the agent to self-report.
2. **Hard gates.** Checkpoints that need *evidence* to pass — tests green + a written verify note *before* the commit is allowed.
3. **Make the steps the task list.** Convert the procedure into tasks at the *start* — tracked state, not prose skimmed once and forgotten.
4. **Push determinism into scripts, not prose.** Ship helper scripts with the skill; the agent *runs* the recipe instead of re-deriving it each run. Re-derivation from prose *is* the variance.
5. **Compose, don't inline.** A heavy step invokes a sub-workflow or sub-agent instead of bloating one giant skill.

---

# Organizing the Work

## Sub-agents, worktrees, and structured workflows

---

# Sub-agents

Sub-agents provide **context isolation**:

![h:420](./assets/mermaid-f346a29e.svg)

</div>

---

# Git Worktrees

**Problem:** Agent working on feature X blocks you from working on feature Y

**Solution:** Git worktrees = multiple working directories, one repo

```bash
git worktree add ../my-project-feature-x feature-x
# /my-project           ← your main work
# /my-project-feature-x ← agent's sandbox
```

**Benefits:**
- Agent can run tests, break things, iterate — without blocking you
- Easy to review: just diff the worktree
- Clean disposal: `git worktree remove` when done

**Watch out:** Devcontainers with hardcoded ports won't run in parallel!

---

# Git Platform CLI Tools

**`gh` (GitHub) / `glab` (GitLab)** — Give your agent repo superpowers

```bash
gh issue view 42                                 # read issues
gh pr create --title "Fix auth bug" --body "Closes #42"   # open PRs
gh pr checks                                     # CI status
gh issue comment 42 --body "Fixed in PR #43"     # comment back
```

**Why this matters:**
- Agent can work end-to-end: issue → branch → code → PR
- No copy-pasting between terminal and browser
- CI feedback becomes part of the agent loop

---

# Spec-Kit / SpecSwarm (claude)

**Spec-Driven Development**: specifications are the source of truth

![w:900](./assets/ail-fd45981a.svg)

</div>

---

# Structure the Collaboration

One level of abstraction at a time. Don't spec and implement in the same breath.

| Phase | Focus | Artifact |
|-------|-------|----------|
| **Specify** | What do we need? | spec.md |
| **Research & plan** | How do we build it? | plan.md |
| **Task breakdown** | What are the steps? | tasks.md |
| **Implement** | Write the code | code + tests |

Each phase has its own conversation. Mixing them is how you get half-baked specs and wrong implementations.

<!--
This is the core of spec-driven development. Tools like Spec-Kit (GitHub),
SpecSwarm, and AWS Kiro all follow this pattern.

Sources:
- https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
- https://www.thoughtworks.com/en-us/insights/blog/agile-engineering-practices/spec-driven-development-unpacking-2025-new-engineering-practices
-->

---

# Spec-Kit: Why It Works

| Problem | How Spec-Kit Helps |
|---------|-------------------|
| **No persistent memory** | Artifacts are the memory |
| **Context window rot** | Each phase has focused context |
| **Spec-code drift** | Specs are source of truth, regenerate code |
| **Weak feedback signal** | Clear tasks + success criteria → strong feedback |
| **Chaotic iteration** | Structured workflow with clear phases |
| **Hard to restart** | Can regenerate from any artifact |
| **Solving the wrong problem** | Forces you to define intent before writing code |

---

# Code review when an agent writes the code

When humans aren't writing the code, they're reviewing it. Make that time count.

| Role | Norm |
|------|------|
| **Author** | Read your own diff before sending. Minimize unrelated changes. |
| **Author** | Disclose intent (prototype? production?), not tool use. |
| **Author** | Every commit to main must work — feature-flag if cross-stack. |
| **Reviewer** | Prioritize reviews over writing — your queue blocks the team. |
| **Reviewer** | Focus on APIs and tests, not personal style. |
| **Reviewer** | An agent can help you digest a PR — but **you** must understand it. |
| **Both** | PRs are not for architecture debates. Design docs are. |
| **Both** | Style debates → linter rule or skill, not PR comments. |

<!--
Source: oblique.security/blog/how-oblique-handles-code-review-etiquette/

Key counter-intuitive points:
- "Authors are not required to disclose their use of AI tools, and reviewers
  should have the same standards regardless of how a change was created."
  Disclose intent, not tool.
- Slop in both directions: don't send slop PRs (disrespects reviewer time);
  don't send slop reviews (disrespects author effort).
- Agent code has tells reviewers should flag: long flowery names, no-substance
  multi-line comments, spaghetti helpers.
-->

---

# Your tips and tricks ?

What have **you** discovered?

- Prompting patterns that work
- Tools or configurations worth sharing
- Pitfalls to avoid
- Workflows that save time

---

# Part 5: Hands-on Session

## Teach Claude something it couldn't do

---

# Rescue a Slop Article

Pick one of these AI-written articles. Ask Claude *"rewrite this, make it better"* — it just gives you **smoother slop**.

Your job: build a **reusable skill** that *reviews* (catches the tells, the empty claims, the outright lies), **then** rewrites. Present the **structure** you built — not the prose.

<div style="display:flex; justify-content:center; gap:48px; margin-top:36px;">
<div style="text-align:center;"><img src="assets/qr-slop-1-arch.png" width="180"/><br/><small>① Arch "Rust init"</small></div>
<div style="text-align:center;"><img src="assets/qr-slop-2-airevolution.png" width="180"/><br/><small>② The AI Revolution</small></div>
<div style="text-align:center;"><img src="assets/qr-slop-3-whatisai.png" width="180"/><br/><small>③ What is AI?</small></div>
<div style="text-align:center;"><img src="assets/qr-slop-4-mlimportance.png" width="180"/><br/><small>④ ML in the Modern World</small></div>
</div>

<!--
THE EXERCISE — no code, no tests, no Docker. The feedback loop is your own
judgment: you read it and you can feel it's still slop. That's why it works for
non-coders too. The win we want them to feel: "I taught Claude to do something
it couldn't do on its own."

Why these articles: each is GENERIC BUT HAS A TRUE CORE worth preserving —
that's what makes "improve this" a real task rather than "delete and rewrite".
Pure zero-content hype can't be improved, only gutted.

The real links (read them out / paste in chat):
  ① Arch Linux Rust init (Linux Journal, web archive) —
     https://web.archive.org/web/20250618001301/https://www.linuxjournal.com/content/arch-linux-breaks-new-ground-official-rust-init-system-support-arrives
     NOTE: its central claim is FABRICATED — Arch has no official Rust init
     system, and the live page was pulled (404). Great for the false-claim lens:
     baseline "make it better" will happily polish the lie.
  ② The AI Revolution / LLM Models —
     https://danyalahmaad.medium.com/the-ai-revolution-how-llm-models-are-shaping-our-digital-future-e2cc6444c250
  ③ What is Artificial Intelligence (dup title) —
     https://medium.com/@sharetogomathy/what-is-artificial-intelligence-what-is-artificial-intelligence-1ded70f06c76
  ④ Growing Importance of Machine Learning —
     https://medium.com/@bikashpeeripaul90/the-growing-importance-of-machine-learning-in-the-modern-world-031fa4b15b7e
-->

---

# Some Tips

- **Know the goal first.** Who's it for, and what do you want from it? *"I'm a teacher, my readers are students"* and *"I'm a marketer, I want clicks"* produce completely different rewrites.
- **Prove the gap.** Run baseline *"make it better"* and capture where it stays slop *against that goal*. No gap → nothing to teach.
- **Review, then improve — with adversaries.** Two phases. For the review, spin up several subagents, each with specific instructions on a clean context, each adversarially attacking *one* thing: tells · false claims · phantom references · structure.
- **Steer with your taste.** Give it the phrasings you ban *and* a style distilled from documents you wrote or admire — what to avoid *and* what to aim for.

<!--
Goal: every other tip is relative to this one. The teacher and marketer goals
can even be in TENSION (clarity & accuracy vs hooks & clicks) — that tension is
exactly why "improve" on its own fails. Callback to the earlier principle slides:
define intent / solving the wrong problem.

Review→improve: the key moves are (a) two separate phases, not one pass; (b) a
CLEAN, isolated context per subagent so concerns don't bleed; (c) ADVERSARIAL
framing — "find every unsupported claim" beats "review this", which just gets
agreeable mush. Decomposition beats the monolith — the same lesson transfers
straight to code and data work. Subagents are the recommended path here.

Steer with taste: the ban-list is the negative space (your flagged words); the
distilled style is the positive target. Reference/exemplar steering generalises
far better than abstract rules like "be punchy". Distilling a style from past
docs is itself a reusable skill (this repo ships a kapernikov-writing-style
skill) — which quietly reinforces "the structure is the deliverable".

(Structure-is-the-deliverable lives on the title slide: "Present the structure
you built — not the prose.")
-->

---

# For Coders: Dockerize Anything

Point an agent at any GitHub project and it'll hand you a **runnable Docker image**. That part it can do on its own.

**But can it make sure the image:**

- **starts ergonomically** — one command, sane defaults, no fiddling?
- **runs many instances side by side** — no clashing ports, names, or volumes?
- **runs as non-root**?
- **has no known vulnerabilities** — actually scanned, not assumed?

Each "but can it" is a review lens — one adversarial subagent each. The deliverable is a reusable **`/dockerize`** skill that bakes the standards in.

---

# More Ideas — Same Pattern

- **Tech Debt Hunter** — `/techdebt` finds dead/duplicated code, proposes fixes
- **CSV Health Inspector** — any CSV → a data-quality report
- **Documentation Generator** — point at a repo → structured docs

---

# Call to Action

## Modular context is a team sport

---

# Build It Together

Modular agent context isn't just better for the agent — **it's better for collaboration**.

A skill, a guardrail, a linter rule — **shareable building blocks**. When you write one, everyone benefits.

**What I want to see:**
- **Create skills** — even small ones. A commit workflow, a review checklist
- **Share best practices** — what worked, what didn't. Write a skill, not a wall of text
- **Build on each other's work** — extend a skill, improve guardrails, add tests

> The goal isn't 10 people building 10 workflows. It's 10 people building on 1 workflow, making it great.