Chapter 7: Automation
From Assistant to Autonomous Agent
Up to this point, everything Claude Code (CC) does happens because you told it to. You say "launch training," CC launches training. You say "check GPU usage," CC checks GPU usage. You say "fix this error," CC fixes the error. It's a capable assistant, but it's still reactive. It waits for you.
That's about to change.
This chapter is the turning point of the entire guide. You're going to give CC the ability to detect problems on its own, investigate them on its own, and fix them on its own. Not because you asked — because something happened. A training run crashed. A download stalled. A GPU went idle. CC notices, CC responds, CC resolves. And if it can't resolve, it waits for you to check in and gives you a one-line summary of what happened and what it tried.
The tools are surprisingly simple: a hook system that triggers scripts when CC uses certain tools, a watchdog script that monitors your server independently, and a built-in scheduler that lets CC check in on things periodically. Separately, they're just small utilities. Together, they create a feedback loop that turns CC from a tool you talk to into a system that acts on its own.
After this chapter, your GPUs no longer wait for you to notice that something went wrong.
Hooks: Teaching CC to React
Claude Code has a hook system. It lives in ~/.claude/settings.json, and it lets you run shell commands automatically when CC does certain things.
The most useful hook type is PostToolUse — it fires every time CC finishes using a tool: running a Bash command, reading a file, writing something. Your hook script receives context about what just happened, and it can take action based on that context.
Why does this matter? Because it means CC can trigger side effects without being explicitly told to. You don't have to remember to say "also start monitoring." The hook handles it.
The Concrete Use Case
Here's the scenario. You tell CC to launch a training run on your server. CC SSHes in, starts a tmux session, runs the training command. That's the normal flow. But with a hook configured, something extra happens: the moment CC runs that tmux command, a script fires that detects the new tmux session and automatically starts a monitoring process for it.
CC didn't ask about monitoring. You didn't ask about monitoring. The hook just made it happen.
Configuration
In ~/.claude/settings.json:
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "/path/to/your/monitor-hook.sh",
            "timeout": 15
          }
        ]
      }
    ]
  }
}

This says: every time CC uses the Bash tool, run monitor-hook.sh. The timeout is 15 seconds — if the script takes longer than that, it gets killed. Hooks should be fast.
What the Hook Script Does
The hook script receives the tool input as context. It checks whether the Bash command involved tmux — specifically, whether CC just started a new tmux session on a remote server. If it did, the script SSHes to that server and launches the watchdog monitoring process (which we'll cover next).
If the Bash command didn't involve tmux — say CC just ran ls or cat — the script exits immediately. No overhead, no side effects. The hook only activates when something monitoring-worthy happens.
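What might monitor-hook.sh contain? Here is a minimal sketch in Python (a hook can be any executable; the `tool_input`/`command` fields follow Claude Code's hook input payload). The server name, watchdog path, and session names are placeholders you would adapt to your setup:

```python
#!/usr/bin/env python3
"""PostToolUse hook sketch: start the watchdog when CC launches tmux."""
import subprocess

def is_tmux_launch(command: str) -> bool:
    # Matches both "tmux new ..." and "tmux new-session ...".
    return "tmux new" in command

def handle(payload: dict) -> bool:
    """Inspect the tool payload; return True if the watchdog was triggered."""
    command = payload.get("tool_input", {}).get("command", "")
    if not is_tmux_launch(command):
        return False  # plain `ls`, `cat`, etc.: exit with no side effects
    # Idempotent: `has-session` guards against double-starting the watchdog.
    # Popen launches in the background so the hook returns immediately.
    subprocess.Popen([
        "ssh", "lab-server",  # placeholder host
        "tmux has-session -t watchdog 2>/dev/null || "
        "tmux new -d -s watchdog 'python3 /opt/monitor/watchdog.py'",
    ])
    return True

# When invoked as a hook, tool context arrives as JSON on stdin:
#   import json, sys; handle(json.load(sys.stdin))
```

Note that the slow part (SSH plus tmux startup) runs in a background process; the hook itself returns well inside the 15-second timeout.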
The beauty of this design: you configure it once and forget about it. From that point on, every time CC launches a tmux session for training, downloading, or any long-running task, monitoring starts automatically. The human doesn't need to remember. CC doesn't need to remember. The system remembers.
Hook Design Principles
Keep hooks simple. A hook should do one thing: detect a condition and trigger a response. If your hook script is getting complicated, you're putting too much logic in the wrong place. The hook starts the watchdog. The watchdog does the actual monitoring. The hook is just the trigger.
Keep hooks fast. That 15-second timeout exists for a reason. Hooks run synchronously — CC waits for them to finish before continuing. A slow hook makes CC feel laggy. If your hook needs to start a long-running process, have it launch the process in the background and return immediately.
Keep hooks idempotent. If CC runs three tmux commands in a row, your hook fires three times. It should handle that gracefully — check if monitoring is already running before starting another instance. Double-monitoring wastes resources and creates confusing output.
Watchdog: Your Server-Side Guardian
The hook starts the watchdog. Now let's talk about what the watchdog actually does.
The watchdog is a Python script that runs on the server — not on your local machine, not inside CC. It's an independent process, sitting in its own tmux session, checking on your tasks every 60 seconds. It doesn't need CC to be connected. It doesn't need your SSH session to be alive. It just runs.
This independence is the key insight. CC connects and disconnects. SSH sessions come and go. Your laptop might restart. But the watchdog is on the server, in tmux, and it stays alive as long as the server is up. When CC reconnects, it reads the watchdog's output and instantly knows the state of everything.
What It Monitors
tmux session health. Is the session still alive? Is the process inside it still running? A tmux session can exist even after the process inside it has crashed — the session stays open, but the work has stopped. The watchdog detects this: session alive, process dead. That's a crash.
GPU utilization. The watchdog runs nvidia-smi and checks GPU usage. If a GPU that's supposed to be running training shows less than 5% utilization, something is wrong. The process might have crashed, hung, or entered an infinite loop that doesn't touch the GPU. Low GPU utilization on a "busy" GPU is a red flag.
Download progress. For file downloads (model weights, datasets), the watchdog checks whether the file size is growing. If the file hasn't grown in the last check interval, the download might have stalled. If it's growing but slowly (less than 1 MB/s when you expect 100 MB/s), the proxy might be broken or the server might be throttling.
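To make these checks concrete, here is a minimal Python sketch of the watchdog's classification logic. The 5% threshold and the status vocabulary come from the text; the tmux and nvidia-smi queries are standard, but the shell-name heuristic and the SLOW cutoff (under 1% of the expected rate) are illustrative assumptions:

```python
import subprocess

def process_alive(session: str) -> bool:
    """True if the tmux session exists AND its pane is still running work.

    Heuristic: when a workload started from a shell crashes, the pane's
    current command usually falls back to that shell.
    """
    result = subprocess.run(
        ["tmux", "display-message", "-p", "-t", session,
         "#{pane_current_command}"],
        capture_output=True, text=True,
    )
    return (result.returncode == 0
            and result.stdout.strip() not in ("bash", "zsh", "sh", "fish"))

def gpu_utilization(index: int = 0) -> float:
    """Current utilization of one GPU, in percent."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits", "-i", str(index)])
    return float(out.strip())

def classify_training(alive: bool, gpu_util_pct: float) -> str:
    if not alive:
        return "DEAD"         # session may persist, but the work has stopped
    if gpu_util_pct < 5:      # threshold from the text; tune to taste
        return f"IDLE (GPU {gpu_util_pct:.0f}%)"
    return "OK"

def classify_download(alive: bool, bytes_grown: int,
                      interval_s: float, expected_mbps: float) -> str:
    if not alive:
        return "DEAD"
    if bytes_grown == 0:
        return "STALLED"
    if bytes_grown / interval_s / 1e6 < 0.01 * expected_mbps:
        return "SLOW"         # moving, but at a fraction of the expected rate
    return "OK"
```

The measurement helpers gather raw facts; the classifiers turn them into the one-word statuses that end up in the summary file.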
Output Format
The watchdog writes a simple summary file that anything can read:
# watchdog.py runs on the server
# It checks all registered tasks every 60 seconds
# Output: /tmp/monitor/summary.txt
# Format: "task1: OK | task2: DEAD | task3: IDLE (GPU 0%)"

That's it. One line per task, a one-word status each. OK means everything is fine. DEAD means the process has stopped. IDLE means the GPU utilization is suspiciously low. STALLED means a download isn't progressing. SLOW means a download is moving but at a fraction of the expected speed.
The summary file is human-readable. You can SSH in and cat it yourself. CC can read it with a single command. No parsing libraries, no JSON schemas, no API endpoints. Just a text file.
Why Not Monitor from CC Directly?
You might be wondering: why not just have CC SSH in and check nvidia-smi whenever it wants? Why bother with a separate watchdog script?
Three reasons.
First, continuity. CC's monitoring depends on CC being running. If your local machine restarts, if CC's session gets interrupted, if there's a network hiccup — monitoring stops. The watchdog doesn't care about any of that. It's on the server, it's in tmux, and it keeps checking.
Second, efficiency. Having CC SSH in every 60 seconds to run nvidia-smi creates a lot of SSH overhead. The watchdog is already on the server — it just runs a local command. No network, no latency, no connection setup.
Third, separation of concerns. The watchdog gathers data. CC makes decisions. The watchdog doesn't know how to fix a crashed training run — it just reports the crash. CC reads the report and decides what to do. Clean separation makes both components simpler and more reliable.
CronCreate: CC's Built-In Scheduler
The watchdog runs on the server and writes status files. But who reads them?
CC has a built-in scheduling feature called CronCreate. It lets CC set up periodic tasks — things that run on a timer, whether or not you're actively talking to CC. You don't need to install anything extra. It's part of Claude Code itself.
Here's how it works in practice:
Every 15 minutes:
  SSH to server
  Read /tmp/monitor/summary.txt
  If all tasks show "OK" → do nothing, go back to sleep
  If any task shows "DEAD" or "IDLE" or "STALLED" → investigate and fix

CC sets this up with a single CronCreate command. One cron job per server, not per task. This is important — the watchdog already tracks all tasks on a given server and summarizes them into one file. CC reads one file and gets the status of everything.
The Check Cycle
When the cron fires, CC SSHes to the server and reads summary.txt. If every line says OK, CC does nothing. It doesn't log anything, it doesn't notify you, it doesn't take any action. Silence means everything is fine.
But if a line says DEAD — a training process has crashed — CC switches into investigation mode. It SSHes to the server, attaches to the tmux session (or reads its scrollback), looks at the last few lines of output to find the error, reads the training log files, and diagnoses the problem. Then it takes action: editing a config file, reducing the batch size, fixing a path, whatever the error requires. Then it restarts the training and goes back to sleep.
If a line says IDLE — GPUs showing near-zero utilization on what should be an active training run — CC checks whether the process is still alive but hanging (perhaps stuck on a data loading bottleneck) or whether it's crashed but the tmux session hasn't been cleaned up.
If a line says STALLED — a download that has stopped progressing — CC checks the network, restarts the download, or switches to an alternative mirror.
What CC Can and Can't Fix
CC is good at fixing deterministic, well-defined problems. OOM errors (reduce batch size). Missing file paths (fix the path). Crashed downloads (restart them). These are the bread and butter of training failures, and CC handles them well.
CC is less good at fixing subtle training issues. A loss that's not converging might mean the learning rate is wrong, the data is corrupted, or the architecture has a bug. CC can flag these (the watchdog reports OK because the GPU is busy, but WandB shows the loss plateauing), but it shouldn't unilaterally change your hyperparameters or architecture. Those decisions are yours.
The rule of thumb: CC fixes infrastructure problems automatically. It flags research problems for your attention.
The Auto-Fix Loop
Now let's put it all together. Hook, watchdog, cron, CC — four components that create an autonomous monitoring and recovery loop.
Here's how a real crash plays out:
Training crashes at 2:47am
- OOM error at epoch 12 — model activations exceeded GPU memory

Watchdog detects (next check, within 60 seconds)
- GPU utilization drops to 0%
- tmux session still alive, but training process is dead
- Writes "train-exp03: DEAD" to summary.txt

CC's cron fires (within 15 minutes)
- Reads summary.txt, sees DEAD status
- SSHes to the server

CC investigates
- Reads tmux scrollback: "RuntimeError: CUDA out of memory"
- Reads training config: batch_size=64
- Reads the error details: 23.4GB already allocated on a 24GB GPU

CC fixes
- Edits config: batch_size=32, gradient_accumulation_steps=2
- (Effective batch size stays the same: 32 x 2 = 64)
- Restarts training in the same tmux session
- Training resumes from the latest checkpoint

You wake up at 7am
- Training is at epoch 47
- CC logs: "Crash at 2:47am, OOM at epoch 12. Reduced batch size 64→32, added gradient accumulation. Restarted at 2:52am. Running normally since."

From crash to recovery: five minutes. From crash to your awareness: whenever you happen to check. The GPUs didn't sit idle for seven hours. You didn't get woken up. The research kept moving.
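The arithmetic behind this fix is worth making explicit: halving the per-step batch while doubling gradient accumulation preserves the effective batch size, and therefore the optimization behavior, while roughly halving peak activation memory. A hypothetical helper capturing that fallback rule (not a real CC function):

```python
def oom_fallback(batch_size: int, accum_steps: int) -> tuple[int, int]:
    """Halve the per-step batch, double gradient accumulation.

    Effective batch size (batch_size * accum_steps) is unchanged, so the
    training dynamics stay the same; only peak memory per step drops.
    """
    if batch_size < 2:
        raise ValueError("batch size already minimal; need a different fix")
    return batch_size // 2, accum_steps * 2

# 64 x 1  ->  32 x 2: effective batch size stays 64
```

The rule can be applied repeatedly (64→32→16→...) until the step fits in memory, which is exactly the kind of deterministic fix CC handles well.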
This is the moment. This is what the first six chapters were building toward. Your AI fixed a bug and restarted training while you were sleeping.
The Full Chain
Let's trace the complete automation chain from start to finish:
- You tell CC to launch training. CC SSHes to the server, starts a tmux session, runs the training command.
- The PostToolUse hook fires. It detects the tmux command and starts the watchdog on the server (if it's not already running).
- CC sets up a CronCreate job. Every 15 minutes, check the watchdog summary.
- You go to sleep. CC goes idle. The watchdog keeps monitoring.
- Something breaks. The watchdog detects it and writes to the summary file.
- CC's cron fires. CC reads the summary, sees the problem, and takes action.
- CC fixes the issue and restarts. Training continues.
- You check in the morning. CC gives you a status report.
No step in this chain requires your intervention. No step requires you to be awake, connected, or even aware that something happened. The system is closed — it detects, diagnoses, and recovers on its own.
Demo: Try It Yourself
You don't need a real training run to test this. Here's how to see the automation loop in action with a fake process.
Step 1: Start a Fake Training Process
SSH to your server and create a tmux session with a simple loop:
tmux new -s demo -d "python3 -c \"import time; [print(f'epoch {i}', flush=True) or time.sleep(2) for i in range(1000)]\""

This prints "epoch 0", "epoch 1", "epoch 2"... every two seconds. It simulates a training process that's alive and producing output.
Step 2: Tell CC to Monitor It
In Claude Code, say something like: "I have a training process running in tmux session demo on lab-server. Monitor it."
CC will set up monitoring — either through the watchdog or by creating a cron job to periodically check the tmux session. Watch what it does. It should SSH in, verify the session exists, and set up periodic checks.
Step 3: Simulate a Crash
Kill the process inside the tmux session:
tmux send-keys -t demo C-c

The tmux session is still alive, but the process inside it is dead. This is exactly what happens when a training run crashes — the session persists, the work stops.
Step 4: Wait and Watch
Don't type anything in CC. Don't tell it to check. Just wait.
When CC's next monitoring check fires, it will SSH to the server, see that the process in demo is no longer running, and investigate. Watch it read the tmux scrollback, determine what happened, and take action.
Step 5: See the Recovery
If you've set things up correctly, CC detects the crash, investigates, and restarts the process — all without you saying a word.
For this demo, CC might simply restart the same command. In a real training scenario, it would read error logs, diagnose the root cause, and fix the underlying issue before restarting.
Practical Considerations
Don't Over-Automate
It's tempting to hook everything. Don't. Start with the one hook that matters most: tmux sessions triggering watchdog monitoring. Add more only when you have a clear, repeated use case.
Every hook is a piece of implicit behavior — something that happens without being explicitly asked for. Too many hooks and you lose track of what CC is doing behind the scenes. Start simple, add gradually.
Cron Frequency
Fifteen minutes is a good default for training monitoring. It means the worst-case delay between a crash and CC noticing is 15 minutes. For most research, that's fine — the difference between detecting a crash at 2:47am and 3:02am doesn't matter when you're asleep either way.
Don't set it to every minute. Every cron tick means an SSH connection and a cat command. That's lightweight, but it's not free. And CC doesn't need to check that often — the watchdog is doing the real-time monitoring. CC just needs to catch up periodically.
Multiple Servers
If you have tasks running on multiple servers, set up one cron job per server. Each server has its own watchdog instance, its own summary file, and its own cron check. They're completely independent. A crash on server A doesn't affect monitoring on server B.
When to Intervene Manually
The automation handles the common cases: crashes, OOM errors, stalled downloads, idle GPUs. But some situations need your judgment:
- Loss diverging (training is running, GPU is busy, but the experiment is failing)
- Results that look too good (possible data leakage or evaluation bug)
- Resource conflicts with other users on shared servers
- Decisions about whether to continue or kill a run that's not performing well
For these, CC flags the issue and waits for your input. The automation keeps the infrastructure running. The research decisions stay with you.
Checkpoint
Set up CC to monitor a tmux session on your server. Start a fake process in that session. Kill the process inside it. If CC detects the crash and restarts the process without you typing anything — congratulations, you have an autonomous research assistant.
Your GPUs now have a guardian that never sleeps, never forgets to check, and never cancels plans because an ablation study needs babysitting. The rest of this guide builds on this foundation — because once CC can act autonomously, you can start trusting it with the full research workflow.