
Chapter 8: Research Workflow


Putting It All Together

You have SSH access from anywhere (Chapter 2). You have tmux keeping processes alive across disconnections (Chapter 3). You have proxy tunnels so your server can reach the outside world (Chapter 4). You have Claude Code installed and running in a persistent session (Chapter 5). You've taught it about your specific servers, GPUs, conda environments, and conventions (Chapter 6). You've set up hooks, watchdog monitoring, and periodic health checks so it can act autonomously (Chapter 7).

Seven chapters of infrastructure. Individual tools, each useful on its own, but so far they've been presented in isolation. SSH config is just a convenience. tmux is just a session manager. Watchdog is just a monitoring script. You could use any of them without the others.

But the power of this system isn't in the individual pieces — it's in how they connect. SSH config enables Claude Code to reach the server in one command. tmux enables training to survive disconnections. The proxy enables the server to download models and log to WandB. Watchdog enables Claude Code to detect crashes. CronCreate enables Claude Code to check watchdog without you asking. Each layer enables the next.

This chapter is the map. It shows you how all those pieces fit together into a single pipeline that takes a research idea from your head to trained results on a server — with Claude Code orchestrating every step. The next chapter is the territory: you'll actually run through this pipeline with a real experiment.


The Pipeline

Every research experiment, no matter how complex, follows the same five-phase pattern:

┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│  1. Idea  │───▶│  2. Code  │───▶│  3. Sync  │───▶│ 4. Train  │───▶│5. Results │
│  & Plan   │    │           │    │ to Server │    │           │    │           │
└───────────┘    └───────────┘    └───────────┘    └───────────┘    └───────────┘
    LOCAL            LOCAL           rsync            SERVER           LOCAL

Phases 1, 2, and 5 happen on your local machine. Phase 3 is a bridge. Phase 4 happens on the GPU server. This separation is deliberate — your local machine handles intelligence (planning, coding, analysis), and the server handles computation (training). They communicate through SSH and rsync, with Claude Code sitting on the local machine and orchestrating the entire flow.

Let's walk through each phase in detail.


Phase 1: Idea & Planning (Local Machine)

This is where research begins. You have a question, a hypothesis, or a paper you want to reproduce. You sit down with Claude Code — or more likely, you're lying on the couch with your phone, talking to Claude Code through Termius.

Reading papers. You can point Claude Code at a PDF and ask it to summarize the key contributions, extract the training setup, or identify what datasets they used. It reads the paper and gives you a structured summary. You don't have to wade through 12 pages of dense notation to figure out that they used a learning rate of 3e-4 with cosine annealing — Claude Code pulls that out for you. It can also compare multiple papers, highlighting differences in their approaches and experimental setups.

Searching literature. Claude Code has access to tools like arXiv search, HuggingFace paper search, and web search. You can say "find recent papers on test-time training for domain generalization" and it comes back with a list of relevant work, each with a one-line summary. This isn't a replacement for deep reading — it's a triage tool. It helps you quickly identify which of the 50 papers on this topic are actually worth your time, which ones are the real baselines, and which ones you can safely skip.

Brainstorming approaches. Describe a problem to Claude Code and it proposes solutions. "I want to improve the robustness of CLIP zero-shot classification on corrupted images — what approaches exist?" It outlines prompting strategies, fine-tuning approaches, test-time adaptation methods, and more. You pick the direction. Claude Code fills in the details. It can also poke holes in your ideas — "this approach assumes access to target domain data at training time, but your setup is zero-shot" — which is more useful than cheerful agreement.

Planning experiments. Once you've settled on an approach, Claude Code helps you write the plan. What's the hypothesis? What's the baseline? What datasets will you use? What metrics matter? What does success look like? What are the hyperparameters? This plan gets written into a project-level CLAUDE.md as the experiment configuration — a reference that Claude Code will consult at every subsequent phase. It's also your record of intent: when you're analyzing results three weeks later, you'll know exactly what you were trying to test.

The key thing about Phase 1: your GPU servers are completely uninvolved. You're not burning compute. You're not occupying hardware that your labmates need. You're thinking, reading, and planning — the highest-leverage activities in research. The GPUs will come later, and when they do, they'll be used efficiently because you planned first.


Phase 2: Code Implementation (Local Machine)

With a plan in hand, Claude Code writes the code.

Project structure. Claude Code creates the directory layout: training script, evaluation script, config files, requirements.txt or environment.yml. It follows whatever conventions you've established in your CLAUDE.md — maybe you always put configs in a configs/ directory, maybe you use a specific argument parser, maybe you have a standard logging setup with WandB integration. Claude Code knows your preferences because you taught them in Chapter 6.

Training script. Claude Code writes the actual training loop, or adapts an existing one from a reference codebase. It handles data loading, model construction, optimizer setup, learning rate scheduling, checkpointing, and WandB logging. If you're building on top of an existing framework like HuggingFace Transformers or PyTorch Lightning, it modifies the right files without breaking the rest. If you're writing from scratch, it follows standard patterns that are easy to debug.

Version control. Everything goes into git. Each logical change gets its own commit — not one massive "initial commit" with 50 files, but a series of small, reviewable commits: "add data loading pipeline", "add model architecture", "add training loop with WandB logging", "add evaluation script". This isn't just good practice — it's a safety net. If something goes wrong later, you can bisect. If you need to revert the evaluation changes without touching the training code, you can. Git discipline costs nothing and saves everything.

Project-level CLAUDE.md. This is the critical piece that ties the implementation to the infrastructure. Claude Code writes a CLAUDE.md in the project root that captures everything the AI needs to know about this specific project: which server to use, which conda environment to activate, what the training command looks like, what the expected baseline metrics are, where the datasets live on the server. When Claude Code opens this project next week — or after a context compact, or in a completely new session — it reads this file and immediately knows the full picture. No ramp-up time. No "wait, where were we?"
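A minimal sketch of what such a file might contain. Every name, path, and number below is a placeholder for illustration, not a prescription:

```markdown
# Project: your-project

## Infrastructure
- Server: your-server (GPUs 0-3)
- Environment: conda activate your-env
- Remote code dir: /work/your-project
- Datasets: /data/ (shared lab directory on the server)

## Training
- Command: python train.py --config configs/exp1.yaml
- Logging: WandB project "your-project"
- Baseline to beat: 95.5% top-1 (reference paper)
```

Keep it short. The point is that a fresh session can read this in seconds and know where everything lives.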

Phase 2 is still entirely local. No server involvement. No network dependency beyond whatever Claude Code needs for its own API. You could do this on an airplane. The code exists on your local machine, version-controlled, tested with quick sanity checks if needed, ready to be deployed.


Phase 3: Sync to Server (Local → Server)

This is the bridge between your local development environment and the remote compute. Your code is ready. Now it needs to be where the GPUs are.

One command:

bash
rsync -avz --delete \
  --filter=':- .gitignore' \
  --exclude='.git/' --exclude='wandb/' --exclude='outputs/' \
  --exclude='*.pyc' --exclude='__pycache__/' \
  ~/projects/your-project/ your-server:/work/your-project/

This pushes your code to the server. Let's break down the key decisions:

Sync code only. Notice what's being excluded: .git/ (no need for git history on the server — version control lives on your local machine), wandb/ (WandB logs are uploaded to the cloud, not synced between machines), outputs/ (results stay on the server until you explicitly pull them back). You're syncing source code, config files, and small utility scripts. Nothing else.

--delete keeps remote in sync with local. When you delete a file locally, --delete removes it from the server too. This prevents stale files from causing mysterious bugs — like an old config file that overrides your new one, or a deleted module that still gets imported because the .py file is gone but a stale .pyc cache lingers on the server. But --delete comes with a critical warning: if training is actively running on the server and writing output files into the project directory, --delete will remove those files. Always verify that no training is currently running before syncing with --delete.

--filter=':- .gitignore' reuses your .gitignore rules for rsync exclusion. You maintain one set of ignore patterns, not two.

Models and datasets are NOT synced. This is a common mistake that new users make. Your training data might be 100GB. Your model weights might be 10GB. You absolutely do not rsync those from your laptop over a home internet connection. They live on the server — either in a shared data directory that your lab maintains, or downloaded directly on the server using huggingface-cli download. Remember Chapter 4? You set up proxy tunneling so the server has internet access. This is one of the main reasons why. The server downloads its own weights and data at full datacenter speed.

First sync needs a directory. Before your very first rsync to a new project, create the remote directory:

bash
ssh your-server "mkdir -p /work/your-project"

Subsequent syncs are incremental — only changed files are transferred. A typical code-only sync takes 2-5 seconds, fast enough to iterate rapidly.

The rule to remember: code travels to the server. Data and weights live on the server. Results travel back. This one-directional flow keeps things clean and fast.


Phase 4: Training (Server)

Now we're on the GPU server. The code is there, the data is there, the conda environment is ready. Time to train.

The Pattern: Foreground First, Then Background

This pattern is non-negotiable. Don't skip the foreground step.

Step 1: Foreground smoke test. Claude Code SSHes to the server and runs the training script directly — not in tmux, not in the background. A straight command:

bash
ssh your-server 'cd /work/your-project && \
  conda activate your-env && \
  python train.py --epochs 1 --subset 100'

(If conda activate fails with "command not found" in a non-interactive SSH shell, initialize conda first, e.g. source ~/miniconda3/etc/profile.d/conda.sh, adjusting the path to your install.)

You watch the output scroll by. Does it crash immediately? Does the data loader find the files? Does the model fit in GPU memory? Does the loss start at a reasonable value and decrease? Does WandB connect and start logging?

This is the moment where 90% of problems surface. Wrong data paths, missing packages, tensor shape mismatches, OOM errors, WandB authentication failures — all of them show up in the first 30 seconds. You fix them immediately. Edit the code locally, rsync again, re-run. Iterate until the training runs cleanly for at least a few hundred steps with loss going down.

Two minutes of foreground testing saves hours of discovering problems in a tmux session that's been running unattended overnight.

Step 2: Background long run. Once you're confident the training is stable, kill the foreground process. Now restart it inside tmux on the server:

bash
tmux new-session -d -s train01 "cd /work/your-project && \
  export WANDB_API_KEY=your-key && \
  conda activate your-env && \
  python train.py --config configs/exp1.yaml"

The training is now running inside tmux. You can disconnect from the server. You can close your laptop. You can go home. The training continues because tmux keeps the session alive (Chapter 3), and the proxy tunnel stays up because of SSH ControlMaster (Chapter 4).

Step 3: Monitoring. This is where Chapter 7 pays off. The watchdog script monitors GPU utilization and process health on the server. CronCreate on your local machine checks the watchdog's summary file every 10-15 minutes. If everything is normal, nothing happens — no notifications, no interruptions, no noise. If the training crashes, GPU utilization drops to zero, or the watchdog detects a stall, Claude Code is alerted and can intervene: read the error log, diagnose the problem, fix the config, and restart.
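The shape of that periodic check can be sketched in a few lines of shell. The summary path and its line format here are assumptions for illustration, not the exact watchdog output from Chapter 7:

```shell
# Hypothetical periodic check: silent when all is well.
SUMMARY="$HOME/.watchdog/your-server-summary.txt"
if [ ! -f "$SUMMARY" ]; then
  echo "no summary file yet: watchdog may not be running"
elif grep -qE 'CRASHED|STALLED' "$SUMMARY"; then
  echo "attention needed:"
  grep -E 'CRASHED|STALLED' "$SUMMARY"
fi
```

The silent-by-default design matters: the check only produces output (and therefore only pulls Claude Code in) when something needs attention.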

WandB: Your Remote Dashboard

Every training script should log to Weights & Biases. This isn't optional decoration — it's what makes the entire remote workflow practical.

WandB gives you a web dashboard that shows training loss, validation metrics, learning rate, GPU utilization, and whatever custom metrics you log. You can check this from your phone's browser. No SSH required, no Termius, no command line. Just open the URL and see the loss curve.

This is your passive monitoring channel. Active monitoring (watchdog + CronCreate) catches crashes and errors. Passive monitoring (WandB dashboard) tells you whether the training is going well even when nothing is wrong. A loss curve that's flattening early, a validation accuracy that's plateauing — these aren't "errors" that watchdog would catch, but they're signals that your next experiment should change something.

Claude Code can also query WandB programmatically, pulling metrics for comparison and analysis without you having to read charts.

When Things Go Wrong

Training crashes happen. OOM errors, NaN losses, data loader exceptions, NCCL timeouts on multi-GPU runs. The question isn't whether they'll happen — it's how quickly they're detected and fixed.

With this system, detection is automatic (watchdog + CronCreate). The fix depends on the problem:

  • OOM: Claude Code edits the config to reduce batch size, adjusts gradient accumulation steps to keep the effective batch size the same, rsyncs the change, and restarts from the latest checkpoint.
  • NaN loss: More investigation needed. Claude Code reads the logs, checks if it's a learning rate issue (too high), a data issue (corrupted sample), or a numerical instability (missing gradient clipping). It proposes a fix, and either applies it automatically or asks you if the fix involves a meaningful experimental decision.
  • Data loader crash: Usually a corrupted sample or a path issue. Claude Code identifies the problematic file, adds it to an exclusion list or fixes the path, and restarts.
  • NCCL timeout: Multi-GPU communication failure. Often transient. Claude Code restarts the training from the latest checkpoint.
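The OOM recipe in the first bullet is mechanical enough to sketch. Assuming batch size and accumulation steps are plain config values, halving one and doubling the other preserves the effective batch size:

```shell
# Example values, not from any real config.
batch=32
accum=4
new_batch=$((batch / 2))
new_accum=$((accum * 2))
# Effective batch size (batch * accum) is unchanged.
echo "before: $((batch * accum))  after: $((new_batch * new_accum))"
# → before: 128  after: 128
```

Keeping the effective batch size constant is what makes this fix safe to apply automatically: the optimization dynamics stay (approximately) the same, only the memory footprint per step changes.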

The goal: most crashes are handled without waking you up. You check WandB in the morning and see that training continued with a brief 2-minute interruption at 3am. That's the system working as designed.


Phase 5: Results & Analysis (Server → Local)

Training is done. Results are on the server. Time to bring them home.

bash
rsync -avz \
  --exclude='checkpoint-*/' --exclude='*.safetensors' --exclude='*.bin' \
  your-server:/work/your-project/outputs/ \
  ~/projects/your-project/outputs/

Notice the exclusions: you pull back logs, metrics, evaluation results, and any generated outputs. You do NOT pull back model checkpoints — they're enormous (often gigabytes each) and you don't need them on your local machine. If you need to run more evaluation or generate more outputs, do it on the server where the checkpoints already live.

Claude Code analyzes results locally. It reads the training logs, parses the metrics, compares against baseline numbers, and gives you a structured summary: "Experiment A achieved 94.2% accuracy on ImageNet validation, which is 1.3 points below the baseline paper's reported 95.5%. The loss curve shows instability in the first 1000 steps — the learning rate warmup may be too short. Experiment B with the modified schedule reached 95.1%, much closer to the target."
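A sketch of the mechanical part of that analysis. The metrics file name, its format, and the baseline number are invented for illustration:

```shell
# Compare the last logged validation accuracy against a baseline.
BASELINE=95.5
acc=$(grep '^val_accuracy' outputs/metrics.txt | tail -1 | awk '{print $2}')
awk -v a="$acc" -v b="$BASELINE" \
  'BEGIN { printf "val_accuracy %.1f vs baseline %.1f (%+.1f points)\n", a, b, a-b }'
```

For a file whose last entry is `val_accuracy 94.2`, this prints the 1.3-point gap in one line. The interpretation (warmup too short, schedule too aggressive) is where Claude Code adds value beyond the arithmetic.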

The iteration decision. Based on the analysis, one of three things happens:

  1. Results are bad. Back to Phase 2. Change the code, adjust hyperparameters, try a different approach. Sync, train, analyze again. The cycle repeats.
  2. Results are promising but not final. More experiments: different seeds for statistical significance, ablation studies to isolate what matters, different datasets to test generalization. More Phase 3-4-5 cycles, but with less code change — mostly config tweaks.
  3. Results are good. Write the paper, or run the final comprehensive set of experiments that will go in the paper. This is where you shift from exploration to consolidation.

This iterative cycle — plan, code, sync, train, analyze, repeat — is the heartbeat of computational research. The system makes each iteration faster and less painful. What used to take a full day of manual work (code, debug SSH issues, babysit the first hour, set an alarm for midnight, wake up to check, manually restart if crashed) becomes a cycle that runs overnight while you sleep.


The Daily Routine

What does life actually look like when you're running this system? Here's a typical day:

Morning. You wake up, reach for your phone, and open WandB in your browser. The loss curve from last night's training run looks smooth — it went all night without crashing. No anomalies in the metrics. You don't need to SSH in. You don't need to check anything else. You shower, make coffee, and head to your desk knowing that your GPUs were productive for the full 10 hours you were away.

Late morning. The overnight training finishes. Claude Code, through its regular CronCreate check, detects that the tmux process has exited. You sit down and ask Claude Code to pull the results. It rsyncs the output logs and metrics, analyzes them, and presents a summary: "Run completed. 50 epochs, best validation accuracy 93.7% at epoch 42. Below the target of 95.5%. The learning rate schedule appears too aggressive — validation loss starts increasing at epoch 35, suggesting overfitting in the later stages."

Afternoon. Based on the analysis, you and Claude Code decide on the next experiment. Reduce the learning rate by half, add cosine annealing, extend warmup from 500 to 2000 steps. Claude Code edits the config, you glance at the diff to confirm it makes sense, it commits to git and rsyncs to the server. A quick foreground smoke test — loss is decreasing, WandB is logging, memory usage is fine. Claude Code launches the next training run in tmux, the watchdog picks it up automatically, and CronCreate is already checking.

Evening. Training is running. You leave. You don't check your phone every 20 minutes. You have dinner with friends. You watch a movie. If something goes catastrophically wrong, Claude Code handles the immediate triage, and you'll see the resolution in tomorrow morning's WandB check.

Night. Claude Code monitors through CronCreate. The watchdog watches. WandB logs every metric. You sleep the whole night through.

Weekend. The same pattern, but more relaxed. You check WandB from your phone once or twice a day. If results came in from a Friday evening run, you might spend 20 minutes on Saturday discussing next steps with Claude Code from the couch — through Termius on your phone. Or you might just let the experiments run and deal with everything on Monday. Your GPUs don't take weekends off, but you can.

The defining feature of this routine is the absence of anxiety. You don't worry about silent failures because the watchdog catches them. You don't worry about wasted GPU time because Claude Code restarts crashed training. You don't cancel plans because the system runs without you. The GPU babysitting problem from Chapter 1 is solved — not by making you a better babysitter, but by removing you from the loop entirely.


Running Multiple Experiments

Once you're comfortable with the single-experiment workflow, you'll naturally want to scale up. You have multiple GPUs, maybe multiple servers — why leave any of them idle?

One server, multiple experiments. If your server has 8 GPUs, you might run two 4-GPU experiments simultaneously, or one 6-GPU experiment and one 2-GPU experiment. Each gets its own tmux session on the server (train01, train02). Each logs to its own WandB run. The single watchdog process on that server monitors all active tmux sessions and reports their collective status in one summary file.

Multiple servers, multiple experiments. Scale this across servers. Each server has its own tmux sessions, its own watchdog instance, its own set of experiments. On your local machine, Claude Code manages all of them — it knows which experiments are running where because it wrote the project CLAUDE.md and the CronCreate jobs. One cron check per server reads that server's summary file, which covers every experiment on that machine. Two servers means two cron jobs, not ten.

The management overhead stays flat. This is the critical insight. Running 6 experiments doesn't require 6 times the effort. Claude Code does the multiplexing. You see a summary: "Server A: train01 running (epoch 34/50), train02 finished (results ready). Server B: train03 running (epoch 12/50), train04 crashed and restarted at epoch 8." One glance and you know the state of everything.
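That one-glance summary can be assembled with a short loop. The directory layout and file naming here are assumptions, not the book's exact setup:

```shell
# Merge each server's watchdog summary into one overview.
WATCHDOG_DIR="${WATCHDOG_DIR:-$HOME/.watchdog}"
for f in "$WATCHDOG_DIR"/*-summary.txt; do
  [ -e "$f" ] || continue                  # skip if no summaries exist
  server=$(basename "$f" -summary.txt)     # serverA-summary.txt -> serverA
  sed "s/^/$server: /" "$f"                # prefix each line with the server
done
```

One cron job per server writes one summary file; one loop like this reads them all. That is why the overhead stays flat as experiments multiply.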

A practical limit. You can realistically manage 4-6 concurrent experiments across 2-3 servers without the complexity becoming counterproductive. Beyond that, you risk losing track of what each experiment is actually testing, and the cognitive overhead of interpreting results from 10 simultaneous runs negates the time savings. Start with one experiment. Add more as you develop the muscle memory.


Common Patterns

The Ablation Sweep

You need to test 5 different configurations to figure out which component of your method actually matters. Instead of running them sequentially:

  1. Claude Code prepares all 5 config files (changing one variable at a time)
  2. Claude Code distributes them across available GPUs — maybe 2 on Server A, 3 on Server B
  3. Each runs in its own tmux session with WandB logging
  4. Watchdog monitors all 5 simultaneously
  5. As each finishes, Claude Code logs the result
  6. When all 5 are done, Claude Code compiles a comparison table
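Step 2 is plain round-robin scheduling. A bash sketch with invented config and server:GPU names; the ssh line in the comment is where a real launch command would go:

```shell
configs=(exp1 exp2 exp3 exp4 exp5)               # one config per ablation
slots=(serverA:0 serverA:1 serverB:0 serverB:1 serverB:2)
for i in "${!configs[@]}"; do
  slot=${slots[$((i % ${#slots[@]}))]}           # wrap if configs > slots
  echo "launch ${configs[$i]} on $slot"
  # e.g. ssh "${slot%%:*}" "tmux new-session -d -s ${configs[$i]} '...'"
done
```

Each config gets its own slot, its own tmux session name, and its own WandB run, so nothing collides.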

You might be at dinner when the last one finishes. You check results over breakfast.

The Overnight Training

The experiment will take 14 hours. It's 6pm.

The old way: launch it, go home, set an alarm for midnight to SSH in and check, set another alarm for 6am to check again. Spend the evening half-distracted, wondering if the data loader hit a bad sample yet.

The new way: launch it. Foreground smoke test passes. Put it in tmux. Go home. Sleep soundly. If it crashes at 2am, Claude Code fixes it. If it finishes at 4am, the results wait for you. Check WandB over your morning coffee: clean loss curve, no interruptions. Start analyzing.

The Conference Deadline Sprint

Three days before the submission deadline. You need results from 8 experiments to fill the tables in your paper. Claude Code is running experiments on 3 servers simultaneously, monitoring all of them, fixing crashes, and organizing results into a structured format. You're writing the paper — the introduction, the method section, the related work. Every few hours you check your phone: "How are the experiments going?" Claude Code gives you a two-line update. Results flow in. Tables get filled. You submit on time, having slept every night.


Checkpoint

You should now be able to describe the complete five-phase pipeline: Idea, Code, Sync, Train, Results. If someone asks "how does your automated research setup work?", your answer takes 60 seconds:

"Claude Code writes code on my local machine, rsyncs it to the GPU server, runs a quick smoke test, then launches training in tmux for the long run. A watchdog script monitors GPU utilization and process health. Periodic cron checks surface any problems automatically. WandB gives me a dashboard I can check from my phone. When training finishes, I pull results back locally and Claude Code analyzes them. If something crashes at 3am, it gets fixed without me. My GPUs never sit idle, and I sleep through the night."

That's the system. Five phases, three machines, one workflow. You have the map. In the next chapter, you'll walk the territory — running a real experiment end-to-end from scratch.
