LLM Workflow · End-to-End Scenario System

Natural language to runnable traffic simulations

This project converts vague traffic requests into executable SUMO scenarios by separating structured parameter extraction from open-ended geometry reasoning, then keeping human corrections as a reusable retraining signal instead of losing them in chat history.

The project covers fine-tuned traffic-parameter extraction from natural-language requests, base-LLM geometry reasoning and XML fallback, and Seoul traffic-data grounding with statistical fallbacks. It also includes execution orchestration, held-out evaluation, correction logging, retraining export, and live admin review.

What this project actually includes
Fine-Tuned LLM
Extract structured traffic parameters from language
The core model is fine-tuned to turn natural-language scene descriptions into speed, volume, lanes, speed limit, sigma, tau, and block-length fields.
Geometry LLM
Classify edits and handle geometry fallback
This layer owns geometry reasoning: it routes edit requests and regenerates road XML when OSM or existing geometry is not enough, while the fine-tuned model stays focused on structured parameter extraction.
Agent and Review Loop
Orchestrate execution, then export trainable corrections
An 11-tool agent runs network build, simulation, and validation, while correction-intent edits are logged with trainability metadata and exported as retraining data.
OpenAI Fine-Tuning SUMO SQLite Multi-LLM Routing Harness Engineering Cloud Run GitHub Actions
The Problem

Why is this hard?

Core Challenge

Creating traffic simulations is still too inconvenient, too slow, and too unrealistic for a natural-language workflow.

Turning a traffic scene into a simulation is still inconvenient and expensive. Building it manually takes too long because road lookup, network construction, config generation, execution, and validation all have to be handled as separate steps before the user can even inspect one result.

Trying to generate the same thing with an LLM is faster, but it often does not represent the real scene well enough. Natural-language requests mix congestion, road type, lane count, speed regime, spacing, and driver behavior, and a general model often fails to convert that into realistic traffic parameters and believable road geometry.

So the result can sound plausible as text while still feeling unrealistic once it is actually run in SUMO. Manual creation takes too long, and LLM-only generation often does not reflect the traffic scene with enough fidelity to be useful.

This project asks: can fine-tuned LLM extraction and a role-separated workflow translate natural-language traffic scenes into runnable simulations with higher fidelity, while making the overall generation process fast enough to use as an actual workflow?

Workflow Cost
Simulation creation is still too manual Road lookup, network building, config generation, simulation, and validation usually require too many separate steps for prompt-driven scenario creation to feel immediate.
Parameter Failure
General LLMs miss traffic interdependencies Speed, volume, lane count, speed limit, sigma, tau, and block spacing are connected. If one field is wrong, the resulting simulation drifts away from realistic traffic behavior.
Realism Gap
Natural-language-only generation looks convincing before execution A prompt can sound correct while still producing free-flow defaults, weak geometry, or unrealistic spacing once the scenario is actually run in SUMO.
Feedback Loss
Even good human fixes are easy to waste If a reviewer corrects the result but that edit disappears into chat history, the system stays expensive to improve and keeps repeating the same mistakes.
Approach

Teach traffic-scene context from real road data, then split generation by responsibility

Core Idea

Build the missing traffic-scene dataset first, fine-tune the language-to-parameter layer, then let specialized components assemble the final simulation.

There is no ready-made dataset that cleanly maps natural-language traffic scenes to simulation-ready parameters, so the project first builds that supervision itself. Real road data from Seoul traffic detectors and synthetic traffic-engineering scenarios are turned into prompt-target pairs, so the fine-tuned model can learn how natural-language descriptions map to structured traffic context instead of only memorizing road names or generic prompt patterns.

Once that layer is learned, generation is split by role. The fine-tuned extractor produces structured traffic parameters, the geometry LLM classifies edits and handles geometry reasoning and XML fallback, and the surrounding system rebuilds the network, runs SUMO, and validates the result. Because traffic-scene data is still scarce, the system also logs correction-intent edits and exports them as retrainable data, so the model can keep improving as more real usage accumulates.

The broader project also includes live Seoul traffic lookup, representative road-type statistics, and similar-road estimation tools that support grounding and fallback around the main generation flow.

Stage 1

Build traffic-scene supervision

Extract prompt-target data from real road observations and synthetic scenarios because off-the-shelf natural-language traffic-scene datasets do not really exist.

Stage 2

Fine-tune language to traffic parameters

Teach the model to read the context of a traffic scene from natural language and output structured fields such as speed, volume, lanes, sigma, tau, and spacing.

Stage 3

Split parameter, geometry, and execution roles

Use specialized components across the fine-tuned extractor, the geometry LLM, and the tool-calling agent so each part handles the part it is best at.

Stage 4

Keep extracting retrainable data

Log correction-intent feedback, separate it from tuning requests, and export reusable training data so the system can be tuned again over time.

Fine-Tuning

Real Seoul traffic data becomes supervised prompt–parameter pairs

Core Idea

Extract observed speed and volume from detector data, estimate driver behavior parameters, then diversify prompts so the model learns traffic situations — not road-name lookup.

Training data comes from Seoul Metropolitan Government detector records (2025.10): speed data covering 31-day hourly averages per road segment, volume data with hourly counts per collection point, and the national standard node-link SHP for speed limits and road geometry. The fine-tuned model is gpt-4.1-mini via the OpenAI Fine-Tuning API. ~70 road segments × 7 time periods × 5 prompt variants = ~2,450 total pairs, split 90/10 into train (2,205) and validation (245).

flowchart LR
    A["Speed detectors\n31-day hourly avg"] --> D["Match by\nroad name & link ID"]
    B["Volume detectors\nhourly count"] --> D
    C["Node-link SHP\nspeed limit · geometry"] --> D
    D --> E["Group by 7\ntime periods"]
    E --> F["Reverse-estimate\nsigma, tau\n(Greenshields)"]
    F --> G["Generate 5\nprompt variants"]
    G --> H["train JSONL\n2,205 (90%)"]
    G --> I["val JSONL\n245 (10%)"]
          
ParameterSourceMethod
speed_kmhObserved31-day hourly average from speed detectors
volume_vphObservedHourly average from volume detectors
lanesObservedMode of lane counts across link segments
speed_limit_kmhObservedNode-link SHP MAX_SPD; road-type heuristic fallback
avg_block_mObservedMean link length from node-link geometry
sigmaEstimatedGreenshields reverse-estimation from observed speed
tauEstimatedGreenshields reverse-estimation from observed speed
reasoningGeneratedRule-based summary of the above values

Each (road, time period) pair produces 5 prompt variants so the model learns from traffic situations, not road name lookup.

Raw data row (originally Korean — translated)
Yangjae-daero suburban arterial 8-lane afternoon moderate
speed 27.0 km/h volume 4,460 vph limit 50 km/h sigma 0.40 tau 1.5 s block 219 m
5 prompts × 1 shared output = 5 training pairs
1
Simulate Yangjae-daero afternoon
2
Yangjae-daero afternoon
3
moderate suburban arterial 8-lane afternoon
4
8-lane arterial afternoon traffic simulation
5
Yangjae-daero -like arterial afternoon conditions
→ shared output
{ "speed_kmh": 27.0, "volume_vph": 4460, "lanes": 4, "speed_limit_kmh": 50,   "sigma": 0.4, "tau": 1.5, "avg_block_m": 219 }

All prompts are originally in Korean — translated for display.

StyleTemplateExample (translated)
Road name + time + action {road} {time} {action} "Simulate Yangjae-daero afternoon"
Road name + time {road} {time} "Yangjae-daero afternoon"
Situational (no name) {congestion} {area} {road_type} {lanes}-lane {time} "moderate suburban arterial 8-lane afternoon"
Generic type + time {lanes}-lane {road_type} {time} traffic simulation "8-lane arterial afternoon traffic simulation"
Mixed {road}-like {road_type} {time} conditions "Yangjae-daero-like arterial afternoon conditions"
Results

The strongest gain appears in the structured extraction layer

Field Fine-tuned Base
speed_kmh5.1%74.6%
volume_vph34.8%48.1%
lanes8.9%13.9%
speed_limit_kmh1.7%23.8%
sigma4.5%21.3%
tau4.6%11.4%
avg_block_m14.5%167.6%
Overall10.6%51.5%

Benchmark: 30 held-out prompts with labels derived from real Seoul traffic data. The qualitative shift is not just lower error, but also lower domain bias in speed and block-spacing prediction.

Headline

Overall MAPE drops from 51.5% to 10.6%

The fine-tuned extractor reduces structured prediction error by about five times on the held-out benchmark.

Largest gain

Speed bias is dramatically reduced

The base model defaults toward unrealistic free-flow speed, while the fine-tuned model brings the estimate much closer to observed traffic conditions.

Weakest field

Volume still needs more supervision

volume_vph remains the hardest field and the clearest candidate for richer future training data.

System Architecture

Prompt to runnable simulation, then back to reusable evidence

The workflow has two connected halves: an online generation path that turns requests into SUMO runs, and a review path that classifies human edits so the system can improve without contaminating its own data.

flowchart TD
    A[User Natural-Language Request] --> B[Fine-Tuned Parameter Extraction]
    B --> C[Structured Scenario Parameters]
    C --> D{Usable real location?}
    D -->|Yes| E[OSM-Based Network Retrieval]
    D -->|No / Failed| F[Geometry LLM — XML Generation]
    E --> G[SUMO Network Build]
    F --> G
    C --> H[Demand / Route / Config Generation]
    G --> I[Runnable SUMO Artifacts]
    H --> I
    I --> J[Scenario Execution]
    J --> K[Validation and Statistics]
    K --> L[User Review]
    L --> M{Intent}
    M -->|Correction| N[Trainable Signal]
    M -->|Tuning| O[Analysis Only]
    N --> P[Correction Export]
    P --> Q[Future Fine-Tuning Data]
            
flowchart LR
    subgraph Frontend
        UI["Web UI\nindex.html"]
        AdminUI["Admin Dashboard"]
        AboutUI["About Page"]
    end

    subgraph Backend["server.py"]
        SSE["SSE Streaming"]
        API["REST API"]
    end

    subgraph LLM["LLM Layer"]
        FT["Fine-tuned Model\ngpt-4.1-mini FT"]
        Base["Base LLM\nGPT / Gemini / Claude"]
        Agent["Tool-Calling Agent\n11 tools"]
    end

    subgraph Tools
        OSM["OSM Network"]
        SUMO["SUMO Generator"]
        TOPIS["TOPIS API"]
        Valid["Validator"]
    end

    subgraph Data
        DB[("SQLite DB")]
        JSONL["Training JSONL"]
    end

    UI -->|"POST /api/simulate"| SSE
    AdminUI -->|"GET /api/admin/*"| API
    SSE --> FT --> Base --> SUMO
    Agent --> Tools
    SUMO --> DB
    DB -->|"export"| JSONL
            
Input
"Create a congested morning intersection in front of a middle school."
1. FT Model — Parameter Extraction
{
  "speed_kmh": 18.5,
  "volume_vph": 2400,
  "lanes": 2,
  "speed_limit_kmh": 30,
  "sigma": 0.72,
  "tau": 0.9,
  "avg_block_m": 120,
  "reasoning": "School zone, 30km/h limit. Morning drop-off congestion, V/C ~0.85."
}
2. Network Build
OSM or LLM-generated XML → netconvert → .net.xml
3. SUMO Execution
avg speed 16.2 km/h · 2,380 vehicles inserted
4. Validation
FT predicted 18.5 km/h vs SUMO 16.2 km/h → error −12.4% → Grade B
Prompt Engineering

Constrained prompts turned format errors from 15% to zero

The fine-tuned model uses a structured system prompt that enforces strict JSON output, required fields, and value-range constraints. This is not a minor implementation detail — without these constraints the model intermittently returned prose, markdown, or partial JSON, making the downstream pipeline unreliable.

System prompt constraints (ft-v1)
  • Strict JSON-only output — no prose, no markdown, no commentary
  • All 8 numeric fields required in every response — never empty or "-"
  • Value ranges enforced: sigma 0–1, tau 0.5–3, lanes 1–8
  • Domain reasoning required in the reasoning field
  • Korean road names and locations supported
Domain calibration rules in prompt
  • School zonespeed_limit_kmh=30, sigma high (0.6+)
  • Highway / expresswayspeed_limit_kmh=80–100, avg_block_m 500+
  • Side street / alleyspeed_limit_kmh=30, lanes=1, avg_block_m 50–80
  • Rush hour — volume high, speed low
  • Late night — volume very low, speed high
Actual system prompt (ft-v1)
You are a traffic engineering expert and SUMO simulation engineer.
When the user describes a road/traffic situation, return only JSON
with the parameters needed for SUMO simulation.
You must fill all 8 fields below with numbers.
Never use empty values or the string '-'.

Output format:
{"speed_kmh": number, "volume_vph": number,
 "lanes": one-way lane count,
 "speed_limit_kmh": number,
 "sigma": between 0~1, "tau": between 0.5~3,
 "avg_block_m": intersection spacing (m),
 "reasoning": "rationale"}
Prompt evolution
VersionApproachResult
rule-v1Rule-based keyword matching, no LLMBaseline; no domain reasoning
ft-v1Fine-tuned with structured constraints0% format errors, 10.6% MAPE

The critical shift was not the model change — it was adding output constraints to the system prompt. Free-form prompting with the same fine-tuned model still produced ~15% JSON failures.

Engineering Detail

How each subsystem actually works

Greenshields reverse-estimation for sigma and tau speed → V/C → driver behavior calibration

Driver imperfection (sigma) and desired headway (tau) cannot be directly measured from detector data. They are reverse-estimated from observed speed via the Greenshields model.

The observed speed is divided by free-flow speed to get a speed ratio, which is mapped to a V/C ratio. The V/C ratio determines the congestion band, and sigma and tau are calibrated accordingly.

speed_ratio = v_observed / v_free
V/C ≈ max(0.05, 1.0 − speed_ratio × 0.85)

For example: observed 15 km/h on a 50 km/h limit road → speed_ratio = 0.33 → V/C = 0.72 → congested band → sigma 0.6–0.8, tau 0.8–1.2 s.

V/C rangesigmatau
> 0.8 (congested)0.6 – 0.80.8 – 1.2 s
0.5 – 0.8 (moderate)0.4 – 0.61.0 – 1.5 s
< 0.5 (free-flow)0.2 – 0.41.5 – 2.5 s
Tool-calling agent — 11-tool orchestration LLM tool-use · autonomous selection

The agent autonomously selects and executes tools based on user requests, using the LLM tool-use feature.

ToolDescription
search_locationGeocode area names to coordinates
build_road_networkBuild SUMO network from OSM
get_traffic_statsQuery local Seoul traffic statistics
generate_simulationGenerate SUMO config files
run_sumoExecute simulation
query_topis_speedReal-time Seoul traffic API
load_csv_dataLoad external traffic data
recommend_roadSuggest similar roads
find_similar_roadsFind roads matching criteria
validate_simulationValidate simulation output
calibrate_paramsCalibrate parameters from results

Example: input "simulate a congested commute road"

1. search_location("commute road") → no location
2. get_traffic_stats("arterial","rush hour") → refs
3. generate_simulation(params) → .net/.rou/.sumocfg
4. run_sumo(config) → avg 22.3 km/h, 1850 veh
5. validate_simulation(results) → grade B, −8.2%

The agent layer uses LLM tool-use. The base LLM is configurable across Claude, GPT, and Gemini.

Role-separated LLM design — 4 components FT extractor · geometry LLM · agent · logging

Each component owns one responsibility. The fine-tuned model does not attempt geometry; the geometry LLM does not attempt structured extraction.

llm_parser.py

Fine-tuned extractor

Parses natural language into structured simulation parameters. Provides the machine-readable target for the rest of the pipeline.

base_llm.py

Geometry LLM

Classifies each user edit as parameter, geometry, or mixed. Handles road layout reasoning and generates fallback XML when OSM fails.

agent.py

Tool-calling agent

11-tool orchestration via LLM tool-use. Autonomous tool selection based on user intent with multi-turn execution.

session_db.py

Logging and export

Stores simulation runs and modification sessions. Separates trainable corrections from non-trainable tuning. Exports retraining JSONL.

Error pattern analysis — directional bias across 30 samples base overpredicts speed +68%, block spacing +165%

Directional bias reveals where each model systematically over- or under-predicts, beyond just the MAPE number.

FieldFT BiasFT AccurateBase BiasBase Accurate
speed_kmh+0.5% (balanced)21/30+68.3% (overpredict)0/30
volume_vph+25.9% (over)13/30+21.4% (over)3/30
speed_limit_kmh−1.7% (balanced)29/30+15.7% (over)11/30
sigma−0.9% (balanced)24/30+9.9% (over)6/30
tau+1.6% (balanced)20/30+0.9% (over)3/30
avg_block_m−0.6% (balanced)20/30+165.0% (overpredict)1/30

Key finding: the base model defaults to free-flow speeds (+68.3% bias, 0/30 accurate) and lacks urban block-structure knowledge (+165% spacing). Fine-tuning corrects both. The remaining weak point is volume_vph (34.8% MAPE) — the most context-dependent field that would benefit from additional training data.

Modification classification — parameter, geometry, or mixed LLM classifier · keyword fallback

When a user requests a change, the geometry LLM first classifies it before routing to the correct handler.

The geometry LLM receives the user's edit request and returns one word: parameter, geometry, or mixed. If the LLM call fails, a keyword heuristic takes over.

User edit → Geometry LLM classifier
  → "parameter" → update speed/volume/sigma/tau
  → "geometry" → regenerate .nod.xml / .edg.xml
  → "mixed" → both paths, then rebuild
TypeExamples
parameter"lower the speed", "increase volume to 3000", "make it more congested"
geometry"add an intersection", "bend the road", "make it a 4-way crossing"
mixed"set speed limit to 70 and make road straight", "add lane and raise volume"
Correction pipeline — human fixes stored in SQLite, exported as retraining data correction vs tuning · trainable flag · JSONL export

Every modification is stored in SQLite with before/after snapshots, modification type, edit intent, and a trainability flag. Only correction-intent records are exported for retraining.

prompt → FT prediction → simulation → user review
  → Correction: stored with trainable=1
  → Tuning: stored with trainable=0

export_corrections_for_training()
  → SELECT WHERE trainable=1 AND intent='correction'
  → sessions_corrections_openai.jsonl
  → merge with train_real_openai.jsonl → re-fine-tune

This separation prevents preference-driven edits from polluting the fine-tuning signal. The admin dashboard at /admin shows correction history, modification breakdowns, and downloadable exports.

SQLite fieldPurpose
edit_intent"correction" or "tuning"
trainable1 = exportable, 0 = analysis only
modification_typeparameter / geometry / mixed
details_jsonBefore/after parameter snapshots
Data sourceSamplesGround truth
Real-data (Seoul)~2,450Observed speed/volume
Corrections (SQL)grows over timeHuman expert fixes on FT outputs
Runtime parameter wiring — how FT output becomes a SUMO simulation network XML · vType injection · two kinds of "speed"

The FT model predicts eight fields. Some define the physical road, some describe driver behavior, and speed_kmh serves as the validation target for calibration.

FT fieldTargetSUMO mechanism
speed_limit_kmh.net.xmlRewrite lane/edge speed after netconvert
lanes.net.xmlNetwork topology / capacity
avg_block_m.net.xmlIntersection spacing (generated geometry)
sigma.rou.xml vTypeKrauss driver imperfection (0–1)
tau.rou.xml vTypeDesired headway in seconds
volume_vphrandomTrips.pyTrip generation rate
max_speed.rou.xml vTypeCapped at limit × 1.05
speed_kmhNowhereValidation target only
Two kinds of "speed"

speed_limit_kmh is a legal/physical cap — "cars cannot go faster than this." Written into network XML.

speed_kmh is the predicted average speed under congestion — used for validation and calibration, not written into SUMO files.

Validation error
error = (sim_speed − FT_speed) / FT_speed × 100%

A ≤ 10% · B ≤ 20% · C ≤ 30% · D ≤ 50% · F > 50%
Automatic calibration loop — proportional control with bounded drift volume · sigma · tau · max 3 iterations · ±20% drift cap

When validation error exceeds ±10%, the calibration loop nudges behavioral parameters so the simulated speed converges toward the FT-predicted target.

Algorithm
1. Compute error = (sim − target) / target
2. If |error| ≤ 10% → converged, stop
3. Adjust proportionally:
  volume × (1 + 0.4 × error)
  sigma + 0.15 × error
  tau + 0.25 × error
4. Clamp to drift bounds
5. Re-run SUMO → repeat (max 3)

Gains are derived from SUMO Krauss model sensitivity analysis. If all parameters hit drift caps without converging, the loop stops early — the error signals a geometry mismatch rather than a parameter error.

ParameterDrift limitEffect
volume_vph±20%Primary congestion lever
sigma±0.15Driver imperfection → capacity
tau±0.3 sHeadway → throughput
Calibration loop demo
Fixed during calibration

speed_limit_kmh, lanes, avg_block_m, and network geometry are never modified. They define the physical road. Calibration only adjusts how vehicles behave on it.

Data separation

Calibrated values are stored in calibrated_params_json — separate from the original FT output. This prevents calibration artifacts from contaminating retraining data.

Simulation UI

Generate, correct, and tune — all in one conversation

One chat session covers the full cycle. First, pick which LLM handles geometry reasoning. Then describe a traffic scene — the fine-tuned model extracts structured parameters, the system builds the road network, and SUMO executes the scenario. If the result is wrong, open Correction mode: fix the parameter or geometry error, and the delta becomes retraining data for the next fine-tuning round. If the result is acceptable but you want a variant, use Tuning mode instead — the change is logged for analysis but kept out of the training signal so preference edits never pollute the dataset. An optional Calibrate button auto-adjusts behavioral parameters so the simulated speed converges toward the FT-predicted target.

The browser flow is also a real working interface, not just a demo shell: simulation progress streams live, the generated network is previewed in-chat, the latest SUMO artifact set can be downloaded as a ZIP, and, in local environments, completed scenarios can be reopened in sumo-gui for manual inspection.

LLM selection demo
Admin Dashboard

Monitor corrections, inspect error patterns, and export retraining data

The admin dashboard at /admin closes the feedback loop by surfacing what the model got wrong and making correction data directly exportable. Every correction is traceable end to end: user edit, SQLite log, admin panel, exported JSONL, re-fine-tune.

Why the admin surface matters

Without an inspection layer, corrections disappear into chat history. The admin dashboard makes error patterns visible — which fields drift, which geometry types fail, and how often corrections are trainable — so the next fine-tuning round targets the right gaps.

Three downloadable exports close the loop: Corrections JSONL for re-fine-tuning, Evaluation Report for grade distribution and fidelity, and LLM Evaluation Report for field-level error analysis and parameter deltas.

Admin dashboard showing correction rate, field error bars, and recent modifications
Dashboard Preview
Correction rates, error bars, and modification logs in one place

Top-level cards show correction rate and fail rate. The LLM evaluation panel breaks down per-field errors and geometry correction types.

Open admin dashboard
Overview

Top cards

Correction rate, mixed fail rate, total simulations, parameter/geometry correction counts at a glance.

Error Analysis

LLM Correction Evaluation

Per-field error bar charts, geometry correction type breakdown, and average parameter deltas.

History

Recent Simulations

Prompt, grade (A/B/C/D), and timestamp for the latest 30 runs.

Modifications

Recent Modifications

Edit intent, modification type, trainable flag, and user input for the latest 50 edits.

Lessons Learned

What worked, what stayed fragile, and what the project actually proves

The main win is not perfect scenario generation. It is that the system makes evaluation and future improvement structurally possible after deployment.

The contribution is the workflow split, not just the benchmark number.

Fine-tuning works best on constrained, structured outputs. Geometry and XML generation is handled by a general-purpose LLM because fine-tuning it was not feasible within the project's budget — this is the clearest remaining limitation. The second contribution is the review loop: the system preserves correction intent, session history, and export eligibility so human review can become clean retraining signal rather than one-off conversation debris.

Geometry generation is the clearest fine-tuning gap.

Road layout and XML generation still rely on in-context prompting with the geometry LLM. Fine-tuning this layer would require structured geometry datasets and significantly more API budget — a clear next step if resources allow.

Structured extraction converges fast even with limited data.

Parameter prediction with constrained JSON converged quickly on ~2,450 samples, while open-ended geometry remained too brittle to treat as the main supervised surface.

Correction versus tuning is a data-quality decision.

Without that split, preference-driven edits would silently pollute the exported retraining dataset and weaken future fine-tuning quality.

Fallbacks are part of the product, not backup code.

OSM lookups and external dependencies fail often enough that XML and geometry fallback must remain visible, supported workflow paths rather than hidden error handlers.