How Claude Scores a Threat: A Walkthrough of Our Severity Reasoning

This post is for the AI engineers and threat-intel analysts who want to know what's under the hood. It walks through how SignalGuard uses Claude Haiku 4.5 to score severity across four signal pillars, what the prompt structure looks like, what we feed in and what we get out, and where the model is genuinely good versus where the architecture has to compensate.

If you're a buyer rather than a builder, the severity calibration post is the version pitched at your altitude. This one assumes you're comfortable with prompts, JSON schemas, and the trade-offs between completion-tuning and structured-output evals.

The architectural shape

A SignalGuard scan fans out 23 parallel API calls — surface-web chatter (X, Reddit, Bluesky, Mastodon, YouTube, TikTok via the Research API, Telegram, GDELT news), environment signals (NWS, NOAA SPC, AccuWeather if BYOK'd, NASA FIRMS, USGS), movement signals (FAA TFRs, FAA NOTAMs, OpenSky, TomTom if BYOK'd, Broadcastify if BYOK'd, Downdetector if BYOK'd), and context signals (DHS NTAS, FBI baseline, FEMA, Ticketmaster, Google Places).

The fan-out completes in 15–40 seconds, gated by the slowest provider. The results land as 23 structured payloads. From that point, the work is synthesis.
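The "gated by the slowest provider" behavior is what a concurrent gather gives you. Below is a minimal sketch of that pattern using asyncio, not SignalGuard's actual code: `fetch_provider`, the provider names, and the timeout value are all hypothetical stand-ins.

```python
import asyncio

# Hypothetical per-provider timeout; the post only says scans finish in 15-40 s.
PROVIDER_TIMEOUT_S = 40.0


async def fetch_provider(name: str, delay_s: float) -> dict:
    """Stand-in for a real provider call; sleeps to simulate network latency."""
    await asyncio.sleep(delay_s)
    return {"provider": name, "status": "ok", "items": []}


async def fan_out(providers: dict[str, float], timeout_s: float = PROVIDER_TIMEOUT_S) -> list[dict]:
    """Run all provider calls concurrently; wall time tracks the slowest call."""

    async def guarded(name: str, delay_s: float) -> dict:
        try:
            return await asyncio.wait_for(fetch_provider(name, delay_s), timeout_s)
        except asyncio.TimeoutError:
            # A slow provider becomes an explicit gap instead of failing the scan.
            return {"provider": name, "status": "timeout", "items": []}

    # gather preserves input order, so payloads line up with the provider list.
    return await asyncio.gather(*(guarded(n, d) for n, d in providers.items()))


if __name__ == "__main__":
    demo = {"nws": 0.01, "opensky": 0.02, "gdelt": 0.5}
    for payload in asyncio.run(fan_out(demo, timeout_s=0.1)):
        print(payload["provider"], payload["status"])
```

Returning a "timeout" payload rather than raising is one possible design choice: it keeps the synthesis step aware that a source is missing instead of silently treating the gap as "no signal".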

We use Claude Haiku 4.5 for two distinct synthesis tasks:

1. Per-post threat classification on the chatter pillar: given a single post, score whether it's threat-relevant and assign a category (logistics, weather-adjacent, political, security-direct, operational, or noise).
2. Cross-pillar executive synthesis: given the four pillars' worth of scored evidence, produce a composite severity, a pillar-decomposed severity, and a reasoning trace explaining the score.

The first task is high-volume, low-latency (~200ms per post, batched), and the model's job is fast structured classification. The second task is single-shot, ~3–4 second latency, and the model's job is calibrated synthesis. Different prompt shapes, different evaluation approaches.

The chatter-classification prompt

For per-post classification, the prompt looks roughly like this (anonymized and shortened):

You are classifying a single social post for relevance to event-security
threat intelligence at a specific venue.

VENUE CONTEXT:
- Venue: {venue_name}
- Location: {lat,lng}, {city}, {country}
- Event window: {start_time} to {end_time}
- Capacity: {capacity}

POST:
- Platform: {platform}
- Author: {author_handle} (followers: {follower_count})
- Posted: {timestamp}
- Text: {post_text}
- Engagement: {likes}, {reshares}, {replies}

Classify the post against the following schema:
- relevance: "high" | "medium" | "low" | "noise"
- category: "logistics" | "weather" | "political" | "security_direct"
           | "operational" | "noise"
- severity_contribution: 0–10 integer
- reasoning: one sentence, concrete reference to the post text

Return JSON only. No prose. If the post is noise, severity_contribution is 0
and category is "noise" — do not over-classify.

Two things worth flagging. First, the explicit instruction to under-classify when in doubt. Severity calibration drift is a real failure mode, and the cheapest fix in the prompt is to anchor the model toward "noise" as the default rather than toward "medium." Second, the requirement to reference the post text in the reasoning. This kills two failure modes at once: hallucinated reasoning that doesn't track the input, and the model producing generic explanations that look correct but aren't grounded.

We run this in structured-output mode against the JSON schema. In our internal benchmarks the model returns schema-valid JSON ~99.7% of the time. The remaining 0.3% gets caught by a Pydantic-shaped validator and re-prompted with the error message.
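The validate-then-re-prompt loop can be sketched as follows. This is an illustration under assumptions, not the production validator: it uses only the standard library rather than Pydantic, `call_model` is a hypothetical callable that takes a prompt and returns the raw reply, and the retry wording is invented for the example.

```python
import json
from typing import Callable

RELEVANCE = {"high", "medium", "low", "noise"}
CATEGORY = {"logistics", "weather", "political", "security_direct", "operational", "noise"}


def validate_classification(raw: str) -> dict:
    """Parse and check one classification; raise ValueError with a readable message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if data.get("relevance") not in RELEVANCE:
        raise ValueError("relevance must be one of high|medium|low|noise")
    if data.get("category") not in CATEGORY:
        raise ValueError("category is not in the allowed set")
    score = data.get("severity_contribution")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("severity_contribution must be an integer from 0 to 10")
    if data["category"] == "noise" and score != 0:
        raise ValueError("noise posts must have severity_contribution 0")
    if not isinstance(data.get("reasoning"), str) or not data["reasoning"].strip():
        raise ValueError("reasoning must be a non-empty string")
    return data


def classify_with_retry(call_model: Callable[[str], str], prompt: str, max_attempts: int = 2) -> dict:
    """Call the model; on validation failure, re-prompt with the error appended."""
    current = prompt
    last_error = ""
    for _ in range(max_attempts):
        raw = call_model(current)
        try:
            return validate_classification(raw)
        except ValueError as exc:
            last_error = str(exc)
            current = (
                f"{prompt}\n\nYour previous reply was rejected: {last_error}\n"
                "Return corrected JSON only."
            )
    raise ValueError(f"no valid classification after {max_attempts} attempts: {last_error}")
```

Feeding the validator's message back into the retry prompt gives the model a concrete thing to fix, which is usually cheaper than resampling blind.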

The synthesis prompt

For cross-pillar synthesis, the prompt is structurally different. Input is the full set of signal payloads from the fan-out, pre-aggregated into one evidence summary per pillar. Output is the composite severity, pillar-decomposed severities, the reasoning trace, and a short list of "actions to consider" (intentionally not "actions to take"; the call is the operator's, not the model's).

The structure:

You are producing an executive threat brief for an event-security operator.
Severity ladder: Clear / Low / Medium / High / Critical. The operator will
read your reasoning and make the operational call.

EVIDENCE BY PILLAR:

CHATTER:
{chatter_evidence_summary}

ENVIRONMENT:
{environment_evidence_summary}

MOVEMENT:
{movement_evidence_summary}

CONTEXT:
{context_evidence_summary}

VENUE CONTEXT:
{venue_metadata}

Produce:
1. composite_severity: one of {Clear,Low,Medium,High,Critical}
2. pillar_severities: {chatter,environment,movement,context} each one of the same scale
3. reasoning: 2-4 sentences, must reference at least two pillars by name and
   at least one specific signal by source
4. compounding_evidence: brief list of where two-or-more pillars reinforce
5. actions_to_consider: 2-5 bullet items, each starting with a verb

Constraints:
- A single pillar at High does not produce composite Critical unless
  another pillar is also at High or above.
- If chatter has elevated severity but environment, movement, and context
  are clear, downgrade composite by one notch.
- Reasoning must be specific. Do not say "various signals" — name them.

The constraints in the prompt are what carry the calibration story from the calibration post. The model could, in principle, output a Critical based purely on a single high-chatter pillar. The constraint stops it. We've tested this rigorously — the cross-pillar gate is the single most important calibration mechanism in the synthesis layer.
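The two prompt constraints can also be expressed as a deterministic check on the model's output. The sketch below is one reading of them, not the shipped logic. It assumes that "elevated" chatter means anything above Clear, and that "downgrade by one notch" means the composite may sit at most one level below the chatter severity. The function name `enforce_gate` is hypothetical.

```python
LADDER = ["Clear", "Low", "Medium", "High", "Critical"]
RANK = {name: i for i, name in enumerate(LADDER)}


def enforce_gate(composite: str, pillars: dict[str, str]) -> str:
    """Cap a proposed composite severity so it satisfies both prompt constraints."""
    rank = RANK[composite]

    # Constraint 1: Critical requires at least two pillars at High or above.
    high_or_above = sum(1 for s in pillars.values() if RANK[s] >= RANK["High"])
    if high_or_above < 2:
        rank = min(rank, RANK["High"])

    # Constraint 2 (assumed reading): if only chatter is above Clear,
    # the composite is capped one notch below the chatter severity.
    chatter = RANK[pillars["chatter"]]
    others_clear = all(
        pillars[p] == "Clear" for p in ("environment", "movement", "context")
    )
    if chatter > RANK["Clear"] and others_clear:
        rank = min(rank, chatter - 1)

    return LADDER[rank]


if __name__ == "__main__":
    chatter_only = {"chatter": "High", "environment": "Clear",
                    "movement": "Clear", "context": "Clear"}
    print(enforce_gate("Critical", chatter_only))  # capped by both constraints
```

Writing the check as a cap rather than a decrement keeps it idempotent: running it on an output the model has already downgraded does not downgrade it a second time.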

A walkthrough of a real reasoning trace

From the same composite scenario used in the 90-minutes timeline post, the synthesis output at 16:18 looked like this:

{
  "composite_severity": "High",
  "pillar_severities": {
    "chatter": "Medium",
    "environment": "High",
    "movement": "High",
    "context": "Medium"
  },
  "reasoning": "Compounding signal across three pillars. Environment trajectory: storm leading edge revised to 18:30 ± 20m per AccuWeather Lightning + NWS Severe Thunderstorm Watch. Movement: Downdetector cell-saturation spike (+340% baseline) on local carrier indicates crowd at or near operational capacity. Context: Ticketmaster Discovery flagged 12K-capacity event 2.4mi at 20:30 which will degrade egress routes 21:30–23:00.",
  "compounding_evidence": [
    "Lightning timing + saturated egress routes",
    "Pre-peak crowd density + adjacent-event egress congestion"
  ],
  "actions_to_consider": [
    "Pre-position evacuation routing",
    "Halt incoming gate ingress at 17:00",
    "Cue weather-hold protocol at 18:00",
    "Coordinate with adjacent venue ops on egress sequencing",
    "Update local PD on revised peak window"
  ]
}

A few things to notice. The reasoning references three pillars by name and four specific signal sources (AccuWeather Lightning, NWS Severe Thunderstorm Watch, Downdetector, Ticketmaster Discovery). The compounding evidence is two short lines rather than a paragraph, because the operator needs to scan it in ten seconds. The actions all start with verbs and stop at five items because longer lists don't get read in operational conditions.
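The format properties noted above are mechanical enough to lint. The following is a rough sketch under assumptions: `lint_brief` and the `known_sources` list are hypothetical, matching is plain case-insensitive substring search, and the "starts with a verb" rule is left out because it would need a part-of-speech tagger.

```python
PILLARS = ("chatter", "environment", "movement", "context")


def lint_brief(brief: dict, known_sources: list[str]) -> list[str]:
    """Return a list of format problems found in a synthesis brief (empty if none)."""
    problems = []
    reasoning = str(brief.get("reasoning", "")).lower()

    # Rule: reasoning must reference at least two pillars by name.
    named = [p for p in PILLARS if p in reasoning]
    if len(named) < 2:
        problems.append(f"reasoning names {len(named)} pillar(s); need at least 2")

    # Rule: reasoning must reference at least one specific signal source.
    if not any(src.lower() in reasoning for src in known_sources):
        problems.append("reasoning names no known signal source")

    # Rule: 2-5 actions to consider.
    actions = brief.get("actions_to_consider", [])
    if not 2 <= len(actions) <= 5:
        problems.append(f"expected 2-5 actions, got {len(actions)}")

    return problems
```

A brief that fails the lint could be re-prompted the same way a schema failure is, although that wiring is not something the post describes.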

The composite is High, not Critical. Two pillars at High clear the cross-pillar gate, so the prompt constraints permit Critical here, but they do not require it. In this case the synthesis identified that one of the High-pillar signals (Movement / Downdetector) is a leading indicator, not a current state: the saturation signals risk in 30 minutes, not present risk now. The model held the composite one notch below Critical. That was the model's judgment rather than something either listed constraint dictates, and whether it's the right call is a calibration question we still review.

Where Haiku is genuinely good

Three places.

First, structured outputs against a constrained schema. Haiku 4.5 returns valid JSON at extremely high rates and doesn't hallucinate fields. This was painful with earlier models and is now boring, which is the right state for an infrastructure dependency.

Second, anchoring outputs in evidence. When the prompt requires references to specific signals, Haiku is reliable about producing references that exist in the input, not invented ones. We measured this against a held-out set — 98.4% of severity reasoning references map to real signals in the input. The 1.6% failure mode is paraphrase-level (slight rewording of the source), not hallucination.
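A grounding rate like the one quoted above can be computed along these lines. This is a simplified sketch, not the evaluation harness behind the 98.4% figure: it assumes references can be detected by matching a fixed vocabulary of provider names, and `mentioned_sources` and `grounding_rate` are hypothetical helpers.

```python
def mentioned_sources(reasoning: str, vocabulary: set[str]) -> set[str]:
    """Provider names from the vocabulary that appear in the reasoning text."""
    text = reasoning.lower()
    return {name for name in vocabulary if name.lower() in text}


def grounding_rate(cases: list[tuple[str, set[str]]], vocabulary: set[str]) -> float:
    """Fraction of mentioned sources that were present in that scan's input.

    Each case is (reasoning text, set of providers actually in the input).
    """
    total = 0
    grounded = 0
    for reasoning, present in cases:
        for name in mentioned_sources(reasoning, vocabulary):
            total += 1
            if name in present:
                grounded += 1
    # With no mentions at all there is nothing ungrounded to report.
    return grounded / total if total else 1.0
```

Substring matching will miss paraphrased references, so a real harness would likely need fuzzier matching or human review for borderline cases.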

Third, calibrated downgrading. Haiku follows under-classification instructions well, which is the opposite of the failure mode most LLMs exhibit. Most models default to over-confident, over-strong classifications. Haiku, properly prompted, defaults to conservative classifications.

Where the architecture compensates

Two places.

First, base-rate awareness. The model doesn't natively know that 90% of NOTAMs in a venue radius are operationally irrelevant. It would, given a NOTAM, treat it as a signal. We compensate by pre-filtering at the data layer: TFRs pass through, and other NOTAMs pass only if they clear a relevance threshold and fall within a tightened radius. The model sees a filtered input, not the raw feed.
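A data-layer pre-filter of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the `Notam` fields, the 0-1 relevance score, the threshold, and the radius are hypothetical, and the rule that TFRs always pass is one reading of the filter described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notam:
    notam_id: str
    is_tfr: bool
    relevance: float    # hypothetical 0-1 score from an upstream heuristic
    distance_km: float  # distance from the venue


def prefilter_notams(
    notams: list[Notam],
    min_relevance: float = 0.6,     # assumed threshold
    max_distance_km: float = 10.0,  # assumed "tightened" radius
) -> list[Notam]:
    """Keep TFRs, plus other NOTAMs that are both relevant and close enough."""
    kept = []
    for n in notams:
        if n.is_tfr:
            kept.append(n)
        elif n.relevance >= min_relevance and n.distance_km <= max_distance_km:
            kept.append(n)
    return kept
```

Filtering before the prompt means the model never has to learn the base rate; it only sees items that have already cleared it.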

Second, temporal context. Haiku doesn't have native access to the venue's history. A Downdetector +340% spike means something different at venues that routinely see those spikes versus venues that don't. We pass a rolling 14-day baseline summary into the synthesis prompt so the model can reason against the baseline rather than the absolute number.
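One way to turn a rolling window into something the prompt can reason against is a ratio to the window's median. This is a sketch under assumptions: the function name, the choice of median, and the output keys are illustrative, and the post does not say how the 14-day summary is actually computed.

```python
from statistics import median


def baseline_summary(history: list[float], current: float) -> dict:
    """Summarize a current reading against a rolling window of past readings."""
    if not history:
        return {"baseline_median": None, "ratio_to_baseline": None, "window": 0}
    base = median(history)
    # Guard against a zero baseline, where a ratio is undefined.
    ratio = current / base if base > 0 else None
    return {"baseline_median": base, "ratio_to_baseline": ratio, "window": len(history)}


if __name__ == "__main__":
    quiet_venue = [100.0] * 14
    spiky_venue = [100.0, 420.0, 110.0, 450.0, 400.0, 95.0, 430.0,
                   105.0, 440.0, 410.0, 100.0, 460.0, 425.0, 435.0]
    # The same reading of 440 (+340% over 100) looks very different per venue.
    print(baseline_summary(quiet_venue, 440.0)["ratio_to_baseline"])
    print(baseline_summary(spiky_venue, 440.0)["ratio_to_baseline"])
```

The median is used here because it is less sensitive than the mean to the occasional extreme day. Note that a +340% change corresponds to a ratio of 4.4 against the baseline.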

Both of these are architecture decisions, not prompt decisions. The lesson is that the model is one component in a calibrated pipeline, not the pipeline itself.

Why Haiku 4.5 specifically

We've tested Sonnet, Opus, and Haiku across the synthesis task. Sonnet is marginally better on the reasoning quality. Opus is meaningfully better on edge cases. Haiku is 3–5x cheaper per scan and 2–4x faster, and at the calibration thresholds we care about, the precision differences between Haiku and Sonnet are below the noise floor on our validation set.

For an infrastructure dependency that fans out across thousands of scans per customer per month, those cost and latency differences compound. The right model is the cheapest one that meets the calibration bar. Haiku 4.5 does.

If you want to inspect the reasoning trace on a live scan, every scan on /scan includes the expandable reasoning panel by default. The synthesis prompt structure is in our docs under the "Severity model" section.