Severity Calibration Is the Hardest Part of Threat Intelligence

Ask a threat-intel platform vendor what their accuracy rate is and they'll usually quote a recall number. "We caught 94% of incidents in the validation set." That's the wrong number. The number that matters is precision at the severity threshold you act on. The deeper number is the false-positive rate at that threshold. And underneath both is what calibration drift looks like at month three of deployment.

This post is the case for treating severity calibration as the central engineering problem of the category, not the marketing one.

The base-rate trap

Pick a venue. Pick a year. Count the events at that venue that required an actual operational response: a weather hold, a delayed gate, a partial evacuation, a coordination call with local PD. For most venues running 60–200 events a year, that's somewhere between 4 and 12 events. So the base rate of "something operationally significant happens" runs from about 2% (4 of 200) to 20% (12 of 60) of event-days.

Now imagine a threat-intelligence system that fires a "high severity" alert on 10% of event-days. Naively, that sounds calibrated. If your base rate is 8%, you're roughly matched. In practice, you're not, because matching the alert rate to the base rate says nothing about overlap. The events you flag are often not the events that matter, and after three weeks your operations team has stopped looking at the alert color.
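
To make the trap concrete, here is a minimal Python sketch with made-up numbers: 100 event-days, 8 that needed a response, 10 flagged High. The day indices and the function name are illustrative assumptions, not SignalGuard data or code. It shows an alert rate sitting close to the base rate while precision stays low.

```python
def alert_stats(total_days, significant_days, flagged_days):
    """Return (base_rate, alert_rate, precision, recall) for one season."""
    # Flagged days that really did need an operational response.
    hits = significant_days & flagged_days
    base_rate = len(significant_days) / total_days
    alert_rate = len(flagged_days) / total_days
    precision = len(hits) / len(flagged_days) if flagged_days else 0.0
    recall = len(hits) / len(significant_days) if significant_days else 0.0
    return base_rate, alert_rate, precision, recall


# Hypothetical season: day indices are invented for illustration.
significant = {3, 17, 22, 41, 58, 63, 77, 90}
flagged = {3, 5, 12, 22, 30, 44, 51, 66, 80, 95}

base_rate, alert_rate, precision, recall = alert_stats(100, significant, flagged)
print(f"base rate {base_rate:.0%}, alert rate {alert_rate:.0%}")
print(f"precision {precision:.0%}, recall {recall:.0%}")
```

In this invented season the alert rate (10%) looks matched to the base rate (8%), yet only 2 of the 10 High alerts land on a day that mattered.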

The base-rate trap is the reason most threat-intel deployments fail in months 2–4. Not because the technology is wrong, but because the calibration drifted, the team stopped reading the High pills as High, and the system became wallpaper.

The four-pillar trick

The reason SignalGuard fuses signals across four pillars instead of relying on a single composite score alone is calibration-driven, not architectural. A high severity on a single pillar is, statistically, mostly noise. A high severity on two pillars is a small but real signal. A high severity on three pillars compounds nonlinearly into something that warrants a call.
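
One way to picture the cross-pillar logic is a rule that counts how many pillars sit at High or above. The sketch below is an illustration only, not SignalGuard's scoring code: the `Severity` enum, the `fused_action` function, and the action labels "monitor" and "review" are all assumptions made for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    CLEAR = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def fused_action(pillars):
    """Map per-pillar severities to an action by counting pillars at High or above."""
    high_count = sum(1 for s in pillars.values() if s >= Severity.HIGH)
    if high_count >= 3:
        return "call"     # three or more pillars agree: warrants a call
    if high_count == 2:
        return "review"   # small but real signal
    return "monitor"      # zero or one pillar: treated as mostly noise


one_pillar = {
    "chatter": Severity.HIGH,
    "environment": Severity.LOW,
    "movement": Severity.LOW,
    "context": Severity.CLEAR,
}
three_pillars = {
    "chatter": Severity.HIGH,
    "environment": Severity.HIGH,
    "movement": Severity.CRITICAL,
    "context": Severity.LOW,
}
print(fused_action(one_pillar), fused_action(three_pillars))
```

The step from "review" to "call" is deliberately a jump rather than a sum, which is one simple way to express "compounds nonlinearly."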

This isn't novel — every threat-intel analyst worth talking to already does this in their head. The product question is whether you surface the pillar decomposition or hide it behind the composite. We surface it. The four chips on the dashboard — Chatter, Environment, Movement, Context — are how a calibrated reader sees through a Medium-composite that's actually a Critical-Environment + Low-everything-else (which is one decision) versus a Medium-Chatter + Medium-Movement + Medium-Context (which is a totally different decision).
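
A small sketch shows how a composite can hide the difference. It assumes a hypothetical composite (the rounded mean of the pillar levels), which is not necessarily how SignalGuard computes its headline score, and it assumes Environment is Low in the second profile because the text above leaves it unspecified.

```python
LEVELS = ["Clear", "Low", "Medium", "High", "Critical"]


def composite(pillars):
    """Hypothetical composite: the mean pillar level, rounded to the nearest label."""
    mean = sum(LEVELS.index(level) for level in pillars.values()) / len(pillars)
    return LEVELS[round(mean)]


single_pillar_spike = {
    "Chatter": "Low",
    "Environment": "Critical",
    "Movement": "Low",
    "Context": "Low",
}
broad_agreement = {
    "Chatter": "Medium",
    "Environment": "Low",  # assumed; not specified in the text
    "Movement": "Medium",
    "Context": "Medium",
}

for name, pillars in [("spike", single_pillar_spike), ("broad", broad_agreement)]:
    print(name, composite(pillars), pillars)
```

Both profiles collapse to the same Medium headline under this averaging rule, and only the per-pillar chips tell them apart.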

What we got wrong in v1

For roughly the first eight months of the product, SignalGuard's severity scoring was monotonic in the number of high-signal events surfaced. More flagged Reddit posts, higher severity. More NOTAMs in the radius, higher severity. It seems obviously correct. It is also obviously wrong.

Here's why. Reddit posts cluster in time around events that are already public knowledge (an artist gets in the news, the venue subreddit lights up). NOTAMs in a venue radius are mostly TFR-adjacent VIP movements that have no operational implication for the venue itself. A scoring function that rewards signal volume rewards events that already have public attention, which is the opposite of what a threat-intel system should do.

The recalibration we shipped in version 2 introduced two changes. First, severity per pillar is bounded — a single pillar tops out at "High" unless cross-pillar confirmation is present. Second, the synthesis layer (which is where the Claude Haiku 4.5 reasoning trace lives) gets explicit instructions to discount signal that's already common knowledge.
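
The first change, bounding a pillar at High without confirmation, can be sketched as a small capping rule. This is an illustration under stated assumptions, not the shipped implementation: "cross-pillar confirmation" is taken here to mean at least one other pillar at Medium or above, and `bound_pillars` and `confirm_at` are hypothetical names.

```python
ORDER = ["Clear", "Low", "Medium", "High", "Critical"]


def bound_pillars(raw, confirm_at="Medium"):
    """Cap each pillar at High unless another pillar reaches `confirm_at` or above."""
    rank = {label: i for i, label in enumerate(ORDER)}
    bounded = {}
    for name, level in raw.items():
        # Confirmation (assumed definition): some *other* pillar is elevated too.
        confirmed = any(
            rank[other_level] >= rank[confirm_at]
            for other, other_level in raw.items()
            if other != name
        )
        if rank[level] > rank["High"] and not confirmed:
            level = "High"  # an unconfirmed Critical is capped
        bounded[name] = level
    return bounded


unconfirmed = {"Chatter": "Critical", "Environment": "Low", "Movement": "Clear", "Context": "Low"}
confirmed = {"Chatter": "Critical", "Environment": "Medium", "Movement": "Clear", "Context": "Low"}
print(bound_pillars(unconfirmed)["Chatter"])
print(bound_pillars(confirmed)["Chatter"])
```

Where the confirmation bar sits is itself a calibration choice; the sketch exposes it as a parameter for that reason.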

The result was a 38% reduction in High/Critical alerts on the validation set, with no measurable change in the operationally significant catch rate. False positives went down. True positives stayed flat. That is the only honest calibration story.

The false-positive rate is the product

The KPI that matters in this category is what we call "operator trust at month six." If the security director still reads the dashboard at month six, the product worked. If they've moved it to a secondary tab, the product failed, regardless of whether the catch rate is technically high.

The mechanism for sustaining operator trust is precision at the action threshold. If "High" means "stop what you're doing and read the brief," then High needs to mean that 80%+ of the time. We currently sit around 71% in internal validation against incidents-that-required-response. We are not done.

What this looks like in the product:

Severity pills are visible on every page. Clear / Low / Medium / High / Critical, same five words, same five colors, every surface. If the operator sees a Medium on a card, it means the same thing as a Medium on the dashboard, which means the same thing as a Medium in the PDF export.
Reasoning traces are visible by default. Every severity score on a scan brief has an expandable "why this score" panel. The synthesis text is generated; the underlying signal references are deterministic.
The audit log captures every severity transition. On /audit, an operator can trace a Medium → High transition back to the specific signal that moved it. This matters legally (after-action review) and matters operationally (catching calibration drift).
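
As a rough picture of what tracing a transition involves, here is a minimal sketch of an audit-log lookup. The `Transition` record, its field names, the scan IDs, and the signal references are all hypothetical; the actual schema behind /audit is not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    scan_id: str
    old: str
    new: str
    signal_ref: str  # identifier of the signal that moved the score


def trace(log, scan_id, old, new):
    """Return the signal references behind every old -> new transition for a scan."""
    return [
        t.signal_ref
        for t in log
        if (t.scan_id, t.old, t.new) == (scan_id, old, new)
    ]


# Invented log entries for illustration.
log = [
    Transition("scan-001", "Low", "Medium", "chatter:post-a"),
    Transition("scan-001", "Medium", "High", "environment:alert-b"),
    Transition("scan-002", "Medium", "High", "movement:notice-c"),
]
print(trace(log, "scan-001", "Medium", "High"))
```

Because every transition carries a reference to the signal that caused it, the same log supports both after-action review and a periodic check for drift.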

What honest calibration looks like

The honest version of this conversation is: any vendor in this category that won't tell you their false-positive rate at the action threshold is selling you a number that hasn't been validated. Ask. Specifically: "What percentage of your High-severity events in the last 90 days resulted in an operator-confirmed response?" If they can't answer, the system is uncalibrated.
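
The 90-day question is simple to compute once alerts carry an operator-confirmation flag. The sketch below assumes each alert is a dict with `date`, `severity`, and `operator_confirmed` keys; those field names and the sample records are invented for illustration and do not describe any vendor's real export format.

```python
from datetime import date, timedelta


def high_precision_last_90_days(alerts, today):
    """Share of High alerts in the trailing 90 days with an operator-confirmed response."""
    cutoff = today - timedelta(days=90)
    recent = [
        a for a in alerts
        if a["severity"] == "High" and cutoff <= a["date"] <= today
    ]
    if not recent:
        return None  # no High alerts in the window: nothing to measure
    confirmed = sum(1 for a in recent if a["operator_confirmed"])
    return confirmed / len(recent)


# Invented sample records.
alerts = [
    {"date": date(2025, 6, 1), "severity": "High", "operator_confirmed": True},
    {"date": date(2025, 5, 15), "severity": "High", "operator_confirmed": False},
    {"date": date(2025, 4, 20), "severity": "High", "operator_confirmed": True},
    {"date": date(2025, 1, 10), "severity": "High", "operator_confirmed": True},   # outside window
    {"date": date(2025, 6, 10), "severity": "Medium", "operator_confirmed": False},  # below threshold
]
share = high_precision_last_90_days(alerts, today=date(2025, 6, 30))
print(f"{share:.0%} of High alerts were operator-confirmed")
```

Returning `None` for an empty window is a deliberate choice: a system that fired no High alerts has not demonstrated precision, it has only been quiet.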

We share ours on request, with the caveat that 71% is what we currently measure and we're working to improve it. The system that gets to 90% is also the system that under-fires, which is its own failure mode. The calibration problem doesn't have a closed-form solution. It has a regime: keep the precision at the action threshold visible to the operator, keep the pillar decomposition visible, keep the audit trail intact. The drift will happen. The product is what catches it.

If you want to see how the four-pillar decomposition reads in practice, run a free scan at /scan and look at the brief structure. The composite score is the headline. The pillar chips are the actual signal.