Controlling AI token usage without sacrificing security outcomes


AI in the SOC is no longer a “nice to have”. It is rapidly becoming part of day-to-day triage, investigation, summarisation and even remediation. The catch is that many security teams discover the cost problem only after adoption has succeeded: usage climbs, token consumption spikes, and budgets get blown without any clear link to risk reduction.

This article lays out practical, security-first ways to control AI token usage while maintaining high detection quality and response speed. It is written for CISOs, CTOs, security managers, and SOC leads who need governance that works in real operations, not just in policy documents.

Tokens in plain terms (and why security teams should care)

In modern LLMs, tokens are the metered unit of text that the model reads and writes. Your cost typically tracks:

Input tokens: what you send (prompts, alert context, logs, playbook steps)
Output tokens: what the model returns (summaries, next steps, recommended queries)
Hidden multipliers: retries, tool calls, multi-step “agent” loops, long context windows, and verbose outputs
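
As a rough illustration of how these three factors combine, here is a minimal sketch of a per-call cost estimate. The `Pricing` class, the `estimate_cost` function, and the prices are hypothetical placeholders, not any provider's real rates; the point is only that retries multiply the whole call, not just the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices per million tokens; substitute your provider's rates."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int,
                  pricing: Pricing, attempts: int = 1) -> float:
    """Estimate spend for one AI call, multiplied by the number of attempts."""
    per_attempt = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    return per_attempt * attempts


# Assumed example: a 6,000-token alert context with a 500-token reply.
pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
single = estimate_cost(6_000, 500, pricing)
with_retries = estimate_cost(6_000, 500, pricing, attempts=3)
print(f"single call: {single:.4f}, with two silent retries: {with_retries:.4f}")
```

Under these assumed prices, two silent retries triple the cost of the same triage decision, which is why retries belong in the "hidden multipliers" category.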

In a SOC workflow, token usage grows quickly because we tend to attach “just in case” context: full alert payloads, multiple log samples, user history, asset inventories, and past tickets. That can be useful, but it is rarely free.

The most common token budget failure in security operations is not “bad prompts”. It is unbounded context combined with always-on AI in high-volume queues.

Why uncontrolled token usage derails AI budgets in the SOC

Token spend becomes unpredictable when:

Every alert gets the same heavyweight AI treatment, even when 60–80% are routine.
Prompts are inconsistent across analysts, shifts, and teams (driving output bloat and rework).
Large models are used for simple tasks (summarising, classifying, extracting fields).
AI systems retry silently due to timeouts, tool failures, or poor routing.
Success metrics focus on “activity” (how many AI runs) rather than on security outcomes (how many incidents were closed faster, how many true positives were found).

When this happens, the CFO sees rising AI costs, while the CISO struggles to prove impact. The result is often blunt cost-cutting that harms response quality.

The core principle: link tokens to risk reduction, not curiosity

A workable model is to treat tokens like any other operational resource: you allocate them where they materially reduce risk or time-to-containment.

Instead of “How do we reduce tokens?”, ask:

Which AI actions reduce mean time to triage (MTTT) and mean time to respond (MTTR)?
Which actions improve detection fidelity (fewer misses, better prioritisation)?
Which actions prevent analyst fatigue without masking high-risk cases?

If you can map token spend to these outcomes, governance becomes credible and easier to defend during procurement and renewal.

Prompt discipline that cuts costs without cutting capability

Prompt discipline is not about making prompts shorter at all costs. It is about making them bounded, structured, and repeatable.

Use bounded context blocks

Give the model exactly what it needs for the specific decision, not the entire incident universe.

Prefer a short, curated “incident brief” over raw event dumps.
Attach a maximum number of log lines (for example: “top 50 most recent, plus 10 around the suspected execution time”).
Strip fields that rarely affect triage (large nested JSON, verbose metadata).
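
The three rules above can be sketched as a small pre-processing step that runs before any prompt is built. The field names in `NOISY_FIELDS` and the function `build_incident_brief` are assumptions for illustration; your alert schema will differ, and the log lines are assumed to arrive oldest first.

```python
from typing import Any

# Hypothetical names of fields that rarely affect triage; adjust to your schema.
NOISY_FIELDS = {"raw_payload", "debug", "vendor_metadata"}


def build_incident_brief(alert: dict[str, Any], log_lines: list[str],
                         max_lines: int = 50) -> dict[str, Any]:
    """Return a curated brief: noisy fields removed and log lines capped."""
    brief = {k: v for k, v in alert.items() if k not in NOISY_FIELDS}
    # Keep only the most recent lines (input is assumed to be oldest first).
    brief["log_lines"] = log_lines[-max_lines:] if max_lines > 0 else []
    # Record how much was dropped so the analyst knows the brief is partial.
    brief["log_lines_omitted"] = max(0, len(log_lines) - max_lines)
    return brief


alert = {"id": "A1", "severity": "high", "raw_payload": {"nested": "..."}}
logs = [f"line {i}" for i in range(120)]
brief = build_incident_brief(alert, logs)
print(sorted(brief), len(brief["log_lines"]), brief["log_lines_omitted"])
```

Recording the omitted count is a deliberate choice: the model and the analyst can both see that context was trimmed, rather than assuming the brief is complete.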

Constrain output deliberately

A common token leak is an overly helpful model. Ask for the minimum useful output:

“Return 5 bullet points max”
“Return a one-paragraph summary and 3 next actions”
“If confidence is low, ask one clarifying question only”

Standardise prompts by workflow, not by person

Create a small library of SOC prompts aligned to repeatable tasks:

Alert classification
Incident summary for ticketing
“What changed?” analysis
Recommended containment steps
Executive update (tight, non-technical)

This reduces output variance and prevents “prompt inflation” over time.

Example: a tight triage prompt pattern you can standardise

You are assisting SOC triage. Use only the provided context.
Goal: decide whether this alert is actionable and what to do next.

Return exactly:

Decision: (benign/suspicious/likely incident)
Why: 3 bullets, each under 18 words
Next actions: 3 steps, each starting with a verb
Confidence: high/medium/low

If context is insufficient, ask one clarifying question and stop.
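
A fixed output format is only useful if something checks it. The following is a minimal sketch of a validator for the pattern above, assuming the model writes each section label on its own line followed by bulleted or numbered items. The function name and that layout are assumptions, not part of any product.

```python
import re

ALLOWED_DECISIONS = {"benign", "suspicious", "likely incident"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
SECTION_KEYS = {"decision", "why", "next actions", "confidence"}


def validate_triage_reply(reply: str) -> list[str]:
    """Return a list of format problems; an empty list means the reply conforms."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw in reply.splitlines():
        line = raw.strip()
        if not line:
            continue
        head, sep, rest = line.partition(":")
        key = head.strip().lower()
        if sep and key in SECTION_KEYS:
            current = key
            sections[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            # Drop a leading bullet or list number such as "- " or "2. ".
            sections[current].append(re.sub(r"^(?:[-*]|\d+[.)])\s+", "", line))

    problems = []
    if " ".join(sections.get("decision", [])).lower() not in ALLOWED_DECISIONS:
        problems.append("decision missing or not an allowed value")
    why = sections.get("why", [])
    if len(why) != 3 or any(len(b.split()) >= 18 for b in why):
        problems.append("why must be 3 bullets, each under 18 words")
    if len(sections.get("next actions", [])) != 3:
        problems.append("next actions must contain exactly 3 steps")
    if " ".join(sections.get("confidence", [])).lower() not in ALLOWED_CONFIDENCE:
        problems.append("confidence missing or not an allowed value")
    return problems
```

A reply that fails validation is a candidate for the rework-rate metric rather than an automatic retry. A single clarifying question (the insufficient-context case) will also fail this check, so a caller would need to handle that case separately.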

Role-based token quotas that feel fair (and actually work)

Quotas fail when they are arbitrary. They succeed when they mirror operational responsibility and risk.

A practical approach:

Tier 1 analysts: smaller per-shift quota with strong routing and tight prompts (high volume, repeatable tasks)
Tier 2/3 analysts: higher quota for deeper investigations (lower volume, higher complexity)
Incident commanders / on-call leads: reserved quota for major incidents (burst capacity)
Engineering/detection: separate quota bucket for tuning, experimentation, and content creation (so ops isn’t penalised)

The goal is not to ration intelligence. It is to stop the “default to expensive AI” habit.

Combine quotas with “approved high-impact actions” (for example, malware triage, lateral movement checks, containment guidance). This keeps teams focused on outcomes, not gaming the meter.
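
One way to picture the quota tiers and the incident burst mechanism is a small ledger. This is a sketch under stated assumptions: the role names, the quota figures, and the `QuotaLedger` class are all hypothetical, and a real deployment would persist usage and reset it per shift.

```python
from dataclasses import dataclass, field

# Hypothetical per-shift token quotas by role; tune these to your own volumes.
ROLE_QUOTAS = {
    "tier1": 200_000,
    "tier2": 600_000,
    "incident_lead": 1_000_000,
    "detection_eng": 500_000,  # separate bucket so ops is not penalised
}


@dataclass
class QuotaLedger:
    burst_pool: int = 2_000_000  # shared reserve for major incidents
    used: dict[str, int] = field(default_factory=dict)

    def charge(self, user: str, role: str, tokens: int,
               major_incident: bool = False) -> bool:
        """Charge tokens to the user's quota, or to the burst pool during a
        declared major incident. Returns False if the request is refused."""
        spent = self.used.get(user, 0)
        if tokens <= ROLE_QUOTAS[role] - spent:
            self.used[user] = spent + tokens
            return True
        if major_incident and tokens <= self.burst_pool:
            self.burst_pool -= tokens
            return True
        return False


ledger = QuotaLedger(burst_pool=100_000)
print(ledger.charge("ana", "tier1", 150_000))                       # within quota
print(ledger.charge("ana", "tier1", 100_000))                       # over quota
print(ledger.charge("ana", "tier1", 100_000, major_incident=True))  # uses burst
```

Keeping the burst pool separate from personal quotas means a major incident never has to wait on an individual analyst's remaining allowance, while routine over-use is still refused.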

Route simple tasks to smaller models (and save large models for high-risk work)

Many SOC tasks do not require the most capable model. The best cost control is model routing:

Small/fast model for:
- Field extraction (entities, IPs, usernames, hostnames)
- Simple classification (phishing vs non-phishing)
- Formatting (ticket templates, short summaries)
Mid-tier model for:
- Correlating a handful of signals
- Drafting investigation steps
- Producing a concise incident narrative
Large model for:
- Complex multi-source reasoning
- Novel attacker behaviour
- Cross-incident correlation and hypothesis testing
- High-stakes response guidance where accuracy matters most

Routing should be automatic and based on alert severity, asset criticality, and uncertainty, not analyst preference.
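
A routing rule of this kind can be very small. The sketch below assumes three inputs (a severity label, an asset-criticality flag, and an uncertainty score between 0 and 1) and uses hypothetical thresholds. The tier names are placeholders for whichever models you actually deploy.

```python
def route_model(severity: str, asset_critical: bool, uncertainty: float) -> str:
    """Pick a model tier from alert attributes, never from analyst preference.

    The thresholds (0.7 and 0.4) are illustrative assumptions only.
    """
    if severity in {"high", "critical"} or uncertainty >= 0.7:
        return "large"
    if severity == "medium" or asset_critical or uncertainty >= 0.4:
        return "mid"
    return "small"


print(route_model("low", False, 0.1))      # routine alert on an ordinary asset
print(route_model("low", True, 0.1))       # same alert on a critical asset
print(route_model("low", False, 0.9))      # low severity but highly uncertain
```

Because the function takes only alert attributes, the same alert always reaches the same tier regardless of who is on shift, which also makes the routing decision easy to audit.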

Observability: measure cost per incident, per triage, and per user

If you cannot see token spend at the unit-of-work level, you cannot govern it.

Build (or demand) observability that answers:

Cost per alert triage
Cost per confirmed incident
Cost per closed incident
Cost per user / per team / per shift
Tokens per workflow (triage vs investigation vs reporting)
Rework rate (how often the AI output required a second run)

Then tie those metrics to outcomes:

Time saved per incident
Reduction in escalation volume
Improved true positive rate
Faster containment of high-severity cases
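
If your tooling can export one usage record per AI call, most of the unit-of-work metrics above fall out of simple aggregation. The record fields below (`alert_id`, `user`, `workflow`, `tokens`, `rerun`) are an assumed schema for illustration, not a vendor format.

```python
from collections import defaultdict

# Hypothetical usage records, one per AI call.
records = [
    {"alert_id": "A1", "user": "ana", "workflow": "triage", "tokens": 1200, "rerun": False},
    {"alert_id": "A1", "user": "ana", "workflow": "triage", "tokens": 1100, "rerun": True},
    {"alert_id": "A2", "user": "ben", "workflow": "investigation", "tokens": 5400, "rerun": False},
    {"alert_id": "A3", "user": "ana", "workflow": "reporting", "tokens": 800, "rerun": False},
]


def tokens_by(rows: list[dict], key: str) -> dict[str, int]:
    """Total tokens grouped by any record field (user, workflow, alert_id)."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["tokens"]
    return dict(totals)


def rework_rate(rows: list[dict]) -> float:
    """Share of AI calls that were second runs of an earlier request."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["rerun"]) / len(rows)


print(tokens_by(records, "workflow"))
print(tokens_by(records, "user"))
print(rework_rate(records))
```

Multiplying the token totals by your contracted prices gives cost per workflow or per user; the outcome side (time saved, escalations avoided) has to come from your ticketing data.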

A simple governance metric that executives understand

Use a single, outcome-linked indicator alongside raw spend, such as:

Tokens per confirmed incident closed (trend should fall over time)
£ per incident closed (paired with MTTR improvements)
High-severity incidents handled within SLA per £ of AI spend

These are not perfect, but they move the conversation from “AI is expensive” to “AI is efficiently reducing risk”.
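
For example, the first two indicators are just a ratio tracked over time. This sketch uses made-up weekly figures and hypothetical helper names; the same function works for tokens or for spend in pounds.

```python
from typing import Optional


def per_closed_incident(total: float, incidents_closed: int) -> Optional[float]:
    """Tokens (or spend) per confirmed incident closed; None if none were closed."""
    if incidents_closed <= 0:
        return None
    return total / incidents_closed


def is_falling(values: list[float]) -> bool:
    """True when every period is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Illustrative figures only: (period, total tokens, confirmed incidents closed).
periods = [("week 1", 9_000_000, 30), ("week 2", 8_400_000, 35)]
trend = [per_closed_incident(tokens, closed) for _, tokens, closed in periods]
print(trend, is_falling(trend))
```

Returning `None` when no incidents were closed avoids reporting a misleading zero or a division error in a quiet week.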

Balancing time to first token, latency, and accuracy in SOC workflows

Security operations have a different performance profile from general enterprise use. The “best” model is not always the one with the highest benchmark score.

Where speed matters most

First-pass triage in high-volume queues
Analyst assist for next-step recommendations
Rapid incident updates during active response

In these moments, favour low latency and predictable behaviour, even if the output is simpler.

Where accuracy matters most

Containment decisions that could disrupt business operations
Major incident coordination
Executive and regulatory reporting language
High-confidence attribution statements (often better avoided entirely)

In these moments, spend tokens on stronger reasoning, more context, and stricter validation.

A fast, wrong answer is worse than no answer when it triggers unnecessary containment or hides a real compromise. Token governance must include a “high-stakes accuracy override”.

Design a token governance blueprint for AI security policies

Token governance works best when it is integrated into existing security and procurement controls, rather than bolted on as a finance-only limiter.

Define “approved AI use cases” in the SOC

Start with a short, reviewed list:

Allowed: summarisation for tickets, evidence checklists, recommended next actions, enrichment requests
Restricted: autonomous remediation, direct blocking actions, irreversible changes without human approval
Prohibited (or tightly controlled): sensitive data re-prompting, uncontrolled log exports, unapproved connectors

Create workflow-level guardrails

Maximum context size for each workflow
Output format constraints
Escalation rules for uncertainty (“ask one question, then escalate”)
Mandatory citations to internal evidence (where your tooling supports it)
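
Guardrails like these are easiest to enforce when they live in configuration rather than in individual prompts. The sketch below assumes a hypothetical `WorkflowGuardrail` record and illustrative limits; the token count is assumed to be supplied by whatever tokeniser your platform uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowGuardrail:
    max_context_tokens: int
    max_output_tokens: int
    require_evidence_citations: bool
    uncertainty_rule: str = "ask one question, then escalate"


# Illustrative limits for two approved workflows; the values are assumptions.
GUARDRAILS = {
    "triage": WorkflowGuardrail(4_000, 300, True),
    "incident_summary": WorkflowGuardrail(8_000, 600, True),
}


def check_request(workflow: str, context_tokens: int) -> str:
    """Allow a request only for an approved workflow within its context limit."""
    rail = GUARDRAILS.get(workflow)
    if rail is None:
        return "reject: workflow not approved"
    if context_tokens > rail.max_context_tokens:
        return "reject: context exceeds workflow limit"
    return "allow"


print(check_request("triage", 3_000))
print(check_request("triage", 9_000))
print(check_request("ad_hoc_chat", 500))
```

Rejecting unknown workflows by default keeps the "approved AI use cases" list meaningful: anything new has to be added to the configuration and reviewed first.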

Implement role-based access and spend controls

Quotas
Burst allocation for incidents
Exception handling with audit trails

Require observability and auditability from vendors

In procurement, ask for:

Token usage reporting at incident/alert/user granularity
Admin controls for routing and prompt templates
Data residency and tenant isolation assurances
Clear retention and data handling policies
Evidence of SOC-specific design (not generic chat tooling)

Review monthly like any other SOC KPI

Make it part of the operating rhythm:

What drove token spikes?
Which workflows are inefficient?
Which prompts or automations reduced MTTR?
Where did the AI output cause rework?

Practical cost controls that do not weaken detection

A small set of controls typically delivers most of the savings:

Context minimisation: summarise first, then reason
Two-step patterns: small model extracts facts; larger model reasons only when needed
Stop conditions: cap retries and agent loops
Output limits: force concise responses
Routing by risk: reserve large models for high-severity or high-uncertainty cases
Template governance: curated prompts per workflow, versioned and reviewed

Used together, these controls reduce spend while improving consistency, which is often the hidden driver of SOC performance.
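
Two of these controls, stop conditions and the two-step pattern, can be sketched in a few lines. The model calls are passed in as plain functions because the actual client API depends on your vendor; `call_with_cap`, `two_step`, and their parameters are hypothetical names.

```python
from typing import Any, Callable, Optional


def call_with_cap(call: Callable[[], str],
                  max_attempts: int = 2) -> tuple[Optional[str], int]:
    """Run call() at most max_attempts times; return (result or None, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(), attempt
        except TimeoutError:
            continue  # retry only on timeouts; other errors propagate
    return None, max_attempts


def two_step(alert_text: str,
             extract: Callable[[str], dict[str, Any]],
             reason: Callable[[dict[str, Any]], str],
             needs_reasoning: Callable[[dict[str, Any]], bool]) -> dict[str, Any]:
    """Small model extracts facts; the larger model runs only when needed."""
    facts = extract(alert_text)
    if not needs_reasoning(facts):
        return {"facts": facts, "analysis": None}
    return {"facts": facts, "analysis": reason(facts)}
```

Returning the attempt count alongside the result makes retries visible to your usage reporting instead of leaving them as a silent multiplier.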

How this applies in Microsoft-led SOCs (including Sentinel)

If you operate in a Microsoft security stack, the volume challenge is familiar: alerts can be frequent, context can be wide, and triage quality varies by shift.

Token governance is particularly effective when paired with:

Strong incident normalisation (so AI doesn’t “read” noisy variability)
Clear routing rules for common alert families
KQL-free or KQL-assisted workflows, where query generation is controlled and measured

The goal is simple: keep triage fast and consistent without turning every alert into an expensive research project.

A pragmatic next step for CISOs and SOC leaders

If you want a fast starting point, run a two-week baseline:

Measure tokens per alert triage, tokens per confirmed incident, and rework rate.
Standardise prompts for your top 5 alert types.
Introduce routing: small model first, large model only when severity or uncertainty triggers it.
Add role-based quotas with an incident burst mechanism.
Review results with both Security and Finance in the same meeting.

This creates an operational feedback loop and makes AI spend governable without slowing the SOC.

If you are preparing an AI SOC procurement or policy refresh, consider adding token observability and workflow-level governance requirements to your vendor checklist. It is one of the few levers that improve both cost control and operational discipline.

Written By:

Cymon Skinner
