
ITSM Lifecycle Golden Path

ITSM operations are distinct from product development. The OPS board holds customer incidents; the SPM board holds product stories. The pattern bridge connects the two, but only after a PII scrub.

At a Glance

Every OPS ticket flows through an 8-step pipeline: intake → discover → classify → cross-validate → decide → implement → operate → govern. Each step validates the ticket against real infrastructure and business policy. A subset of steps can be run with the --steps flag.

| I have a... | What happens | Command |
|---|---|---|
| New OPS ticket | Full 8-step processing with approval gates | /itsm:lifecycle OPS-NNN |
| Specific steps only | Execute selective subset (e.g., classify + cross-validate) | /itsm:lifecycle OPS-NNN --steps classify,cross-validate |
| Step 1: Intake | Auto-extract service type, environment, accounts, resource IDs | /itsm:lifecycle OPS-NNN --steps intake |
| Step 2: Discover | Inventory infrastructure with multi-account queries | /itsm:lifecycle OPS-NNN --steps discover |
| Step 3: Classify | Assign ticket type, priority, 6-prefix label taxonomy | /itsm:classify OPS-NNN |
| Step 4: Cross-Validate | Verify scope against 4 independent sources | /itsm:cross-validate OPS-NNN |
| Step 5: Decide | Assess change scheduling eligibility and blast radius | /itsm:lifecycle OPS-NNN --steps decide |
| Step 6: Implement | Create change record + CAB change request | /itsm:create-change OPS-NNN |
| Step 7: Operate | Configure monitoring, alarms, escalation paths | /itsm:lifecycle OPS-NNN --steps operate |
| Step 8: Govern | Cost impact summary, compliance evidence trail | /itsm:lifecycle OPS-NNN --steps govern |
| Resolved P0/P1 incident | Blameless post-incident review | /itsm:create-pir OPS-NNN |

All commands default to preview mode. Add --execute to apply changes to JIRA.
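The selective execution described above can be sketched as a small parser for the --steps value. This is a minimal illustration, not the command's actual implementation: the function name `select_steps`, the choice to reject unknown step names, and the choice to always run the selected steps in pipeline order are all assumptions.

```python
# The eight step names, in the order given in "At a Glance".
PIPELINE = ["intake", "discover", "classify", "cross-validate",
            "decide", "implement", "operate", "govern"]


def select_steps(steps_arg=None):
    """Return the steps to run, always in pipeline order (assumed behaviour)."""
    if not steps_arg:
        return list(PIPELINE)  # no --steps flag: full 8-step run
    requested = [s.strip().lower() for s in steps_arg.split(",") if s.strip()]
    unknown = [s for s in requested if s not in PIPELINE]
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    # Keep pipeline order even if the caller listed the steps out of order.
    return [s for s in PIPELINE if s in requested]


print(select_steps("cross-validate,classify"))
```

Running selected steps in pipeline order keeps a later step from executing before the step whose output it depends on.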


The Flow


Step 1: Intake

Who: sre-engineer auto-extracts. No manual input required.

What: Parse ticket description to identify service type, environment, affected accounts, and resource IDs.

Why: Consistent extraction before human judgment prevents classification errors downstream. Automation de-risks the later classification, cross-validation, and change-record steps (Steps 3, 4, and 6A).

What-if skip: Incomplete service context leads to misdirected queries and an incomplete blast radius assessment in Step 4 (Cross-Validate).

Auto-Extraction Checklist

  • Service type identified (compute, storage, network, identity, database)
  • Environment tagged (prod, staging, dev, dr)
  • Affected account IDs extracted
  • Resource IDs isolated (instance IDs, ARNs, names)
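The checklist above lends itself to pattern-based extraction. The sketch below is one possible approach, assuming free-text ticket descriptions; `extract_intake`, the keyword table, and the regular expressions are hypothetical and are not taken from the sre-engineer agent.

```python
import re

# 12-digit AWS account IDs, EC2 instance IDs (8 or 17 hex chars), and ARNs.
ACCOUNT_ID = re.compile(r"\b\d{12}\b")
INSTANCE_ID = re.compile(r"\bi-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")
ARN = re.compile(r"\barn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:(?:\d{12})?:[^\s,;]+")

ENVIRONMENTS = ("prod", "staging", "dev", "dr")
# Hypothetical keyword hints per service type; a real catalog would be richer.
SERVICE_KEYWORDS = {
    "compute": ("ec2", "instance"),
    "storage": ("s3", "bucket"),
    "network": ("vpc", "subnet"),
    "identity": ("iam", "role"),
    "database": ("rds", "database"),
}


def extract_intake(description):
    """Populate the four checklist items from a ticket description."""
    lowered = description.lower()

    def has_word(word):
        return re.search(rf"\b{re.escape(word)}\b", lowered) is not None

    def unique(pattern):
        # De-duplicate while preserving first-seen order.
        return list(dict.fromkeys(pattern.findall(description)))

    result = {
        "service_type": next((name for name, words in SERVICE_KEYWORDS.items()
                              if any(has_word(w) for w in words)), None),
        "environment": next((e for e in ENVIRONMENTS if has_word(e)), None),
        "account_ids": unique(ACCOUNT_ID),
        "resource_ids": unique(INSTANCE_ID) + unique(ARN),
    }
    # Items still empty here would fail the "all 4 populated" quality gate.
    result["missing"] = [k for k, v in result.items() if not v]
    return result
```

Keyword matching is only a first pass; the quality gate still expects the service type and account IDs to be verified against the service catalog and Organization metadata.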

CxO Perspective

  • CFO: "What's the cost exposure?" — account metadata supports cost allocation
  • CTO: "What systems are affected?" — service + environment scope
  • CSO: "Any security implications?" — environment tagging flags compliance-scoped resources

Quality Gate

  • All 4 checklist items populated
  • Service type verified against runbooks CLI service catalog
  • Account IDs cross-validated against Organization metadata

Step 2: Discover

Who: sre-engineer executes READONLY multi-account queries.

What: Execute inventory discovery commands for the identified service type using Config Aggregator (P1 org-wide path) or per-account profiles.

Why: Real resource state is needed before classification; this prevents the TEMPLATE_NOT_EXECUTION anti-pattern. Decisions based on assumed state are wrong decisions.

What-if skip: Scope decisions are made without seeing the actual resource count, age, configuration drift, or tagging status.

Discovery Scope

| Service | Discovery Tool | What It Returns |
|---|---|---|
| Compute | Config Aggregator + EC2 DescribeInstances | Instance count, age, tags, VPC placement |
| Storage | S3 ListBuckets + Config Aggregator | Bucket count, versioning status, lifecycle |
| Network | Config Aggregator + VPC DescribeFlowLogs | VPC CIDR, subnet count, NACLs, SGs |
| Database | Config Aggregator + RDS DescribeDBInstances | Instance type, engine version, backup status |
| Identity | IAM ListRoles + Organizations | Role count, last-used timestamp |

CxO Perspective

  • CFO: "How many resources are in scope?" — cost per resource type informs change decision
  • CTO: "Is this multi-account?" — multi-account changes trigger escalated CAB review
  • CSO: "Are there compliance-scoped resources?" — audit-trail requirements per resource

Quality Gate

  • Org-wide discovery attempted first (Config Aggregator query succeeds)
  • If 0 results: cross-validated against per-account fallback
  • Resource count documented (prevents CROSS_ACCOUNT_SILENT_ZERO)
  • Discovery metadata attached to JIRA ticket
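The org-wide-first rule with a per-account fallback on zero results can be sketched as follows. The query callables are hypothetical stand-ins for the Config Aggregator and per-account profile queries; no real AWS call is made, and the function name `discover` and the returned field names are assumptions.

```python
def discover(org_wide_query, per_account_queries):
    """Try the org-wide path first; never accept a zero result unchecked.

    org_wide_query: callable returning a list of resources (hypothetical).
    per_account_queries: dict of account_id -> callable returning a list.
    """
    resources = org_wide_query()
    path = "org-wide"
    if not resources:
        # Guard against CROSS_ACCOUNT_SILENT_ZERO: an empty aggregator
        # result is re-checked account by account before it is believed.
        path = "per-account-fallback"
        resources = []
        for account_id, query in per_account_queries.items():
            resources.extend(query())
    return {
        "path": path,
        "count": len(resources),  # the documented resource count
        "resources": resources,
        "verified_zero": path == "per-account-fallback" and not resources,
    }
```

Recording which path produced the count makes it possible to tell a genuine zero from a silent aggregator failure when the metadata is attached to the ticket.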

Step 3: Classification (/itsm:classify)

Who: sre-engineer classifies. HITL approves label set before JIRA update.

What: Decision tree assigns ticket type, priority, and 6-prefix label taxonomy based on Step 2 discovery data.

Why: Correct classification drives SLA, routing, and CAB review requirements. Misclassification propagates through downstream stages.

What-if skip: Tickets misrouted, SLAs misapplied, CAB reviews missed for Changes that require them.

Decision Tree

Ticket received
├── Unplanned disruption to service → Incident
│   └── Underlying cause unknown → Problem (linked)
├── Planned modification to production → Change
│   ├── Standard (pre-approved template) → No CAB required
│   ├── Normal (requires CAB review) → Step 6A + Step 6B
│   └── Emergency (P0/P1 impact) → Expedited CAB
└── User-requested provisioning → Service Request
    └── Fulfillment via service catalog

6-Prefix Label Taxonomy

| Prefix | Values | Purpose |
|---|---|---|
| blast: | single-service, single-account, multi-account, org-wide | Blast radius for impact scoring |
| rca: | code-defect, config-drift, infra-failure, human-error, 3rd-party | Root cause category |
| ke: | known-error-{id} | Known error database reference |
| service: | cloudops, finops, network, identity, compute | Affected service domain |
| env: | prod, staging, dev, dr | Environment scoping |
| tier: | critical, important, best-effort | Business impact tier (not tier-1/2/3) |
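A label set can be checked against the taxonomy above before it is presented for HITL approval. This is a sketch only: `validate_labels` is a hypothetical helper, and the decision to ignore labels outside the six prefixes (such as the cr: labels used elsewhere in this guide) is an assumption.

```python
# Allowed values per prefix, copied from the taxonomy table.
# None marks the free-form ke: prefix (known-error-{id}).
TAXONOMY = {
    "blast": {"single-service", "single-account", "multi-account", "org-wide"},
    "rca": {"code-defect", "config-drift", "infra-failure", "human-error", "3rd-party"},
    "ke": None,
    "service": {"cloudops", "finops", "network", "identity", "compute"},
    "env": {"prod", "staging", "dev", "dr"},
    "tier": {"critical", "important", "best-effort"},
}


def validate_labels(labels):
    """Return a list of problems; an empty list means the label set passes."""
    problems, seen = [], set()
    for label in labels:
        prefix, _, value = label.partition(":")
        if prefix not in TAXONOMY:
            continue  # assumption: labels outside the six prefixes are ignored
        seen.add(prefix)
        allowed = TAXONOMY[prefix]
        if allowed is None:
            ok = value.startswith("known-error-") and len(value) > len("known-error-")
        else:
            ok = value in allowed
        if not ok:
            problems.append(f"invalid value: {label}")
    problems += [f"missing prefix: {p}:" for p in TAXONOMY if p not in seen]
    return problems
```

An empty result corresponds to the "all 6 label prefixes populated" item in the quality gate.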

Priority Matrix

| | Low Urgency | High Urgency |
|---|---|---|
| High Impact | P2 (4h SLA) | P0 (15min SLA) |
| Low Impact | P4 (5d SLA) | P3 (8h SLA) |

P1 = High Impact and High Urgency, but not platform-wide (platform-wide impact is the P0 threshold).
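The matrix and the P1 note can be read as a single lookup. The sketch below encodes one interpretation: High Impact with High Urgency is P0 only when the impact is platform-wide, and P1 otherwise. The function name and the boolean inputs are assumptions, and the source does not state a P1 SLA, so none is given here.

```python
# SLAs as listed in the Priority Matrix; the source gives no SLA for P1.
SLA = {"P0": "15min", "P2": "4h", "P3": "8h", "P4": "5d"}


def priority(high_impact, high_urgency, platform_wide=False):
    """Map Impact x Urgency (plus the platform-wide test) to P0-P4."""
    if high_impact and high_urgency:
        return "P0" if platform_wide else "P1"
    if high_impact:
        return "P2"  # high impact, low urgency
    if high_urgency:
        return "P3"  # low impact, high urgency
    return "P4"      # low impact, low urgency


print(priority(True, True, platform_wide=True))  # P0
print(priority(True, True))                      # P1
```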

Quality Gate

  • Ticket type assigned (Incident / Problem / Change / Service Request)
  • All 6 label prefixes populated
  • Priority P0-P4 set based on Impact × Urgency matrix
  • HITL approves label set before jira_update_issue call

Step 4: Cross-Validation (/itsm:cross-validate)

Who: sre-engineer gathers evidence. qa-engineer validates accuracy.

What: 4 independent sources corroborate the incident report before any change record is created.

Why: This is the 99.5% accuracy gate. It prevents the SELF_COMPARISON_VALIDATION anti-pattern: sources must be independent, not exports produced by the same process from the same API call.

What-if skip: Change record based on incorrect scope, wrong systems in blast radius, PIR missing affected components.

4 Independent Sources

| Source | Tool | What It Provides |
|---|---|---|
| JIRA OPS | atlassian-tools MCP | Ticket description, comments, affected systems |
| runbooks CLI | /inventory:discover | Live resource inventory, cross-account |
| AWS API | READONLY profiles ($AWS_OPERATIONS_PROFILE) | CloudWatch alarms, Config events, flow logs |
| Visual | CloudWatch Console / dashboards | Timeline reconstruction, anomaly visualization |

5W1H Evidence Template

**What**: [Specific failure observed — service, endpoint, error message]
**Who**: [Affected users/systems — count, persona, account]
**When**: [Start time, detection time, escalation time — UTC ISO-8601]
**Where**: [Region, VPC, account ID reference, service domain]
**Why**: [Root cause hypothesis — rca: label maps here]
**How**: [Reproduce steps for engineering team]

This block is appended to the JIRA ticket description as an ADF panel node rather than a blockquote; for structured nodes, ADF fidelity is preserved by going through REST API v3.

Quality Gate

  • All 4 sources queried (no CROSS_ACCOUNT_SILENT_ZERO — 0 results verified against account scope)
  • 5W1H evidence appended to JIRA ticket
  • Reproduce prompt attached for engineering team
  • Cross-validation delta ≤0.5% across independent sources
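The ≤0.5% delta in the quality gate can be illustrated with a simple spread calculation. The source does not define how the delta is computed, so the formula below (relative spread between the highest and lowest count) is an assumption, as are the function names. The sample counts are the spreadsheet and org-wide figures from the OPS-180 worked example.

```python
def cross_validation_delta(counts):
    """Relative spread, in percent, between the highest and lowest count."""
    values = list(counts.values())
    if len(values) < 2:
        raise ValueError("need at least two independent sources")
    highest = max(values)
    if highest == 0:
        return 0.0  # every source agrees on zero
    return (highest - min(values)) / highest * 100


def passes_gate(counts, tolerance_pct=0.5):
    return cross_validation_delta(counts) <= tolerance_pct


counts = {"customer_spreadsheet": 110, "org_wide_search": 117}
print(round(cross_validation_delta(counts), 2))  # 5.98
print(passes_gate(counts))                       # False
```

Under this definition a gap like 110 versus 117 fails the gate, which matches the worked example: the discrepancy is explained before change planning starts rather than averaged away.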

Step 5: Decide

Who: sre-engineer evaluates, 3-Party Golden Guideline gate applied.

What: Assess change scheduling eligibility, maintenance window recommendation, and blast radius impact.

Why: Formal go/no-go gate before implementation. Business, Security, and Engineering must align before resources are modified.

What-if skip: Changes proceed without approval; maintenance windows clash; blast radius underestimated; regulatory compliance unchecked.

3-Party Golden Guideline Gate

Change proceeds ONLY when ALL three parties approve:

| Party | Decision | Evidence |
|---|---|---|
| Business (CFO) | Cost/benefit acceptable | Cost impact analysis from Step 8 |
| Security (CSO) | Compliance trail complete | Audit evidence, encryption status |
| Engineering (CTO) | Rollback plan tested | Pre-staging validation completed |
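The all-three-must-approve rule can be expressed as a small check. The `Approval` record and the requirement that each approval carry non-empty evidence are assumptions made for illustration; the source only states that all three parties must approve.

```python
from dataclasses import dataclass

REQUIRED_PARTIES = ("business", "security", "engineering")


@dataclass(frozen=True)
class Approval:
    party: str      # one of REQUIRED_PARTIES
    approved: bool
    evidence: str   # e.g. a link to the cost analysis or rollback test


def gate_passes(approvals):
    """True only when every required party approved with evidence attached."""
    by_party = {a.party: a for a in approvals}
    return all(
        party in by_party
        and by_party[party].approved
        and bool(by_party[party].evidence.strip())
        for party in REQUIRED_PARTIES
    )
```

A missing party counts as a failure, so the gate cannot pass by omission.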

CxO Perspective

  • CFO: "What's the cost of this change?" — ongoing monitoring/alarms cost vs. maintenance cost
  • CTO: "What's the rollback plan?" — confidence in revert timings, downtime risk
  • CSO: "Does this pass security review?" — encryption, access control, audit logging

Quality Gate

  • 3-Party gate PASS (all three approvers sign off)
  • Maintenance window confirmed with HITL
  • Blast radius quantified (instance count, service impact, user count)
  • Rollback plan documented and tested

Step 6A: Implement — Change Management (/itsm:create-change)

Who: sre-engineer drafts. HITL approves before creation.

What: 7-section change record created as JIRA sub-task or linked issue under the incident.

Why: MSP cannot modify production environments without a formal change order. Audit trail required for APRA CPS 234.

What-if skip: Production changes executed without approval, audit gaps, regulatory non-compliance.

Conditional Stage

Step 6A executes only for Change-type tickets (as assigned by Step 3 classification). Incident and Service Request tickets skip to PIR.

7-Section Change Template

| Section | Content | Required |
|---|---|---|
| 1. Summary | One-line description of the change | Yes |
| 2. Impact Assessment | Systems affected, blast radius, user count | Yes |
| 3. Risk Rating | Low/Medium/High with justification | Yes |
| 4. Rollback Plan | Step-by-step revert procedure with time estimate | Yes |
| 5. Test Plan | Pre/post validation commands with expected outputs | Yes |
| 6. Implementation Steps | Numbered runbook with commands | Yes |
| 7. Approval | CAB reviewer list, approval date/time | Yes (CAB changes) |

HITL Gate

sre-engineer → drafts all 7 sections
             → attaches evidence from Step 4 (Cross-Validate)
             → presents draft to HITL
HITL         → reviews rollback plan
             → approves or requests changes
             → executes: jira_create_issue (NOT agent-initiated)

Quality Gate

  • All 7 sections populated (no placeholders)
  • Rollback plan tested in non-prod before change window
  • Implementation window confirmed with CAB (for Normal changes)
  • Change ticket linked to originating Incident in JIRA

Step 6B: Implement — Change Request (/itsm:create-cr)

Who: sre-engineer creates CAB subtask. HITL schedules review.

What: CAB review subtask with implementation window, rollback plan, and test plan.

Why: Normal changes require explicit CAB approval with documented review. Emergency changes require expedited CAB with post-implementation review.

What-if skip: Changes executed without CAB oversight, audit trail incomplete, regulatory breach.

Conditional Stage

Step 6B executes only when Step 3 classification assigned the label cr:normal or cr:emergency. Standard pre-approved changes skip Step 6B.
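The two conditional rules for the Implement step (change record only for Change tickets, CAB request only for cr:normal or cr:emergency) can be combined into one routing function. The function name and the sub-step identifiers are hypothetical.

```python
CAB_LABELS = {"cr:normal", "cr:emergency"}


def implement_substeps(ticket_type, labels):
    """Return the Implement sub-steps that apply to a classified ticket."""
    if ticket_type != "Change":
        return []  # non-Change tickets skip Implement and go to PIR
    substeps = ["6A:create-change"]
    if CAB_LABELS & set(labels):
        substeps.append("6B:create-cr")  # Normal or Emergency change
    return substeps  # Standard pre-approved changes stop after 6A


print(implement_substeps("Change", ["cr:normal", "env:prod"]))
```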

CAB Subtask Fields

| Field | Normal Change | Emergency Change |
|---|---|---|
| Implementation Window | Scheduled (≥48h notice) | Expedited (HITL-approved window) |
| Rollback Plan | Full 7-section plan | Abbreviated (minimum 3 steps) |
| Test Plan | Pre+post validation | Post-implementation only |
| CAB Reviewers | 2 approvers minimum | 1 approver + post-review |
| Evidence Required | Step 4 + Step 6A | Step 4 minimum |

Quality Gate

  • CAB subtask linked to Change ticket (not standalone)
  • Implementation window confirmed in JIRA
  • At least 1 CAB approver assigned before window opens
  • Emergency changes trigger post-implementation review within 24h

Step 7: Operate

Who: sre-engineer configures. HITL owns alarm thresholds.

What: Define CloudWatch alarms, exception escalation path, rollback triggers, and 30/60/90 day review schedule.

Why: Post-implementation monitoring catches regression early. Absence of alarms means failures go undetected.

What-if skip: Changes deployed without monitoring; production issues undetected for hours/days; MTTR inflates.

Monitoring Checklist

  • CloudWatch alarms defined (threshold, SNS topic)
  • Exception escalation path documented (PagerDuty routing, on-call owner)
  • Rollback triggers identified (error rate > X%, latency > Y ms)
  • Review schedule set (Day 1, Week 1, Month 1)

CxO Perspective

  • CFO: "What's the monitoring cost?" — CloudWatch metrics, log ingestion, alarm evaluation
  • CTO: "How will we detect regression?" — alarm coverage across all modified resources
  • CSO: "Are audit logs enabled?" — CloudTrail for all Implement changes, Config rules for compliance

Quality Gate

  • Alarms configured in CloudWatch (not placeholders)
  • Escalation path integrated with PagerDuty/Slack
  • Rollback trigger thresholds set based on SLA
  • First review date added to JIRA ticket

Step 8: Govern

Who: sre-engineer drafts, HITL reviews, enterprise archives.

What: Cost impact summary (before/after), security posture delta, compliance evidence trail.

Why: Audit readiness and continuous improvement data. Without governance records, compliance audits fail.

What-if skip: No evidence trail, compliance gaps, inability to measure change impact, regulatory breach risk.

Governance Records

| Record Type | Source | Requirement |
|---|---|---|
| Cost Impact | Cost Explorer before/after | $ actual delta vs. estimated |
| Security Posture | Config rules + IAM changes | Policy changes, encryption status |
| Compliance Trail | CloudTrail, Config Conformance | Evidence of audit logging |
| Performance Delta | CloudWatch metrics | Latency, error rate change |

CxO Perspective

  • CFO: "What was the actual cost impact?" — monthly spend comparison, ROI validation
  • CTO: "Did we hit SLA?" — MTTR improvement, error rate reduction
  • CSO: "Is the compliance trail complete?" — Config Conformance pack status, evidence linkage

Quality Gate

  • Cost delta documented (actual vs. estimated)
  • Security policy changes summarized
  • Compliance evidence linked to JIRA ticket (CloudTrail, Config)
  • HITL approves governance record before archival

Post-Resolution: PIR (/itsm:create-pir)

Who: sre-engineer drafts. HITL reviews and publishes to Confluence OPS.

What: Post-Incident Review with 5-Why RCA, timeline, and action items.

Why: Without PIR, incidents recur. Action items with owners and due dates prevent recurrence.

What-if skip: Incidents repeat, team learns nothing, MTTR stagnates, pattern bridge never fires.

PIR Structure

## Post-Incident Review: {Ticket ID}

**Incident**: {title}
**Duration**: {start} → {resolved} ({total minutes})
**Severity**: P{n} | Impact: {blast: label}

## Timeline
| Time (UTC) | Event | Source |
|------------|-------|--------|
| HH:MM | Detection | CloudWatch alarm |
| HH:MM | Escalation | PagerDuty |
| HH:MM | Mitigation applied | Change record |
| HH:MM | Service restored | Monitoring |

## 5-Why Root Cause Analysis
1. Why did the service fail? → {observation}
2. Why did {observation} occur? → {deeper cause}
3. Why did {deeper cause} exist? → {system cause}
4. Why wasn't {system cause} detected? → {monitoring gap}
5. Why wasn't the monitoring gap closed? → {process gap}

**Root Cause**: {concise statement}
**rca: label**: {rca:code-defect | config-drift | infra-failure | human-error | 3rd-party}

## Action Items
| Action | Owner | Due Date | JIRA Ticket |
|--------|-------|---------|-------------|
| Fix monitoring gap | CloudOps Engineer | {date} | OPS-NNN |
| Update runbook | SRE | {date} | OPS-NNN |

Quality Gate

  • 5-Why analysis reaches process/system root cause (not stopping at symptom)
  • All action items have owners and due dates
  • PIR published to Confluence OPS space (not just JIRA)
  • JIRA ticket transitioned to Resolved only after PIR approved by HITL

Pattern Bridge: OPS to SPM

When 3 or more incidents share the same service: and rca: labels within a 90-day window, the pattern bridge triggers an auto-draft product story in JIRA SPM.

Why: Recurring incidents that exceed the MTTR budget have a root cause that belongs in the product backlog, not the operations queue.

Bridge Conditions

| Condition | Threshold | Action |
|---|---|---|
| Same service: + rca: | 3+ incidents in 90 days | Auto-draft SPM story |
| Same service: + P0 | 1 P0 incident | Immediate SPM story draft |
| MTTR > SLA × 3 | Any severity | Draft SRE capacity story |
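The first bridge condition (three or more incidents with the same service: and rca: labels inside 90 days) is a sliding-window count. The sketch below covers that condition only; the tuple layout of an incident and the function name are assumptions, and the P0 and MTTR conditions are not modelled.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bridge_candidates(incidents, threshold=3, window_days=90):
    """Return (service, rca) pairs that hit the recurrence threshold.

    incidents: iterable of (service, rca, opened_at) with opened_at a datetime.
    """
    groups = defaultdict(list)
    for service, rca, opened_at in incidents:
        groups[(service, rca)].append(opened_at)

    window = timedelta(days=window_days)
    hits = []
    for key, times in groups.items():
        times.sort()
        # Any `threshold` consecutive incidents spanning <= window triggers.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                hits.append(key)
                break
    return hits
```

Comparing the first and last of each run of three sorted timestamps avoids counting incidents that share labels but are spread over more than 90 days.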

PII Scrubbing

OPS tickets contain customer names, account IDs, and configuration data. Before creating an SPM story, replace:

  • Account IDs with {account_type} placeholders
  • Customer names with {customer_type} or persona labels
  • IP/CIDR with network tier labels (prod-vpc, dr-vpc)
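The three replacements above can be sketched with regular expressions. This is illustrative only: `scrub` is a hypothetical helper, customer names must be supplied by the caller because they cannot be detected by pattern, and the generic `{network_tier}` placeholder stands in for the real mapping from a CIDR to a tier label such as prod-vpc or dr-vpc.

```python
import re

ACCOUNT_ID = re.compile(r"\b\d{12}\b")
IPV4_OR_CIDR = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b")


def scrub(text, customer_names=(), network_label="{network_tier}"):
    """Replace account IDs, IPs/CIDRs, and known customer names."""
    text = ACCOUNT_ID.sub("{account_type}", text)
    text = IPV4_OR_CIDR.sub(network_label, text)
    for name in customer_names:
        # Case-insensitive literal match on each supplied name.
        text = re.sub(re.escape(name), "{customer_type}", text, flags=re.IGNORECASE)
    return text


print(scrub("Acme Corp account 123456789012 lost 10.1.0.0/16", ["Acme Corp"]))
```

Pattern-based scrubbing catches only well-formed identifiers, so a human review of the drafted SPM story is still advisable before it leaves the OPS board.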

Key: The bridge is intentionally one-directional. Operational incidents drive product improvements. Planned product changes follow the SDLC lifecycle via /sync:jira-push.


Regulatory Compliance

ITSM lifecycle steps meet ANZ FSI/Energy/Telecom regulatory requirements:

| Step | Compliance Area |
|---|---|
| Intake | Incident logging and timeliness |
| Discover | Asset inventory and cross-account scope |
| Classify | Asset identification and criticality assessment |
| Cross-Validate | Control testing and evidence gathering (99.5% accuracy gate) |
| Decide | Material change approval (3-Party gate) |
| Implement | Change management proportional to risk + CAB review |
| Operate | Monitoring controls and alert responsiveness |
| Govern | Audit trail completeness and evidence preservation |
| PIR | Systematic review and continuous improvement |
| Pattern Bridge | Recurring issue escalation to product backlog |

Detailed mapping (APRA CPS 234, CPS 230, NIST CSF 2.0) is documented in ADLC Governance Rules.


Component Map

| Component | Type | Purpose |
|---|---|---|
| /itsm:lifecycle | Command | Full 8-step pipeline orchestrator |
| /itsm:classify | Command | Steps 1–3: intake, discover, classify |
| /itsm:cross-validate | Command | Step 4: 4-source evidence collection |
| /itsm:create-change | Command | Step 6A: change record creation |
| /itsm:create-cr | Command | Step 6B: CAB change request |
| /itsm:create-pir | Command | Post-resolution: post-incident review |
| sre-engineer | Agent | Executes all ITSM commands (haiku tier) |
| itsm-lifecycle-steps | Skill | 8-step definitions: intake, discover, decide, operate, govern |
| itsm-ticket-classification | Skill | Step 3: classification decision tree + 6-prefix taxonomy |
| itsm-change-management | Skill | Steps 5–6: change workflow and risk assessment |
| 3-party-golden-guideline | Skill | Step 5: Business/Security/Engineering approval gate |
| itsm-ticket-template | Skill | Standardized OPS + SPM ticket templates |
| jira-jsm-service-desk | Skill | JSM lifecycle and SLA management |
| atlassian-tools | MCP | 72 tools for JIRA + Confluence |

Common Mistakes (What NOT to Do)

| Mistake | Why It Fails | Fix |
|---|---|---|
| TECHNICAL_WITHOUT_PROCESS | Changes without change order and approver | All infrastructure actions must include approval process |
| AUTONOMOUS_SPRINT_ASSIGNMENT | Agent moves OPS tickets to sprint without HITL | Sprint assignment is HITL-exclusive action |
| INCOMPLETE_EVIDENCE | Change records built on incomplete facts | All 4 sources verified before creating changes |

Worked Example: Instance Scheduler Expansion (OPS-180)

A customer emailed asking to extend automated overnight EC2 scheduling across more accounts. The ops team processed this through all 8 steps in a single sprint cycle.

In three sentences: The scheduler already runs in 8 accounts, stopping instances at 8pm and restarting at 8am Monday–Friday. Discovery revealed 117 instances across 17 accounts; the team validated scope against four independent sources before moving to CAB approval. Phase 1 will pilot one additional account after passing the 3-Party Golden Guideline gate.

How Each Step Played Out

Step 1 — Intake: Customer email arrived; a JIRA OPS ticket was created as type Change with minimal labels. Service type (compute) and environment (prod) extracted automatically.

Step 2 — Discover: Config Aggregator inventory query returned 117 instances across 17 accounts. Excel spreadsheet (customer's prior version) showed 110 instances across 16 accounts — discrepancy flagged for Step 4.

Step 3 — Classify: The ticket was classified as Change (not Task) because scheduler modification requires a change order. Labels added: service:ec2, tier:important, itsm-type:change, blast:multi-service, cr:normal.

Step 4 — Cross-Validate: Four independent queries were executed before any change planning began:

| Data Source | Result |
|---|---|
| Customer spreadsheet (Excel) | 110 instances across 16 accounts |
| AWS org-wide resource search | 117 instances across 17 accounts |
| AWS Config (env-tagged instances only) | 94 instances across 13 accounts |
| Billing data (EC2 Compute share) | ~1.88% of total org spend |

The gap between 110 and 117 reflects instances launched after the spreadsheet was last updated. The lower count from the env-tag query flags a tagging gap — two account environments use non-standard tag values and are invisible to the current scheduler rules.

Step 5 — Decide: The 3-Party Golden Guideline gate assessed:

  • Business (CFO): Estimated savings $2,400/month (16 instances × $150/month = $2,400 avoided spend during off-hours). Cost of monitoring alarms: $180/month. ROI positive — approve.
  • Security (CSO): No encryption changes, no policy modifications. Existing CloudTrail logging applies. Approve.
  • Engineering (CTO): Scheduler is managed code; rollback is disable-cron + force-start. Tested in sandbox. Approve.

All three parties approved the change. Maintenance window: Tuesday 22:00–00:00 NZT (2-hour window, 50% of active instances in batch).
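The Business case in this gate reduces to simple arithmetic. The figures below are the ones quoted in the CFO bullet; the net figure is derived here and is not stated in the source.

```python
# Figures quoted in the Business (CFO) assessment.
instances = 16
saving_per_instance = 150   # USD per instance per month
monitoring_cost = 180       # USD per month for the new alarms

gross_saving = instances * saving_per_instance   # 2400
net_saving = gross_saving - monitoring_cost      # 2220 (derived, not in source)

print(f"gross ${gross_saving}/month, net ${net_saving}/month")
```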

Step 6A — Create Change: The change record documented the AS-IS state (8 confirmed accounts, dev/preprod/sit tags, 8pm–8am NZT window) and three expansion candidates: sandbox, shared-services, datalake-dev. The tag gap on two non-standard environments was flagged as a prerequisite — those accounts cannot be added until tags are normalised.

Step 6B — CAB Review: The change was routed as a Normal change (cr:normal) with a 48-hour advance notice requirement. CAB conditions: the customer must submit an always-on exception list (batch jobs and compliance processes that must not stop) before the pilot window opens. Phase 1 covers one account only; Phase 2 requires a separate CAB update.

Step 7 — Operate (during pilot window): CloudWatch alarms configured for scheduler errors. Exception escalation routed to on-call SRE. Rollback trigger: >5 failed start operations in 1 hour. Review schedule: Day 1 (spot-check), Week 1 (full reconciliation), Month 1 (savings report).

Step 8 — Govern (after pilot week): Cost Explorer showed $287 in actual savings (vs. the $2,400 estimated for a full month; the pilot covered 1 account). The discrepancy was investigated: only 14 of 16 instances stopped successfully (2 had active user connections that caused the stop to fail). Security audit: CloudTrail logged all scheduler actions. The compliance trail was linked to the JIRA ticket.

Customer Decision Gate

Before Phase 2 proceeds, the customer must answer:

  1. Which account should be the Phase 1 pilot?
  2. What workloads must never be stopped (always-on exception list)?
  3. Should the tag gap in the two non-standard environments be fixed, or should those accounts be excluded?

The CloudOps Manager owns the AWSO backlog ticket (due before the Phase 2 CAB update). No expansion beyond the pilot account occurs without customer confirmation at this gate.


Agent Team

| Agent | Role in This Path | Phase/Stage | Talent Bench |
|---|---|---|---|
| sre-engineer | 8-step ITSM execution + runbook orchestration | All phases (Intake→Govern) | Profile |
| security-compliance-engineer | Security posture assessment + compliance validation | Classify/Cross-validate | Profile |
| cloud-architect | Architecture review + risk assessment for mutations | Decide/Implement | Profile |
| product-owner | Ticket prioritization + business impact scoring | Intake/Target | Profile |
| qa-engineer | 4-source cross-validation scoring + evidence validation | Cross-validate | Profile |

7 Skills Coverage

| Skill | Coverage in This Path | Implementation |
|---|---|---|
| S1 System Design | 8-step pipeline architecture (Intake→Discover→Classify→Cross-validate→Decide→Implement→Operate→Govern) | ITSM lifecycle model, decision gates, escalation protocols |
| S2 Tool Design | JIRA MCP atlassian-tools + runbooks CLI strict schemas + AWS API contracts | Tool integration, schema validation, error message specificity |
| S3 Retrieval | 4-source cross-validation: Excel inventory, AWS API, runbooks CLI, JIRA OPS | Independent data sources, query diversity, accuracy targets ≥99.5% |
| S4 Reliability | API retry logic (3 attempts, exponential backoff) + fallback paths for circuit breaker scenarios | Pagination, timeout enforcement, error recovery per operational-efficiency.md Rule 6 |
| S5 Security | HITL gates before FIX/Implement mutations, READONLY profile restriction, change management process validation | Access control, approval workflow, audit trail (CloudTrail) |
| S6 Evaluation | 4-way cross-validation (Excel vs AWS vs CLI vs JIRA) + CxO-ready risk scoring | Evidence aggregation, persona-based filtering (CFO/CTO/CloudOps), accuracy measurement |
| S7 Product Thinking | Persona-specific reports (CFO: cost/risk, CTO: architecture/dependencies, CloudOps: runbook/automation), email templates for stakeholder updates | Report generation, audience segmentation, business impact framing |

Last Updated: April 2026 (8-step model) | Status: Active | Maintenance: sre-engineer