Build trusted agents that production can rely on
AURA is an open-source harness for production-ready agents that handle operations alongside your engineering teams. The agents take on routine work so engineers can focus on the decisions that matter, while control and data stay in the teams' hands.
# Orchestrator routes to specialist agents
[llm]
provider = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"
model = "gpt-5.2"
[[vector_stores]]
name = "runbooks"
type = "qdrant"
url = "http://{{ env.QDRANT_HOST | default: 'localhost' }}:6334"
collection_name = "sre_runbooks"
context_prefix = "Operational runbooks covering incident response procedures, known failure modes, and troubleshooting guides"
embedding_model = { provider = "openai", model = "text-embedding-3-small", api_key = "{{ env.OPENAI_API_KEY }}" }
[agent]
name = "SRE Orchestrator"
system_prompt = """
You are an SRE Orchestrator. Decompose incident response tasks and delegate:
- incident-responder: PagerDuty incident lookup, alert details, oncall schedules
- metrics-analyst: Prometheus queries to validate alerts and check trends
- log-analyst: Log search, error patterns, timeline correlation
Maximize parallel execution when tasks have no data dependency.
"""
turn_depth = 15
temperature = 0.3
[mcp]
sanitize_schemas = true
[mcp.servers.pagerduty]
transport = "http_streamable"
url = "https://mcp.pagerduty.com/mcp"
headers = { Authorization = "Token token={{ env.PAGERDUTY_API_KEY }}" }
description = "PagerDuty MCP for incident details, oncall schedules, and alert status"
[mcp.servers.prometheus]
transport = "http_streamable"
url = "http://{{ env.PROMETHEUS_MCP_HOST | default: 'localhost' }}:8080/mcp"
description = "Prometheus MCP for querying system metrics"
[mcp.servers.log_analysis]
transport = "http_streamable"
url = "https://mcp.mezmo.com/mcp"
description = "Log analysis MCP for searching and correlating log events"
[orchestration]
enabled = true
[orchestration.worker.incident-responder]
description = "PagerDuty incident triage: fetch incident details, parse alerts, check oncall schedules"
turn_depth = 8
mcp_filter = [
"list_incidents",
"get_incident",
"list_alerts_from_incident",
"get_alert_from_incident",
"list_services",
"get_service",
"get_current_time",
]
preamble = """
You are an Incident Responder. Use PagerDuty tools to fetch and parse incidents.
Extract: environment, alert category, severity, timestamp, metric value, RunBook URL, and triggering query.
Always use tools — do not fabricate incident data.
"""
[orchestration.worker.metrics-analyst]
description = "Prometheus metrics analysis: validate alerts, check trends, identify anomalies"
turn_depth = 20
mcp_filter = [
"execute_query",
"execute_range_query",
"list_metrics",
"get_current_time",
]
preamble = """
You are a Metrics Analyst. Query Prometheus to validate alerts, check trends, and identify anomalies.
Always get current time before range queries. Do not fabricate metric values.
Report query results clearly with metric names, labels, and values.
"""
[orchestration.worker.log-analyst]
description = "Log analysis: search logs, analyze error patterns, correlate events across time"
turn_depth = 20
vector_stores = ["runbooks"]
mcp_filter = [
"analyze_logs_*",
"deduplicate_logs_*",
"get_correlated_timeline_*",
"get_current_time",
"get_log_histogram",
"list_log_fields",
]
preamble = """
You are a Log Analyst. Search and analyze logs for operational investigations.
Search runbooks for known failure patterns when errors match documented scenarios.
Report findings with timestamps, error messages, and relevant context.
"""
The production runtime for AI workflows
Bring your models. Connect your stack. Deploy on your own infrastructure.
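In the configuration above, each worker's mcp_filter acts as a tool allowlist, and some entries end in `*`. The sketch below shows one plausible reading of those entries, assuming they are shell-style wildcards matched against MCP tool names. That matching rule, the helper `allowed_tools`, and the tool name `analyze_logs_by_service` are assumptions made for illustration and are not taken from AURA itself.

```python
from fnmatch import fnmatchcase

# Filter copied from the log-analyst worker in the configuration above.
LOG_ANALYST_FILTER = [
    "analyze_logs_*",
    "deduplicate_logs_*",
    "get_correlated_timeline_*",
    "get_current_time",
    "get_log_histogram",
    "list_log_fields",
]


def allowed_tools(available, patterns):
    """Return the tool names that match at least one pattern, in input order.

    Assumption: patterns are shell-style wildcards, matched case-sensitively.
    """
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical tool names that connected MCP servers might expose.
available = [
    "analyze_logs_by_service",  # hypothetical: matches "analyze_logs_*"
    "get_current_time",         # exact match
    "execute_query",            # not in the log-analyst filter
    "list_log_fields",          # exact match
]

print(allowed_tools(available, LOG_ANALYST_FILTER))
# ['analyze_logs_by_service', 'get_current_time', 'list_log_fields']
```

Under this reading, a worker only ever sees the tools its filter names, which keeps each specialist's tool list short and its scope narrow.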
Root cause analysis grounded in the systems AURA already operates. It shows its work, so SREs can trust the diagnosis, and shares findings across agents and tools through open standards like MCP.
Remediation, change routing, security response, cost control. AURA decides on the appropriate action, executes routine work only after a human signs off, and escalates novel or risky changes for review with complete context.
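The sign-off and escalation behavior described here can be pictured as a small routing rule. The sketch below is illustrative only: the `Risk` levels, the `ProposedAction` type, and the `route` function are hypothetical names and do not come from AURA's codebase.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Risk(Enum):
    ROUTINE = "routine"
    NOVEL = "novel"
    RISKY = "risky"


@dataclass
class ProposedAction:
    summary: str
    risk: Risk
    # Evidence handed to the reviewer along with the proposal.
    context: dict = field(default_factory=dict)


def route(action: ProposedAction, approved_by: Optional[str] = None) -> str:
    """Return the next step for a proposed action (hypothetical rule)."""
    if action.risk is not Risk.ROUTINE:
        return "escalate"        # novel or risky: always goes to human review
    if approved_by is None:
        return "await_sign_off"  # routine, but nobody has approved it yet
    return "execute"             # routine and approved


restart = ProposedAction(
    "restart stuck worker", Risk.ROUTINE, {"alert": "queue depth"}
)
print(route(restart))                        # await_sign_off
print(route(restart, approved_by="oncall"))  # execute
print(route(ProposedAction("change schema", Risk.RISKY), approved_by="oncall"))  # escalate
```

The point of the design is that approval alone never makes a risky change executable: anything outside the routine category goes to review regardless of who has signed off.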
AURA compounds. Each incident feeds prevention, hardening, and change validation into the next cycle, making the system quicker to diagnose, safer to change, and cheaper to run.
Create production-ready AI workflows
Why open source?
Extensible and transparent
Scale without burning out
AURA gives your team room to breathe.
Monitor, respond, remediate, validate, document: the loop that keeps a system healthy. One engineer can run it for one service. Across hundreds of services, running continuously with no one free to watch, it stops being something a team can sustain. AURA runs that loop the way your engineers would, checks and corrections included, so it keeps watch and the team gets room to breathe again.
