Build trusted agents that production can rely on

AURA is an open-source harness for production-ready agents that run operations work alongside your engineering teams, taking on routine work so they can focus on decisions that matter, while keeping control and data in their hands.

Get started on GitHub Join us on Slack

# Orchestrator routes to specialist agents
[llm]
provider = "openai"
api_key = "{{ env.OPENAI_API_KEY }}"
model = "gpt-5.2"


[[vector_stores]]
name = "runbooks"
type = "qdrant"
url = "http://{{ env.QDRANT_HOST | default: 'localhost' }}:6334"
collection_name = "sre_runbooks"
context_prefix = "Operational runbooks covering incident response procedures, known failure modes, and troubleshooting guides"
embedding_model = { provider = "openai", model = "text-embedding-3-small", api_key = "{{ env.OPENAI_API_KEY }}" }


[agent]
name = "SRE Orchestrator"
system_prompt = """
You are an SRE Orchestrator. Decompose incident response tasks and delegate:
- incident-responder: PagerDuty incident lookup, alert details, oncall schedules
- metrics-analyst: Prometheus queries to validate alerts and check trends
- log-analyst: Log search, error patterns, timeline correlation


Maximize parallel execution when tasks have no data dependency.
"""
turn_depth = 15
temperature = 0.3


[mcp]
sanitize_schemas = true


[mcp.servers.pagerduty]
transport = "http_streamable"
url = "https://mcp.pagerduty.com/mcp"
headers = { Authorization = "Token token={{ env.PAGERDUTY_API_KEY }}" }
description = "PagerDuty MCP for incident details, oncall schedules, and alert status"


[mcp.servers.prometheus]
transport = "http_streamable"
url = "http://{{ env.PROMETHEUS_MCP_HOST | default: 'localhost' }}:8080/mcp"
description = "Prometheus MCP for querying system metrics"


[mcp.servers.log_analysis]
transport = "http_streamable"
url = "https://mcp.mezmo.com/mcp"
description = "Log analysis MCP for searching and correlating log events"


[orchestration]
enabled = true


[orchestration.worker.incident-responder]
description = "PagerDuty incident triage: fetch incident details, parse alerts, check oncall schedules"
turn_depth = 8
mcp_filter = [
  "list_incidents",
  "get_incident",
  "list_alerts_from_incident",
  "get_alert_from_incident",
  "list_services",
  "get_service",
  "get_current_time",
]
preamble = """
You are an Incident Responder. Use PagerDuty tools to fetch and parse incidents.
Extract: environment, alert category, severity, timestamp, metric value, RunBook URL, and triggering query.
Always use tools — do not fabricate incident data.
"""


[orchestration.worker.metrics-analyst]
description = "Prometheus metrics analysis: validate alerts, check trends, identify anomalies"
turn_depth = 20
mcp_filter = [
  "execute_query",
  "execute_range_query",
  "list_metrics",
  "get_current_time",
]
preamble = """
You are a Metrics Analyst. Query Prometheus to validate alerts, check trends, and identify anomalies.
Always get current time before range queries. Do not fabricate metric values.
Report query results clearly with metric names, labels, and values.
"""


[orchestration.worker.log-analyst]
description = "Log analysis: search logs, analyze error patterns, correlate events across time"
turn_depth = 20
vector_stores = ["runbooks"]
mcp_filter = [
  "analyze_logs_*",
  "deduplicate_logs_*",
  "get_correlated_timeline_*",
  "get_current_time",
  "get_log_histogram",
  "list_log_fields",
]
preamble = """
You are a Log Analyst. Search and analyze logs for operational investigations.
Search runbooks for known failure patterns when errors match documented scenarios.
Report findings with timestamps, error messages, and relevant context.
"""

The production runtime for AI workflows

AURA is purpose-built to define, run, serve, integrate, and govern production AI workflows at scale. It ships the 80% of runtime infrastructure teams otherwise rebuild from scratch for every agent, so your engineers focus on the workflows that matter to your business.

Bring your models. Connect your stack. Deploy on your own infrastructure.

Understand

Root cause analysis grounded in the systems AURA already operates. It shows its work, so SREs can trust the diagnosis, and shares findings across agents and tools through open standards like MCP.

Act

Remediation, change routing, security response, cost control. AURA decides the appropriate action and executes routine work only when a human signs off, escalating novel or risky changes for review with complete context.

Improve

AURA compounds. Each incident feeds prevention, hardening, and change validation into the next cycle, making the system quicker to diagnose, safer to change, and cheaper to run.

Create production-ready AI workflows

Most teams spend months rebuilding the same runtime before anything useful ships. AURA eliminates 80% of that boilerplate. It's the open-source agent harness built to define, run, and govern production AI workflows at scale.

Bring your models. Connect your stack. Deploy on your own infrastructure.
‍

Compounding

Every finding sharpens the next decision. Patterns learned in earlier work carry forward, making later decisions faster and more accurate. These are compounding agents.

Composable

Built to fit your stack. Your data and tools snap into your existing stack: MCP-native, framework-agnostic, exposed through familiar interfaces, with no lock-in.

Open

Production trust cannot depend on a black box. Read the code, run it on your infrastructure, inspect its reasoning and actions, and decide what it is allowed to do.

Progressive

At first the SRE approves AURA actions. Over time, AURA learns from routine work. Eventually AURA reviews on its own, pausing only for rare calls that need judgment. Each stage earns the next.

Why open source?

Production trust cannot depend on a black box.

‍

Inspectability and trust

If an agent acts in production, your team must see how it reasons. Open source is the only honest answer.

No lock-in on orchestration

TOML + MCP means swap LLMs, extend workflows, connect your own systems. The harness is yours.

A strong and visible foundation

Configs and workflow patterns can be shared, reviewed, and version-controlled across teams.

Extensible and transparent

Extensible via MCP

AURA connects to any MCP backend: Mezmo, Datadog, Grafana, Elastic, PagerDuty. Add tool interfaces without changing orchestration logic.

Custom agentic workflows

Declarative configuration. Swap LLM providers with one-line changes. Extend workflows without vendor permission.

Transparent by design

See workflow execution, inputs and outputs, and the tool calls and context behind every agent decision.

Human and system interfaces

OpenAI-compatible /v1/chat/completions endpoint. Drop AURA into any existing chat UI without adapter code.

Scale without burning out

Production is outgrowing what platform teams can sustain.
AURA gives them room to breathe.

Monitor, respond, remediate, validate, document: the loop that keeps a system healthy. One engineer can run it for one service. Across hundreds, continuously, with no one free to watch, it stops being something a team can hold. AURA runs that loop the way they would, checks and corrections included, so it keeps watch and the team gets room to breathe again.

Get started on GitHub Talk to an engineer