# Observability

Observability in Tensor9 enables you to monitor all your customer appliances from a single observability platform. Logs, metrics, and traces flow from each customer's infrastructure to your control plane, which then routes each telemetry stream to the observability sinks you've configured, giving you unified visibility across all deployments regardless of where they run.

## How observability works

When you deploy applications through Tensor9, each customer appliance runs in isolated infrastructure (their AWS account, Google Cloud project, or private environment). Without observability, you would have no visibility into how these appliances are performing, whether deployments succeeded, or how customers are using your application.

Tensor9's observability system solves this by collecting telemetry from each appliance and forwarding it to your centralized observability platform:

<Steps>
  <Step title="Telemetry generation">
    Resources in the customer appliance (containers, Lambda functions, databases, load balancers) generate logs, metrics, and traces during normal operation.
  </Step>

  <Step title="Collection">
    Tensor9 uses [steady-state permissions](/fundamentals/permissions-model#steady-state-permissions) to collect telemetry from appliance resources. Collection runs inside the appliance, tapping the telemetry your resources already emit and forwarding it to your control plane (see [How telemetry flows](#how-telemetry-flows) for the per-runtime mechanism).
  </Step>

  <Step title="Forwarding">
    Collected telemetry is forwarded from the customer appliance to your control plane over secure connections. Your control plane then routes each telemetry stream to the observability sink (or sinks) you've configured. You decide which sources feed which sinks, per signal (see [Telemetry routing](#telemetry-routing)).
  </Step>

  <Step title="Analysis">
    Your team monitors all customer appliances from your observability platform. You can track deployment health, investigate incidents, analyze usage patterns, and troubleshoot issues across all customers from one place.
  </Step>
</Steps>

<Note>
  Observability collection uses steady-state permissions, which are always active and read-only. Customers do not need to approve observability access; it runs continuously to ensure you maintain visibility into appliance health.
</Note>

## What you can observe

Tensor9 collects four kinds of telemetry from customer appliances:

### Application logs

Logs from your application components running in customer appliances:

* **Container logs**: Logs from your workloads in EKS, GKE, AKS, or private Kubernetes, captured through your existing log agent (Datadog, OpenTelemetry, or Loki)
* **Function logs**: Execution logs from Lambda, Cloud Functions, or Azure Functions
* **Application logs**: Custom application logs written to CloudWatch, Cloud Logging, or other logging services

Every log record carries the `t9_appliance_id` and `t9_customer_name` tags (plus `t9_service_name` for CloudWatch-sourced logs), so you can filter and correlate logs across customers.

### Infrastructure metrics

Performance and health metrics from infrastructure resources:

* **Compute metrics**: CPU, memory, network for containers, VMs, or functions
* **Database metrics**: Connections, queries per second, replication lag, storage utilization
* **Storage metrics**: Object count, storage used, request rates for S3, GCS, or Azure Blob Storage
* **Load balancer metrics**: Request count, latency, error rates, healthy/unhealthy targets

### Custom metrics

Application-level metrics you instrument in your code:

* **Business metrics**: User signups, API calls, feature usage
* **Performance metrics**: Request duration, queue depth, cache hit rates
* **Error tracking**: Exception rates, failed operations, validation errors

### Distributed traces

Request traces across your application components:

* **Cross-service traces**: Track requests across microservices, databases, and external APIs
* **Performance analysis**: Identify slow operations and bottlenecks
* **Dependency mapping**: Visualize how services communicate within an appliance

## Observability sinks

An observability sink is a destination your telemetry is forwarded to. You can configure **multiple sinks** and route different telemetry to each. For example, application logs can go to Datadog while high-volume infrastructure logs go to CloudWatch. Tensor9 supports these sink types natively:

| Sink                                     | Logs | Metrics | Traces | Configuration                             |
| ---------------------------------------- | :--: | :-----: | :----: | ----------------------------------------- |
| **Datadog**                              |   ✓  |    ✓    |    ✓   | API key and site                          |
| **CloudWatch**                           |   ✓  |    ✓    |        | Default credentials or cross-account role |
| **OpenTelemetry (OTLP)** *(coming soon)* |   ✓  |    ✓    |        | OTLP endpoint and optional authentication |
| **Loki**                                 |   ✓  |         |        | Endpoint and credentials                  |
| **Elasticsearch** *(coming soon)*        |   ✓  |         |        | Cluster endpoint(s) and credentials       |
| **Prometheus Remote Write**              |      |    ✓    |        | Endpoint and credentials                  |

Any backend that speaks **OTLP** (New Relic, Sumo Logic, Honeycomb, Grafana Cloud, and most modern observability platforms) will be reachable through the **OpenTelemetry** sink once it is available. Grafana stacks are reached directly through the **Loki** (logs) and **Prometheus Remote Write** (metrics) sinks.

You configure sinks in your control plane, and Tensor9 applies the configuration to all of that app's appliances automatically.

## Telemetry routing

By default, each sink receives only the telemetry from its **matching source**: a Datadog sink receives Datadog telemetry, a CloudWatch sink receives CloudWatch logs, a Loki sink receives Loki logs. This keeps high-volume infrastructure logs (such as Kubernetes or CloudWatch control-plane logs) out of your SaaS sinks, where they would inflate cost, unless you deliberately send them there.

When you need a different topology, you control exactly which sources feed which sinks, independently for logs, metrics, and traces. You edit routing visually in the portal's Routing view; see [Route your telemetry](/fundamentals/configuring-observability#route-your-telemetry).

### Telemetry sources

Tensor9 recognizes telemetry from these sources in your appliances:

| Source                      | Logs | Metrics | Traces |
| --------------------------- | :--: | :-----: | :----: |
| **Datadog** (Datadog Agent) |   ✓  |    ✓    |    ✓   |
| **OpenTelemetry** (OTLP)    |   ✓  |    ✓    |    ✓   |
| **Loki**                    |   ✓  |         |        |
| **Prometheus**              |      |    ✓    |        |
| **CloudWatch**              |   ✓  |         |        |

### Per-signal routing

Routing is **per signal**. You can send a source's logs to one sink and its metrics to another, or fan a single source out to several sinks. For each sink, logs, metrics, and traces are routed independently:

* **Default** (no routes set): the sink receives only its matching source for each signal. OpenTelemetry and Elasticsearch sinks have no matching source (OTLP is vendor-neutral, so it is routed explicitly rather than matched by default), so they receive nothing until you route something to them. Both sink types are coming soon.
* **Routed**: the sink receives exactly the sources you connect, for that signal.
* **Disabled**: remove every route for a signal and that signal is no longer delivered to that sink.
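
The three routing states above can be modeled in a few lines. This is a minimal sketch, not Tensor9's implementation: the function name `resolve_sources`, the lowercase identifiers, and the convention that `None` means "no routes set" while an empty list means "every route removed" are assumptions made for illustration. Pairing the Prometheus Remote Write sink with the Prometheus source is also an assumption inferred from the tables above.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set

# Signals each source can emit, taken from the "Telemetry sources" table.
SOURCE_SIGNALS: Dict[str, Set[str]] = {
    "datadog": {"logs", "metrics", "traces"},
    "opentelemetry": {"logs", "metrics", "traces"},
    "loki": {"logs"},
    "prometheus": {"metrics"},
    "cloudwatch": {"logs"},
}

# Default matching source per sink type (hypothetical identifiers).
# OpenTelemetry and Elasticsearch sinks have no matching source.
MATCHING_SOURCE: Dict[str, Optional[str]] = {
    "datadog": "datadog",
    "cloudwatch": "cloudwatch",
    "loki": "loki",
    "prometheus_remote_write": "prometheus",  # assumed pairing
    "opentelemetry": None,
    "elasticsearch": None,
}


def resolve_sources(
    sink_type: str,
    signal: str,
    routes: Optional[Iterable[str]] = None,
) -> FrozenSet[str]:
    """Return the sources that deliver `signal` to a sink.

    routes=None  -> Default: only the sink's matching source.
    routes=[]    -> Disabled: nothing is delivered for this signal.
    routes=[...] -> Routed: exactly the connected sources.
    """
    if routes is None:
        source = MATCHING_SOURCE.get(sink_type)
        if source is not None and signal in SOURCE_SIGNALS[source]:
            return frozenset({source})
        return frozenset()
    # Keep only sources that actually emit the requested signal.
    return frozenset(s for s in routes if signal in SOURCE_SIGNALS.get(s, set()))
```

For example, `resolve_sources("datadog", "logs", ["datadog", "cloudwatch"])` fans CloudWatch logs into a Datadog sink alongside its matching source, while `resolve_sources("opentelemetry", "logs")` returns an empty set until a route is added.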

<Tip>
  You can configure and manage observability sink settings from the **Vendor Portal** under **Observability**, or via the CLI.
</Tip>

## Configuring observability

Observability is **off by default**; you turn it on per appliance, and optionally per resource, in the vendor portal.

<Card title="Configuring Observability" icon="sliders" href="/fundamentals/configuring-observability">
  Add sinks, wire up routing, control which appliances and resources are observed, and instrument telemetry in your origin stack.
</Card>

## How the pipeline works

Under the hood, telemetry moves through a fixed pipeline:

<Frame>
  <img src="https://mintcdn.com/tensor9/xJGBjg2i2Jzf4UCk/images/diagrams/observability-pipeline.svg?fit=max&auto=format&n=xJGBjg2i2Jzf4UCk&q=85&s=6ff4cb89dd4e5ab3af0a3baaea37c9c8" alt="Observability pipeline: native cloud logging, to a forwarder, to a vendor-owned stream, to a router, to your sinks" width="1110" height="372" data-path="images/diagrams/observability-pipeline.svg" />
</Frame>

1. **Buffered in native cloud logging.** Collected logs, metrics, and traces are written to the customer's native logging service (CloudWatch Logs on AWS, Cloud Logging on Google Cloud, Azure Monitor Logs on Azure), which buffers them and doubles as the [customer audit trail](#customer-audit-trail).
2. **Forwarded to your control plane.** A forwarder tags each record at the edge with its [appliance and customer identity](#appliance-identification), then sends it on to a stream that you (the vendor) own.
3. **Pushed to your sinks.** A router reads the stream, applies your [routing](#telemetry-routing), and pushes each telemetry stream to its configured sink. Sink credentials are applied in your control plane and never leave it; they are never deployed to a customer appliance.

The observability pipeline scales automatically with the volume of telemetry, so it absorbs traffic spikes without any tuning on your part.
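
To make the three stages concrete, here is a toy in-memory model of the flow. It is a sketch under stated assumptions rather than a description of the real components: the `forward` and `route` functions, the `source` field on each record, and the shape of the `routes` mapping are all hypothetical. Only the `t9_appliance_id` and `t9_customer_name` tag names come from this page.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def forward(
    buffered: Iterable[Record], appliance_id: str, customer_name: str
) -> Iterator[Record]:
    """Stage 2: tag each buffered record with appliance and customer identity."""
    for record in buffered:
        yield {
            **record,
            "t9_appliance_id": appliance_id,
            "t9_customer_name": customer_name,
        }


def route(
    stream: Iterable[Record],
    routes: Dict[str, List[str]],
    sinks: Dict[str, Callable[[Record], None]],
) -> None:
    """Stage 3: push each record to every sink its source is routed to."""
    for record in stream:
        for sink_name in routes.get(record["source"], []):
            sinks[sink_name](record)


# Stage 1 stand-in: records already buffered in the native logging service.
buffered = [
    {"source": "datadog", "message": "request served"},
    {"source": "cloudwatch", "message": "vpc flow log entry"},
]

# Sinks are modeled as plain lists; real sinks would be network clients
# whose credentials live only in the control plane.
delivered: Dict[str, List[Record]] = {"datadog": [], "cloudwatch": []}
sinks = {name: bucket.append for name, bucket in delivered.items()}
routes = {"datadog": ["datadog"], "cloudwatch": ["cloudwatch"]}

route(forward(buffered, "appliance-123", "acme-corp"), routes, sinks)
```

After the call, each list in `delivered` holds the records routed to that sink, and every record carries both identity tags.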

## Customer audit trail

Before any telemetry leaves the customer's environment, it is recorded in their own native log service: CloudWatch Logs on AWS, Cloud Logging on Google Cloud, and Azure Monitor Logs on Azure. The collected telemetry passes through this local record on its way to your control plane, so the customer keeps a complete, independent copy of exactly what was captured and forwarded out of their account. They can audit everything that crosses their boundary. In the future, customers will also be able to redact and filter this telemetry before it is forwarded, giving them direct control over what leaves their environment.

## Observability across form factors

Observability collection adapts to each appliance's [form factor](/fundamentals/key-concepts#form-factor):

| Form Factor            | Log Collection              | Metrics Collection                               | Trace Collection                                    |
| ---------------------- | --------------------------- | ------------------------------------------------ | --------------------------------------------------- |
| **AWS**                | CloudWatch Logs             | CloudWatch Metrics, resource-specific metrics    | X-Ray or application instrumentation                |
| **Google Cloud**       | Cloud Logging               | Cloud Monitoring, resource-specific metrics      | Cloud Trace or application instrumentation          |
| **Azure**              | Azure Monitor Logs          | Azure Monitor Metrics, resource-specific metrics | Application Insights or application instrumentation |
| **DigitalOcean**       | Logs via Fluent Bit/Fluentd | Prometheus metrics                               | OpenTelemetry Collector                             |
| **Private Kubernetes** | Logs via Fluent Bit/Fluentd | Prometheus metrics                               | OpenTelemetry Collector                             |
| **On-prem**            | Logs via Fluent Bit/Fluentd | Prometheus metrics                               | OpenTelemetry Collector                             |

Tensor9 provisions the appropriate collection infrastructure for each environment during compilation.

## How telemetry flows

Tensor9 captures the telemetry your application already emits and carries it to your sinks, without new agents or application-code changes. How it taps in depends on the runtime:

### Kubernetes

Tensor9 deploys a lightweight **collection DaemonSet** to the cluster and redirects your existing telemetry agents to it. Whatever your workloads already run (the **Datadog Agent**, an **OpenTelemetry Collector**, **Prometheus**, or **Loki**) keeps running unchanged but sends through the Tensor9 collector, which tags its logs, metrics, and traces with the appliance and customer metadata and forwards them to your control plane. Your workloads need no changes.

### AWS Lambda

Tensor9 injects a **Lambda extension** into your functions during compilation. The extension intercepts the function's telemetry (Datadog, OpenTelemetry, and so on), tags it, and forwards it, again without changing your function code.

On both Kubernetes and Lambda, the capture happens inside the customer's environment; only the resulting telemetry leaves it.

### Native cloud logs

Both paths above work by emitting your application's logs, metrics, and traces into the **native cloud log service** (CloudWatch Logs on AWS, and the equivalent on other clouds). Tensor9 collects by **mirroring that service**, so anything written to it flows to your sinks.

A useful consequence: the cloud's own logs are forwarded too, with no extra setup. Control-plane logs (for example, EKS control-plane logs) and VPC flow logs reach your sinks through the same path.

## Appliance identification

Every forwarded telemetry record is stamped with two Tensor9 identity tags so you can attribute it to an appliance and customer:

* **t9\_appliance\_id**: Tensor9's unique identifier for the appliance
* **t9\_customer\_name**: Customer that owns the appliance

The emitting **service** is identified by the source's own convention: CloudWatch-sourced telemetry adds a **`t9_service_name`** tag (derived from the log group), while Datadog, Loki, and Prometheus telemetry carry the service in their native tag/label (Datadog `service`, Kubernetes `app`/`container`). Tensor9 doesn't override those.

These tags let you filter, group, and correlate telemetry across customers and services. They're distinct from `instance_id`, the origin-stack variable you tag resources with (see [Configure telemetry in your origin stack](/fundamentals/configuring-observability#configure-telemetry-in-your-origin-stack)).

### Example: Filtering logs by customer

In Datadog:

```
service:myapp-api t9_customer_name:acme-corp
```

In Grafana Loki:

```
{t9_customer_name="acme-corp", app="myapp-api"}
```
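
If you export records and post-process them yourself, the same filter can be expressed over plain dictionaries. This is an illustrative sketch only: `filter_records` is a hypothetical helper, and the record shape (a flat dict with a `service` key) is an assumption, not a Tensor9 export format.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def filter_records(records: Iterable[Record], **tags: str) -> List[Record]:
    """Keep records whose fields equal every given tag value."""
    return [r for r in records if all(r.get(k) == v for k, v in tags.items())]


# Hypothetical sample records carrying the Tensor9 identity tags.
records = [
    {"t9_customer_name": "acme-corp", "t9_appliance_id": "appliance-1",
     "service": "myapp-api", "message": "ok"},
    {"t9_customer_name": "globex", "t9_appliance_id": "appliance-2",
     "service": "myapp-api", "message": "ok"},
    {"t9_customer_name": "acme-corp", "t9_appliance_id": "appliance-1",
     "service": "myapp-worker", "message": "job done"},
]

acme_api = filter_records(records, t9_customer_name="acme-corp", service="myapp-api")
```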

## Unified dashboards

With telemetry from all appliances flowing to your observability sinks, you can create unified dashboards that aggregate metrics across customers:

* **Deployment health**: Track successful vs. failed deployments across all appliances
* **Performance trends**: Compare response times and error rates across customers
* **Resource utilization**: Monitor database CPU, storage usage, function execution counts
* **Version adoption**: See which customers are running which versions

You can also create customer-specific dashboards filtered to a single `t9_appliance_id` or `t9_customer_name` for troubleshooting individual appliances.

## Telemetry and customer data

<Note>
  **Your responsibility**: Tensor9 does not guarantee that your logs do not contain customer data. It is your responsibility as the vendor to ensure that your application does not log sensitive customer data (PII, financial information, proprietary content, or customer business data) that will be forwarded to your observability sink.
</Note>

Observability telemetry should contain application logs and infrastructure metrics, not customer business data. While logs may include operational metadata (timestamps, user IDs, API endpoints, error codes), they should never include sensitive customer information.

**You must take precautions to prevent customer data from appearing in logs:**

* **Sanitize logs**: Remove or redact sensitive information before logging. Never log request/response payloads containing customer data.
* **Use structured logging**: Log metadata and identifiers, not full payloads. Log `user_id: 12345` instead of the entire user object.
* **Configure log levels**: Use DEBUG/INFO for development, WARN/ERROR for production. Avoid verbose logging that may capture customer data.
* **Review what you collect**: Audit what data flows to your observability sink. Test your logging to ensure no customer data leaks through.
* **Filter at the source**: Configure log filters to exclude patterns that may contain sensitive data (credit card numbers, SSNs, API keys).

Tensor9 forwards whatever telemetry your application emits; it is your responsibility to ensure that telemetry does not contain customer data.
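
One way to sanitize and filter at the source is a redacting filter in your application's logging setup. The sketch below uses Python's standard `logging` module. The two patterns and the `[REDACTED_*]` placeholders are illustrative assumptions, not a complete or Tensor9-provided rule set. Simple patterns like these can both miss sensitive values and over-redact harmless long numbers, so treat them as a starting point.

```python
import logging
import re

# Illustrative patterns only; real detection needs broader, tested rules.
REDACTION_PATTERNS = [
    # 13 to 16 digits, optionally separated by single spaces or dashes.
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[REDACTED_CARD]"),
    # US Social Security number in the common 3-2-4 layout.
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
]


def redact(text: str) -> str:
    """Replace every match of the patterns above with a placeholder."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Redact sensitive patterns from a record before any handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges msg with its args, so values passed as
        # formatting arguments are redacted too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the (now redacted) record


handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.WARNING)

# Log identifiers and metadata, not payloads.
logger.warning("payment failed for user_id=%s card=%s", 12345, "4111 1111 1111 1111")
```

Attaching the filter to the handler means every record that handler emits is redacted, regardless of which logger produced it.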

## Observability permissions

Telemetry collection requires [steady-state permissions](/fundamentals/permissions-model#steady-state-permissions) in customer appliances. These permissions are:

* **Read-only**: Cannot modify infrastructure or customer data
* **Always active**: Observability runs continuously without customer approval
* **Scoped to vendor resources**: Can only access resources deployed by your application

Example steady-state policy for observability in AWS:

```json theme={null}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/tensor9:instance": "${var.instance_id}"
        }
      }
    }
  ]
}
```

This policy allows reading metrics only from resources tagged with the appliance's `instance_id`.
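
The `${var.instance_id}` placeholder in the example stands for a concrete value per appliance. As a rough illustration of what the resolved document looks like, the sketch below builds the same JSON for a given identifier. The function name `render_steady_state_policy` and the sample identifier are hypothetical, and how Tensor9 actually substitutes the variable is not described here.

```python
import json


def render_steady_state_policy(instance_id: str) -> str:
    """Return the example policy above with the instance_id filled in."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["cloudwatch:GetMetricData", "cloudwatch:ListMetrics"],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        # Tag key taken from the example; value is per appliance.
                        "aws:ResourceTag/tensor9:instance": instance_id
                    }
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical identifier, for illustration only.
print(render_steady_state_policy("example-instance-id"))
```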

## Alerting and incident response

Once telemetry flows to your observability platform, you can configure alerts that notify your team when issues occur across customer appliances:

* **Deployment failures**: Alert when a deployment to any appliance fails
* **High error rates**: Notify when error rates exceed thresholds
* **Performance degradation**: Alert on slow response times or database latency
* **Resource exhaustion**: Warn when databases approach storage limits

Alerts can include the `t9_appliance_id` and `t9_customer_name`, allowing you to quickly identify which customer is affected and route incidents to the right team.
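
Most observability platforms express this kind of rule in their own alert configuration. The logic behind a per-appliance error-rate alert can still be sketched in plain Python. Everything here except the two `t9_*` tag names is an assumption for illustration: the `level` field, the `appliances_over_error_rate` helper, and the threshold value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def appliances_over_error_rate(
    records: Iterable[Record], threshold: float
) -> List[Tuple[str, str]]:
    """Return (customer, appliance) pairs whose share of ERROR records exceeds threshold."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    errors: Dict[Tuple[str, str], int] = defaultdict(int)
    for record in records:
        key = (record["t9_customer_name"], record["t9_appliance_id"])
        totals[key] += 1
        if record.get("level") == "ERROR":
            errors[key] += 1
    return sorted(key for key in totals if errors[key] / totals[key] > threshold)


# Hypothetical sample: acme-corp has 2 errors in 4 records, globex has none.
sample = (
    [{"t9_customer_name": "acme-corp", "t9_appliance_id": "appliance-1", "level": "ERROR"}] * 2
    + [{"t9_customer_name": "acme-corp", "t9_appliance_id": "appliance-1", "level": "INFO"}] * 2
    + [{"t9_customer_name": "globex", "t9_appliance_id": "appliance-2", "level": "INFO"}] * 2
)

affected = appliances_over_error_rate(sample, threshold=0.25)
```

Keying the result on both tags gives the alert the customer name for routing and the appliance identifier for troubleshooting.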

## Best practices

<AccordionGroup>
  <Accordion title="Never log customer data">
    It is your responsibility to ensure your application does not log sensitive customer data (PII, financial information, customer business data). Tensor9 forwards whatever telemetry your application emits; it does not filter or sanitize logs for customer data. Implement log sanitization in your application code, avoid logging request/response payloads, and regularly audit what data flows to your observability sink.
  </Accordion>

  <Accordion title="Tag all resources with instance_id">
    Ensure every resource in your origin stack is tagged with the `instance_id` variable. This enables filtering telemetry by appliance and ensures observability permissions are correctly scoped.
  </Accordion>
</AccordionGroup>

## Related topics

* [**Permissions Model**](/fundamentals/permissions-model): Understanding steady-state permissions for observability
* [**Appliances**](/fundamentals/appliances): Customer environments where telemetry is collected
* [**Deployments**](/fundamentals/deployments): Tracking deployment success through observability
* [**Operations**](/fundamentals/operations): Using observability to inform remote operations