Prosody intelligence
for voice pipelines.

An SSM-based model that runs parallel to your ASR, extracting prosodic features and streaming per-utterance classification with forward prediction. O(n) complexity. LoRA fine-tunable on your data.

Mamba SSM · <200 ms p99 · 32-dim prosodic features · Forward prediction

Output format

Per-utterance emotion classification with valence-arousal-dominance vectors, word-level alignment, and vertical-specific state mappings. Streamed via WebSocket or polled via REST.

wss://api.prosody.ai/v1/stream · connected
sess_8f2a1b · contact_center

Transcript
00:02 · CUSTOMER · frustrated 0.84

Yeah, hi. This is the third time I'm calling about this billing issue.

Prosodic features
f0: 198 Hz
energy: -9.8 dB
jitter: 1.5%
rate: 5.2/s
Response · utterance 1/8
{
  "emotion": "frustrated",
  "confidence": 0.84,
  "vad": [0.3, 0.65, 0.58],
  "vertical_state": "impatient",
  "escalation_risk": "medium",
  "prosody": { f0: 198, energy: -9.8, ...28 dims },
  "predictions": { ... }
}
Forward predictions · conf: 0.34
will_escalate: 38%
churn_risk: 22%
final_csat: 2.6/5.0
recommended_tone: "empathetic"
V 0.30 · A 0.65 · D 0.58
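As a sketch, a client might route each streamed per-utterance payload like the one above. The handler name and the specific actions are illustrative assumptions; only the response fields come from the documented output.

```python
import json

def route_utterance(message: str) -> str:
    """Pick a pipeline action from one streamed per-utterance payload."""
    event = json.loads(message)
    if event["escalation_risk"] == "high":
        return "page_supervisor"
    if event["emotion"] == "frustrated" and event["confidence"] >= 0.8:
        # Follow the model's tone suggestion when frustration is confident
        return "switch_tone:" + event["predictions"].get("recommended_tone", "empathetic")
    return "continue"

msg = json.dumps({
    "emotion": "frustrated",
    "confidence": 0.84,
    "vad": [0.3, 0.65, 0.58],
    "vertical_state": "impatient",
    "escalation_risk": "medium",
    "predictions": {"recommended_tone": "empathetic"},
})
action = route_utterance(msg)  # "switch_tone:empathetic"
```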

Infrastructure

Designed for production voice pipelines. Streaming inference, deterministic outputs, horizontal scaling.

Sub-200ms p99

Streaming inference with per-utterance output. Warm start maintains state across the session.

32-dim feature extraction

F0, energy, jitter, shimmer, HNR, MFCCs, spectral centroid, speech rate, pause duration. Extracted per frame.
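A minimal numpy sketch of one of these per-frame features, RMS energy in dB. The 25 ms / 10 ms frame and hop sizes are illustrative assumptions, not the model's actual settings.

```python
import numpy as np

def frame_energy_db(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Per-frame RMS energy in dB (frame/hop in samples: 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop][:n_frames]
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0) on silence

# A full-scale sine has RMS 1/sqrt(2), i.e. about -3.01 dB
t = np.arange(16000) / 16000
energies = frame_energy_db(np.sin(2 * np.pi * 440 * t))
```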

Multi-head output

Emotion softmax, VAD regression, and vertical-specific state mapping from a single forward pass.

Drop-in integration

REST and WebSocket APIs. Python and JavaScript SDKs. Runs parallel to any STT provider.

On-premise available

Deploy on your infrastructure. Audio never leaves your VPC. SOC 2 Type II compliant.

800+ QPS per node

O(n) inference complexity via SSM architecture. Horizontal scaling with stateless workers.

ConversationPredictor

Forward prediction

A causal GRU that consumes per-utterance ProsodySSM outputs and predicts session-level outcomes at every timestep. 8 predictive heads, O(1) incremental updates, confidence scaling with sequence length.

Confidence scales with sequence length

c = min(1.0, 0.3 + 0.7 · n/W), where n is the number of utterances observed and W is the confidence window. Predictions sharpen as the GRU accumulates utterance history.
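The schedule above is a one-liner. W = 8 below mirrors the "utterance 1/8" counter in the demo; it is an assumption, not a documented default.

```python
def prediction_confidence(n_utterances: int, window: int = 8, floor: float = 0.3) -> float:
    """c = min(1.0, floor + (1 - floor) * n / W): linear ramp from floor to 1.0 over W utterances."""
    return min(1.0, floor + (1.0 - floor) * n_utterances / window)
```

With window=8, confidence starts at 0.3 before any utterances, reaches 0.65 after four, and saturates at 1.0 from the eighth utterance on.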

After 3 utterances · Confidence: Low
Escalation Risk: 42%
Predicted CSAT: 3.8/5.0
Churn Probability: 18%
Tone Recommendation: Monitor — maintain current approach

will_escalate

Binary sigmoid head. P(escalation) = 0.73 after 3 utterances. Supervised against session-level escalation labels.

final_csat

Regression head, range [1.0, 5.0]. Predicts final CSAT at every timestep. MSE loss with temporal weighting.

churn_risk

Binary sigmoid head. P(churn within 30d) derived from prosodic trajectory patterns. Trains on CRM outcome data.

recommended_tone

6-class softmax head. Outputs empathetic | calm | enthusiastic | professional | reassuring | apologetic.

Architecture

4-layer Mamba SSM with S4D diagonal state matrices. Prosodic and phonetic features fused into a 256-dim representation, processed with O(n) recurrence. Multi-head output: emotion softmax + VAD regression + vertical state mapping.
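A shape-level numpy sketch of the fusion and multi-head readout described above. The weights are random, the SSM layers are elided (a mean pool stands in), and the six-way emotion head is an assumption; only the 28 + 4 → 256 fusion and the head types come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EMOTIONS, D_FUSED = 6, 256  # emotion-class count is an assumption

W_fuse = rng.normal(0, 0.02, (28 + 4, D_FUSED))         # prosodic + phonetic -> fused
W_emotion = rng.normal(0, 0.02, (D_FUSED, N_EMOTIONS))  # emotion softmax head
W_vad = rng.normal(0, 0.02, (D_FUSED, 3))               # valence/arousal/dominance head

def forward(prosodic: np.ndarray, phonetic: np.ndarray):
    """prosodic: (T, 28), phonetic: (T, 4). SSM blocks elided; mean pool stands in."""
    fused = np.tanh(np.concatenate([prosodic, phonetic], axis=-1) @ W_fuse)  # (T, 256)
    pooled = fused.mean(axis=0)                                              # global pool
    logits = pooled @ W_emotion
    emotion_probs = np.exp(logits - logits.max())
    emotion_probs /= emotion_probs.sum()                 # softmax over emotion classes
    vad = 1 / (1 + np.exp(-(pooled @ W_vad)))            # squash regressions into [0, 1]
    return emotion_probs, vad

probs, vad = forward(rng.normal(size=(12, 28)), rng.normal(size=(12, 4)))
```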

Model Pipeline

Audio Input
Feature Extraction
Prosody
28 dimensions
Phonetic
4 dimensions
Fusion Layer (256d)
SSM 1
SSM 2
SSM 3
SSM 4
Global Pool
Emotion
Softmax
VAD
Regression
Prediction
Conv. Level

Feature Extraction

Prosodic Features
Pitch (F0): 185 Hz
Energy: -12.4 dB
Jitter: 1.2%
Shimmer: 3.8%
Spectral Features: 125 Hz – 8 kHz band
Temporal Features
4.2 syllables/sec · 0.34 s avg pause · 78% voiced ratio

Benchmark Results

73.5% · Weighted Acc (speaker-disjoint validation)
73.7% · Unweighted Acc (mean per-class recall)
2.1M · Parameters (S4D-Lin backbone)
18K+ · Training samples (4 corpora combined)

Trained on CREMA-D, RAVDESS, TESS, and Orpheus corpora with speaker-disjoint validation. ProsodySSM outperforms transformer baselines (+8.3% WA) while maintaining O(n) complexity via S4D-Lin initialization. Human inter-annotator agreement on SER is typically 60-70%.

Capabilities

Single model, vertical-adaptive output. Fine-tune on your distribution. Configure thresholds per deployment.

LoRA fine-tuning

Low-rank adapters on the SSM blocks. Train on your labeled data without retraining the base model. Outcome-weighted loss with active learning sample selection.

  • Two-phase training pipeline
  • 70/30 original/feedback mix ratio
  • Continuous improvement from production
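The 70/30 mix above can be sketched as a batch sampler. The function name and batch size are illustrative assumptions; only the mix ratio comes from the text.

```python
import random

def mixed_batch(original: list, feedback: list, batch_size: int = 10,
                mix: float = 0.7, seed: int = 0) -> list:
    """Draw a training batch with ~70% original-corpus and ~30% feedback samples."""
    rng = random.Random(seed)
    n_orig = round(batch_size * mix)
    return rng.sample(original, n_orig) + rng.sample(feedback, batch_size - n_orig)

batch = mixed_batch([f"orig_{i}" for i in range(100)],
                    [f"fb_{i}" for i in range(100)])  # 7 original + 3 feedback samples
```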

Alert thresholds

Define per-vertical alert rules via alert_thresholds in VerticalConfig. Webhook dispatch on threshold breach. Composable with any event bus.

  • Per-vertical threshold config
  • Webhook dispatch on trigger
  • Kafka topic routing
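A minimal sketch of what a VerticalConfig with alert_thresholds might look like and how breaches could dispatch. The field names beyond alert_thresholds, the metric names, and the check function are assumptions; a real config would come from the API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerticalConfig:
    vertical: str
    alert_thresholds: dict[str, float] = field(default_factory=dict)

def check_thresholds(cfg: VerticalConfig, metrics: dict[str, float],
                     dispatch: Callable[[str, float], None]) -> list[str]:
    """Fire dispatch (e.g. a webhook POST) for every metric at or above its threshold."""
    breached = [m for m, limit in cfg.alert_thresholds.items()
                if metrics.get(m, 0.0) >= limit]
    for name in breached:
        dispatch(name, metrics[name])
    return breached

cfg = VerticalConfig("contact_center", {"escalation_risk": 0.7, "churn_risk": 0.5})
fired: list[tuple[str, float]] = []
check_thresholds(cfg, {"escalation_risk": 0.73, "churn_risk": 0.22},
                 lambda name, value: fired.append((name, value)))
```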

Outcome feedback loop

Submit session outcomes (CSAT, escalation, churn) via feedback API. Active learning selects high-value samples. Model improves on your production distribution.

  • CRM outcome ingestion
  • Active learning selection
  • Contradiction-weighted retraining
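A sketch of assembling one outcome-feedback payload. The field names are assumptions based on the outcomes listed above (CSAT, escalation, churn), not the documented API schema.

```python
import json

def build_feedback(session_id: str, csat: float,
                   escalated: bool, churned_30d: bool) -> str:
    """Serialize one session outcome for the feedback API (field names assumed)."""
    if not 1.0 <= csat <= 5.0:
        raise ValueError("csat must be in [1.0, 5.0]")
    return json.dumps({
        "session_id": session_id,
        "outcomes": {
            "final_csat": csat,
            "escalated": escalated,
            "churned_30d": churned_30d,
        },
    })

payload = build_feedback("sess_8f2a1b", csat=2.5, escalated=True, churned_30d=False)
```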

Parallel ASR deployment

Runs as a sidecar to your STT pipeline. Consumes the same audio stream. Output aligns to word timestamps from any ASR provider.

  • Provider-agnostic
  • Word-level timestamp alignment
  • No pipeline changes required
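One simple way to do the word-level alignment described above: tag every ASR word whose midpoint falls inside the utterance span. The midpoint rule and the dict shapes are illustrative assumptions.

```python
def align_words(utterance: dict, words: list[dict]) -> list[dict]:
    """Attach the utterance-level emotion label to every ASR word whose
    midpoint falls inside the utterance span (timestamps in seconds)."""
    start, end = utterance["start"], utterance["end"]
    return [
        {**w, "emotion": utterance["emotion"]}
        for w in words
        if start <= (w["start"] + w["end"]) / 2 < end
    ]

utt = {"start": 0.0, "end": 2.4, "emotion": "frustrated"}
asr_words = [
    {"word": "yeah", "start": 0.1, "end": 0.4},
    {"word": "hi", "start": 0.5, "end": 0.7},
    {"word": "okay", "start": 2.5, "end": 2.9},  # belongs to the next utterance
]
aligned = align_words(utt, asr_words)
```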

Vertical taxonomies

8 predefined verticals with domain-specific state enums, metrics, and alert thresholds. Single base model, vertical-adaptive output via VerticalConfig.

  • Contact center, healthcare, sales, legal...
  • VAD-based state disambiguation
  • Per-vertical confidence thresholds

Voice AI agents

Stream prosodic signals to your LLM context window. The agent adapts tone and strategy based on real-time VAD vectors.
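A sketch of rendering the latest prosodic read as a line for an agent's system prompt. The exact formatting is an assumption; only the VAD vector, emotion, and vertical state come from the output schema.

```python
def prosody_context(vad: list[float], emotion: str, state: str) -> str:
    """Render the latest prosodic read as one line of LLM context."""
    v, a, d = vad
    return (f"[prosody] caller sounds {emotion} ({state}); "
            f"valence={v:.2f} arousal={a:.2f} dominance={d:.2f}")

line = prosody_context([0.3, 0.65, 0.58], "frustrated", "impatient")
```

An agent loop would prepend this line to the system prompt at each utterance boundary, letting the LLM adapt tone without any audio access of its own.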

Post-call analytics

Batch process recordings. Output emotion timelines, session-level predictions, and vertical-specific metrics for every conversation.

Real-time orchestration

Kafka event streaming on threshold breach. Route to supervisors, trigger script changes, or dispatch webhooks — all at utterance boundaries.

Quick start

Extract prosodic features. Run inference. Map to your vertical.

Python
from prosody import ProsodyClient

client = ProsodyClient(api_key="your-key")

result = client.analyze(
    audio_file="recording.wav",
    features=["emotion", "prosody"]
)

print(result.emotion)  # "happy"
print(result.valence)  # 0.72
print(result.arousal)  # 0.65
JavaScript
import { Prosody } from '@prosody/sdk';

const client = new Prosody({ apiKey: 'your-key' });

const result = await client.analyze({
  audio: audioBlob,
  features: ['emotion', 'prosody']
});

console.log(result.emotion);  // "happy"
console.log(result.valence);  // 0.72
console.log(result.arousal);  // 0.65
Platform

ProsodyCRM

Management plane for your ProsodySSM deployment. API keys, vertical configuration, transcript analysis, outcome feedback, and model fine-tuning.

API Keys

Generate API keys to integrate Prosody directly into your own apps, pipelines, or services. Full programmatic access.

  • REST & WebSocket APIs
  • Python & JS SDKs
  • Rate limits & usage tracking

Integrations

Connect to AWS Transcribe, Salesforce, HubSpot, Zendesk, and more. One-click OAuth setup.

  • Cloud storage sync
  • CRM connectors
  • Webhook events

Transcript Analysis

Upload recordings or sync from cloud storage. View emotion timelines, word-level annotations, and summaries.

  • Batch processing
  • Emotion timeline
  • Export to CRM

Custom Taxonomies

Define emotion states for your industry. Map base emotions to domain-specific labels with custom thresholds.

  • Industry presets
  • State mapping
  • Threshold tuning

Analytics

Track emotion trends over time. Monitor API usage, identify patterns, and export reports.

  • Usage metrics
  • Trend analysis
  • CSV/PDF export

Team Management

Invite team members with role-based permissions. Admin, member, and viewer roles available.

  • SSO support
  • Role permissions
  • Audit logs
Pro

Model Fine-tuning

Train custom LoRA adapters on your data. Upload labeled samples, fine-tune on GCP, and deploy your own model.

  • LoRA fine-tuning
  • Hosted on GCP
  • Your data, your model
Enterprise

Event Orchestration

Kafka-powered event streaming for real-time emotion pipelines. React to emotion events at scale with sub-100ms latency.

  • Kafka topics per event type
  • Dead letter queues
  • Event replay & audit

Start building

Free tier available. API keys provision instantly. Scale when you need throughput.

Get Started

Contact

Integration architecture, vertical configuration, on-premise deployment, or custom LoRA training. Reach out.

San Francisco, CA

Send us a message