Prosody intelligence
for voice pipelines.

An SSM-based model that runs parallel to your ASR, extracting prosodic features and streaming per-utterance classification with forward prediction. O(n) complexity. LoRA fine-tunable on your data.

Mamba SSM · <200 ms p99 · 32-dim prosodic features · Forward prediction

Output format

Per-utterance emotion classification with valence-arousal-dominance vectors, word-level alignment, and vertical-specific state mappings. Streamed via WebSocket or polled via REST.

wss://api.prosody.ai/v1/stream · connected
sess_8f2a1b · contact_center

Transcript
00:02 · CUSTOMER · frustrated 0.84

Yeah, hi. This is the third time I'm calling about this billing issue.

Prosodic features
f0: 198 Hz
energy: -9.8 dB
jitter: 1.5%
rate: 5.2/s
Response · utterance 1/8
{
  "emotion": "frustrated",
  "confidence": 0.84,
  "vad": [0.3, 0.65, 0.58],
  "vertical_state": "impatient",
  "escalation_risk": "medium",
  "prosody": { f0: 198, energy: -9.8, ...28 dims },
  "predictions": { ... }
}
Forward predictions · conf: 0.34
will_escalate: 38%
churn_risk: 22%
final_csat: 2.6/5.0
recommended_tone: "empathetic"
V 0.30 · A 0.65 · D 0.58
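As a sketch, a client might route each streamed per-utterance payload like the one above. The handler name and the specific actions are illustrative assumptions; only the response fields come from the documented output.

```python
import json

def route_utterance(message: str) -> str:
    """Pick a pipeline action from one streamed per-utterance payload."""
    event = json.loads(message)
    if event["escalation_risk"] == "high":
        return "page_supervisor"
    if event["emotion"] == "frustrated" and event["confidence"] >= 0.8:
        # Follow the model's tone suggestion when frustration is confident
        return "switch_tone:" + event["predictions"].get("recommended_tone", "empathetic")
    return "continue"

msg = json.dumps({
    "emotion": "frustrated",
    "confidence": 0.84,
    "vad": [0.3, 0.65, 0.58],
    "vertical_state": "impatient",
    "escalation_risk": "medium",
    "predictions": {"recommended_tone": "empathetic"},
})
action = route_utterance(msg)  # "switch_tone:empathetic"
```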

Infrastructure

Designed for production voice pipelines. Streaming inference, deterministic outputs, horizontal scaling.

Sub-200ms p99

Streaming inference with per-utterance output. Warm start maintains state across the session.

32-dim feature extraction

F0, energy, jitter, shimmer, HNR, MFCCs, spectral centroid, speech rate, pause duration. Extracted per frame.
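A minimal numpy sketch of one of these per-frame features, RMS energy in dB. The 25 ms / 10 ms frame and hop sizes are illustrative assumptions, not the model's actual settings.

```python
import numpy as np

def frame_energy_db(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Per-frame RMS energy in dB (frame/hop in samples: 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop][:n_frames]
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0) on silence

# A full-scale sine has RMS 1/sqrt(2), i.e. about -3.01 dB
t = np.arange(16000) / 16000
energies = frame_energy_db(np.sin(2 * np.pi * 440 * t))
```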

Multi-head output

Emotion softmax, VAD regression, and vertical-specific state mapping from a single forward pass.

Drop-in integration

REST and WebSocket APIs. Python and JavaScript SDKs. Runs parallel to any STT provider.

On-premise available

Deploy on your infrastructure. Audio never leaves your VPC. SOC 2 Type II compliant.

800+ QPS per node

O(n) inference complexity via SSM architecture. Horizontal scaling with stateless workers.

ConversationPredictor

Forward prediction

A causal GRU that consumes per-utterance ProsodySSM outputs and predicts session-level outcomes at every timestep. 8 predictive heads, O(1) incremental updates, confidence scaling with sequence length.

Confidence scales with sequence length

c = min(1.0, 0.3 + 0.7 · n/W), where n is the number of utterances observed and W is the confidence window. Predictions sharpen as the GRU accumulates utterance history.
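The schedule above is a one-liner. W = 8 below mirrors the "utterance 1/8" counter in the demo; it is an assumption, not a documented default.

```python
def prediction_confidence(n_utterances: int, window: int = 8, floor: float = 0.3) -> float:
    """c = min(1.0, floor + (1 - floor) * n / W): linear ramp from floor to 1.0 over W utterances."""
    return min(1.0, floor + (1.0 - floor) * n_utterances / window)
```

With window=8, confidence starts at 0.3 before any utterances, reaches 0.65 after four, and saturates at 1.0 from the eighth utterance on.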

After 3 utterances · Confidence: Low
Escalation Risk: 42%
Predicted CSAT: 3.8/5.0
Churn Probability: 18%
Tone Recommendation: Monitor — maintain current approach

will_escalate

Binary sigmoid head. P(escalation) = 0.73 after 3 utterances. Supervised against session-level escalation labels.

final_csat

Regression head, range [1.0, 5.0]. Predicts final CSAT at every timestep. MSE loss with temporal weighting.

churn_risk

Binary sigmoid head. P(churn within 30d) derived from prosodic trajectory patterns. Trains on CRM outcome data.

recommended_tone

6-class softmax head. Outputs empathetic | calm | enthusiastic | professional | reassuring | apologetic.

Architecture

4-layer Mamba SSM with S4D diagonal state matrices. Prosodic and phonetic features fused into a 256-dim representation, processed with O(n) recurrence. Multi-head output: emotion softmax + VAD regression + vertical state mapping.
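A shape-level numpy sketch of the fusion and multi-head readout described above. The weights are random, the SSM layers are elided (a mean pool stands in), and the six-way emotion head is an assumption; only the 28 + 4 → 256 fusion and the head types come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EMOTIONS, D_FUSED = 6, 256  # emotion-class count is an assumption

W_fuse = rng.normal(0, 0.02, (28 + 4, D_FUSED))         # prosodic + phonetic -> fused
W_emotion = rng.normal(0, 0.02, (D_FUSED, N_EMOTIONS))  # emotion softmax head
W_vad = rng.normal(0, 0.02, (D_FUSED, 3))               # valence/arousal/dominance head

def forward(prosodic: np.ndarray, phonetic: np.ndarray):
    """prosodic: (T, 28), phonetic: (T, 4). SSM blocks elided; mean pool stands in."""
    fused = np.tanh(np.concatenate([prosodic, phonetic], axis=-1) @ W_fuse)  # (T, 256)
    pooled = fused.mean(axis=0)                                              # global pool
    logits = pooled @ W_emotion
    emotion_probs = np.exp(logits - logits.max())
    emotion_probs /= emotion_probs.sum()                 # softmax over emotion classes
    vad = 1 / (1 + np.exp(-(pooled @ W_vad)))            # squash regressions into [0, 1]
    return emotion_probs, vad

probs, vad = forward(rng.normal(size=(12, 28)), rng.normal(size=(12, 4)))
```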

Model Pipeline

Audio Input
Feature Extraction
Prosody
28 dimensions
Phonetic
4 dimensions
Fusion Layer (256d)
SSM 1
SSM 2
SSM 3
SSM 4
Global Pool
Emotion
Softmax
VAD
Regression
Prediction
Conv. Level

Feature Extraction

Prosodic Features
Pitch (F0): 185 Hz
Energy: -12.4 dB
Jitter: 1.2%
Shimmer: 3.8%
Spectral Features: 125 Hz – 8 kHz band
Temporal Features
4.2 syllables/sec · 0.34 s avg pause · 78% voiced ratio

Benchmark Results

73.5% · Weighted Acc (speaker-disjoint validation)
73.7% · Unweighted Acc (mean per-class recall)
2.1M · Parameters (S4D-Lin backbone)
18K+ · Training samples (4 corpora combined)

Trained on CREMA-D, RAVDESS, TESS, and Orpheus corpora with speaker-disjoint validation. ProsodySSM outperforms transformer baselines (+8.3% WA) while maintaining O(n) complexity via S4D-Lin initialization. Human inter-annotator agreement on SER is typically 60-70%.

Capabilities

Single model, vertical-adaptive output. Fine-tune on your distribution. Configure thresholds per deployment.

LoRA fine-tuning

Low-rank adapters on the SSM blocks. Train on your labeled data without retraining the base model. Outcome-weighted loss with active learning sample selection.

  • Two-phase training pipeline
  • 70/30 original/feedback mix ratio
  • Continuous improvement from production
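The 70/30 mix above can be sketched as a batch sampler. The function name and batch size are illustrative assumptions; only the mix ratio comes from the text.

```python
import random

def mixed_batch(original: list, feedback: list, batch_size: int = 10,
                mix: float = 0.7, seed: int = 0) -> list:
    """Draw a training batch with ~70% original-corpus and ~30% feedback samples."""
    rng = random.Random(seed)
    n_orig = round(batch_size * mix)
    return rng.sample(original, n_orig) + rng.sample(feedback, batch_size - n_orig)

batch = mixed_batch([f"orig_{i}" for i in range(100)],
                    [f"fb_{i}" for i in range(100)])  # 7 original + 3 feedback samples
```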

Alert thresholds

Define per-vertical alert rules via alert_thresholds in VerticalConfig. Webhook dispatch on threshold breach. Composable with any event bus.

  • Per-vertical threshold config
  • Webhook dispatch on trigger
  • Kafka topic routing
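A minimal sketch of what a VerticalConfig with alert_thresholds might look like and how breaches could dispatch. The field names beyond alert_thresholds, the metric names, and the check function are assumptions; a real config would come from the API.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VerticalConfig:
    vertical: str
    alert_thresholds: dict[str, float] = field(default_factory=dict)

def check_thresholds(cfg: VerticalConfig, metrics: dict[str, float],
                     dispatch: Callable[[str, float], None]) -> list[str]:
    """Fire dispatch (e.g. a webhook POST) for every metric at or above its threshold."""
    breached = [m for m, limit in cfg.alert_thresholds.items()
                if metrics.get(m, 0.0) >= limit]
    for name in breached:
        dispatch(name, metrics[name])
    return breached

cfg = VerticalConfig("contact_center", {"escalation_risk": 0.7, "churn_risk": 0.5})
fired: list[tuple[str, float]] = []
check_thresholds(cfg, {"escalation_risk": 0.73, "churn_risk": 0.22},
                 lambda name, value: fired.append((name, value)))
```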

Outcome feedback loop

Submit session outcomes (CSAT, escalation, churn) via feedback API. Active learning selects high-value samples. Model improves on your production distribution.

  • CRM outcome ingestion
  • Active learning selection
  • Contradiction-weighted retraining
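A sketch of assembling one outcome-feedback payload. The field names are assumptions based on the outcomes listed above (CSAT, escalation, churn), not the documented API schema.

```python
import json

def build_feedback(session_id: str, csat: float,
                   escalated: bool, churned_30d: bool) -> str:
    """Serialize one session outcome for the feedback API (field names assumed)."""
    if not 1.0 <= csat <= 5.0:
        raise ValueError("csat must be in [1.0, 5.0]")
    return json.dumps({
        "session_id": session_id,
        "outcomes": {
            "final_csat": csat,
            "escalated": escalated,
            "churned_30d": churned_30d,
        },
    })

payload = build_feedback("sess_8f2a1b", csat=2.5, escalated=True, churned_30d=False)
```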

Parallel ASR deployment

Runs as a sidecar to your STT pipeline. Consumes the same audio stream. Output aligns to word timestamps from any ASR provider.

  • Provider-agnostic
  • Word-level timestamp alignment
  • No pipeline changes required
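One simple way to do the word-level alignment described above: tag every ASR word whose midpoint falls inside the utterance span. The midpoint rule and the dict shapes are illustrative assumptions.

```python
def align_words(utterance: dict, words: list[dict]) -> list[dict]:
    """Attach the utterance-level emotion label to every ASR word whose
    midpoint falls inside the utterance span (timestamps in seconds)."""
    start, end = utterance["start"], utterance["end"]
    return [
        {**w, "emotion": utterance["emotion"]}
        for w in words
        if start <= (w["start"] + w["end"]) / 2 < end
    ]

utt = {"start": 0.0, "end": 2.4, "emotion": "frustrated"}
asr_words = [
    {"word": "yeah", "start": 0.1, "end": 0.4},
    {"word": "hi", "start": 0.5, "end": 0.7},
    {"word": "okay", "start": 2.5, "end": 2.9},  # belongs to the next utterance
]
aligned = align_words(utt, asr_words)
```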

Vertical taxonomies

8 predefined verticals with domain-specific state enums, metrics, and alert thresholds. Single base model, vertical-adaptive output via VerticalConfig.

  • Contact center, healthcare, sales, legal...
  • VAD-based state disambiguation
  • Per-vertical confidence thresholds

Voice AI agents

Stream prosodic signals to your LLM context window. The agent adapts tone and strategy based on real-time VAD vectors.
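A sketch of rendering the latest prosodic read as a line for an agent's system prompt. The exact formatting is an assumption; only the VAD vector, emotion, and vertical state come from the output schema.

```python
def prosody_context(vad: list[float], emotion: str, state: str) -> str:
    """Render the latest prosodic read as one line of LLM context."""
    v, a, d = vad
    return (f"[prosody] caller sounds {emotion} ({state}); "
            f"valence={v:.2f} arousal={a:.2f} dominance={d:.2f}")

line = prosody_context([0.3, 0.65, 0.58], "frustrated", "impatient")
```

An agent loop would prepend this line to the system prompt at each utterance boundary, letting the LLM adapt tone without any audio access of its own.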

Post-call analytics

Batch process recordings. Output emotion timelines, session-level predictions, and vertical-specific metrics for every conversation.

Real-time orchestration

Kafka event streaming on threshold breach. Route to supervisors, trigger script changes, or dispatch webhooks — all at utterance boundaries.

Quick start

Extract prosodic features. Run inference. Map to your vertical.

Python
from prosody import ProsodyClient

client = ProsodyClient(api_key="your-key")

result = client.analyze(
    audio_file="recording.wav",
    features=["emotion", "prosody"]
)

print(result.emotion)  # "happy"
print(result.valence)  # 0.72
print(result.arousal)  # 0.65
JavaScript
import { Prosody } from '@prosody/sdk';

const client = new Prosody({ apiKey: 'your-key' });

const result = await client.analyze({
  audio: audioBlob,
  features: ['emotion', 'prosody']
});

console.log(result.emotion);  // "happy"
console.log(result.valence);  // 0.72
console.log(result.arousal);  // 0.65
Platform

ProsodyCRM

Management plane for your ProsodySSM deployment. API keys, vertical configuration, transcript analysis, outcome feedback, and model fine-tuning.

API Keys

Generate API keys to integrate Prosody directly into your own apps, pipelines, or services. Full programmatic access.

  • REST & WebSocket APIs
  • Python & JS SDKs
  • Rate limits & usage tracking

Integrations

Connect to AWS Transcribe, Salesforce, HubSpot, Zendesk, and more. One-click OAuth setup.

  • Cloud storage sync
  • CRM connectors
  • Webhook events

Transcript Analysis

Upload recordings or sync from cloud storage. View emotion timelines, word-level annotations, and summaries.

  • Batch processing
  • Emotion timeline
  • Export to CRM

Custom Taxonomies

Define emotion states for your industry. Map base emotions to domain-specific labels with custom thresholds.

  • Industry presets
  • State mapping
  • Threshold tuning

Analytics

Track emotion trends over time. Monitor API usage, identify patterns, and export reports.

  • Usage metrics
  • Trend analysis
  • CSV/PDF export

Team Management

Invite team members with role-based permissions. Admin, member, and viewer roles available.

  • SSO support
  • Role permissions
  • Audit logs
Pro

Model Fine-tuning

Train custom LoRA adapters on your data. Upload labeled samples, fine-tune on GCP, and deploy your own model.

  • LoRA fine-tuning
  • Hosted on GCP
  • Your data, your model
Enterprise

Event Orchestration

Kafka-powered event streaming for real-time emotion pipelines. React to emotion events at scale with sub-100ms latency.

  • Kafka topics per event type
  • Dead letter queues
  • Event replay & audit

Start building

Free tier available. API keys provision instantly. Scale when you need throughput.

Get Started

Contact

Integration architecture, vertical configuration, on-premise deployment, or custom LoRA training. Reach out.

San Francisco, CA

Send us a message