How it works
- Attendee sends audio segments as raw PCM audio via HTTP POST to your configured endpoint
- Your service processes the audio and returns the transcription asynchronously
- The response must follow the expected format (see below)
Configuration
Set these environment variables on your Attendee server:
- CUSTOM_ASYNC_TRANSCRIPTION_URL (required): The full URL of your transcription endpoint (e.g., https://192.168.0.1/transcribe)
- CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT (optional): Request timeout in seconds (default: 120)
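The semantics of the two variables (one required, one optional with a default of 120) can be sketched as follows. This is only an illustration of how such settings are typically read, not Attendee's actual implementation; the function name load_transcription_config is hypothetical.

```python
import os
from typing import Mapping, Tuple

DEFAULT_TIMEOUT_SECONDS = 120  # default stated in the documentation


def load_transcription_config(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (url, timeout_seconds) from the two documented variables.

    Hypothetical helper: it only mirrors the documented required/optional
    semantics and is not taken from the Attendee code base.
    """
    url = env.get("CUSTOM_ASYNC_TRANSCRIPTION_URL", "").strip()
    if not url:
        raise ValueError("CUSTOM_ASYNC_TRANSCRIPTION_URL is required")

    raw_timeout = env.get("CUSTOM_ASYNC_TRANSCRIPTION_TIMEOUT", "").strip()
    timeout = int(raw_timeout) if raw_timeout else DEFAULT_TIMEOUT_SECONDS
    return url, timeout
```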
Expected API format
Your transcription service must accept a POST request with multipart/form-data containing:
- audio: The audio file (sent as raw PCM audio, 16-bit linear PCM)
- sample_rate: The sample rate of the audio file in Hz
- Any additional custom parameters you specify in transcription_settings
Audio format:
- Format: Raw PCM (Pulse Code Modulation)
- Sample width: 16-bit
- Encoding: linear16
- Sample rate: Depends on the meeting source (typically 16000 Hz or 32000 Hz)
- Channels: 1 (mono)
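Because the audio arrives as headerless PCM, your service has to interpret the bytes using the sample_rate field. The sketch below decodes a payload into samples and computes its duration from the format listed above. It assumes little-endian byte order, which the documentation does not state; the function names are hypothetical.

```python
import array
import sys

BYTES_PER_SAMPLE = 2  # 16-bit linear PCM
CHANNELS = 1          # mono


def decode_pcm16(raw: bytes) -> array.array:
    """Decode raw 16-bit mono PCM bytes into signed integer samples.

    Assumption: samples are little-endian (not confirmed by the docs).
    """
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("PCM payload length must be a multiple of 2 bytes")
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # convert little-endian input on big-endian hosts
    return samples


def duration_seconds(raw: bytes, sample_rate: int) -> float:
    """Length of the audio segment in seconds."""
    return len(raw) / (BYTES_PER_SAMPLE * CHANNELS * sample_rate)
```

For example, a 32000-byte payload with sample_rate 16000 holds 16000 samples, which is one second of audio.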
The response must contain the following fields:
- status: Must be "done" for successful transcription, or "error" for failures
- result.transcription.full_transcript: The complete transcription text
- result.transcription.utterances: Array of utterance objects
- result.transcription.utterances[].words: Array of word objects with timestamps
- result.transcription.utterances[].words[].word: The word text
- result.transcription.utterances[].words[].start: Start time in seconds
- result.transcription.utterances[].words[].end: End time in seconds
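A minimal sketch of assembling such a response, assuming the body is JSON (which the dotted field paths suggest but the text does not state). The helper names are hypothetical, and joining words with single spaces to form full_transcript is an illustrative choice. Whether an error response may carry extra fields is not documented, so the error helper returns only status.

```python
import json
from typing import Any, Dict, List

Word = Dict[str, Any]  # {"word": str, "start": float, "end": float}


def build_success_response(utterances: List[List[Word]]) -> Dict[str, Any]:
    """Build the documented structure from a list of utterances,
    where each utterance is a list of word dicts."""
    full_transcript = " ".join(
        word["word"] for words in utterances for word in words
    )
    return {
        "status": "done",
        "result": {
            "transcription": {
                "full_transcript": full_transcript,
                "utterances": [{"words": list(words)} for words in utterances],
            }
        },
    }


def build_error_response() -> Dict[str, Any]:
    """Only the status field is documented for failures."""
    return {"status": "error"}


example_response = build_success_response(
    [[
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
    ]]
)
response_body = json.dumps(example_response)  # assumed JSON serialization
```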
Usage example
When creating a bot, specify the custom_async_v2 provider in transcription_settings:
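A sketch of what such a settings object might look like, built as a Python dict and printed as JSON. The nesting (a custom_async_v2 object containing headers and form_data) is an assumption inferred from the key names in this section; the Authorization token and the language field are hypothetical placeholders, and other bot-creation fields are omitted.

```python
import json
from typing import Any, Dict, Mapping, Optional


def custom_async_v2_settings(
    headers: Optional[Mapping[str, str]] = None,
    form_data: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build the transcription_settings fragment of a bot-creation request.

    Assumed shape, not confirmed against the Attendee API reference.
    """
    provider_options: Dict[str, Any] = {}
    if headers:
        provider_options["headers"] = dict(headers)
    if form_data:
        provider_options["form_data"] = dict(form_data)
    return {"transcription_settings": {"custom_async_v2": provider_options}}


payload = custom_async_v2_settings(
    headers={"Authorization": "Bearer <your-token>"},  # placeholder value
    form_data={"language": "en"},                      # hypothetical parameter
)
print(json.dumps(payload, indent=2))
```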
Entries under headers are sent as HTTP request headers (e.g., for auth tokens), and entries under form_data are sent as multipart form fields alongside the audio file. You can add any custom parameters your service needs.
Minimal example (no custom parameters):
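A sketch of the minimal form, assuming that an empty custom_async_v2 object is enough to select the provider when no custom headers or form fields are needed. This shape is inferred, not confirmed.

```python
import json

# Assumed shape: an empty options object selects the provider
# without any custom headers or form fields.
minimal_settings = {"transcription_settings": {"custom_async_v2": {}}}
print(json.dumps(minimal_settings))
```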
Notes
- No credentials are needed in the Attendee dashboard
- Your service must respond asynchronously within the timeout period
- Audio is sent as raw PCM format (16-bit linear PCM, mono)
- The sample rate varies based on the meeting source (typically 16000 Hz or 32000 Hz)
- Word-level timestamps are supported if your service provides them
- You have full control over the transcription model, language detection, and processing