OpenAPI Integration Guide
This document is intended for enterprise developers and product managers. It covers the integration architecture, task flow, authentication, API interfaces, usage examples, and error handling for the ListenHub audio generation service.
I. System Architecture and Overall Process
ListenHub is built on a cloud-based audio generation service. Clients communicate through the OpenAPI Gateway, and multiple backend modules collaborate during the voice content generation process. The general workflow is as follows:
- Client request initiation: Applications call the ListenHub OpenAPI Gateway to create an audio generation task.
- Core business processing: After receiving the request, the backend distributes the task to the preprocessing, content generation, and multimodal synthesis engine modules. These modules work together to transform the input text into speech data.
- Real-time text data retrieval: The system provides APIs to obtain real-time streaming text data for each episode, including outlines and scripts. In Podcast mode, text stream generation has a 20–60 second delay, while in FlowSpeech mode, streaming text data is available about 3 seconds after the episode is created.
- Result delivery: Once audio generation is complete, the system uploads the result to cloud storage, and the client can retrieve it via query APIs.
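The workflow above can be sketched from the client side as "create, wait, then query until the audio is available". This is a minimal sketch under stated assumptions, not the actual ListenHub client: the `EpisodeClient` interface, its method names, and the rule that a non-empty `audioStreamUrl` signals completion are all hypothetical. Only the text-delay figures come from the description above.

```python
import time
from typing import Callable, Protocol

# Earliest expected availability of streaming text, in seconds after creation,
# per the figures above (Podcast: 20-60 s, FlowSpeech: about 3 s).
TEXT_READY_AFTER_S = {"podcast": 20, "flowspeech": 3}


class EpisodeClient(Protocol):
    """Hypothetical client interface; method names are assumptions."""

    def create_episode(self, text: str, mode: str, speaker_id: str) -> str: ...
    def get_episode(self, episode_id: str) -> dict: ...


def generate_audio(
    client: EpisodeClient,
    text: str,
    mode: str,
    speaker_id: str,
    sleep: Callable[[float], None] = time.sleep,
    poll_interval_s: float = 10.0,
    max_polls: int = 30,
) -> str:
    """Create an episode, then poll until an audio URL is available."""
    if mode not in TEXT_READY_AFTER_S:
        raise ValueError(f"unknown mode: {mode!r}")
    episode_id = client.create_episode(text, mode, speaker_id)
    # Nothing can be ready before the earliest text-availability time.
    sleep(TEXT_READY_AFTER_S[mode])
    for _ in range(max_polls):
        episode = client.get_episode(episode_id)
        # Assumption: a non-empty audioStreamUrl means generation finished.
        url = episode.get("audioStreamUrl")
        if url:
            return url
        sleep(poll_interval_s)
    raise TimeoutError(f"episode {episode_id} not ready after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A production client would also need the authentication and error handling that this sketch leaves out.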
Terminology:
- Episode: The basic content unit in the ListenHub system. All functionality is exposed to users at the episode level. Users can retrieve complete episode information, including audio, text scripts, and metadata, via a unique identifier, EpisodeId.
- Content Generation Modes:
  - Podcast Mode: Generates structured blog-style audio content, supporting single-host and dual-host broadcasting formats. It simulates the conversational style and rhythm of professional podcasts. Debate mode is planned for future release.
  - FlowSpeech Mode: An alternative generation mode that differs from Podcast mode in processing logic and output format.
- Speaker (Voice Profile): A core parameter of audio generation that defines the acoustic characteristics of an episode. Each speaker includes the following key attributes:
  - SpeakerId: The unique identifier of the voice profile
  - SpeakerName: The display name of the voice profile
  - Language: The language supported by the voice profile
When creating an episode, users must specify the speakerId parameter to select the target voice profile for content generation.
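As an illustration of the speaker attributes, the following sketch models a voice profile and picks a `speakerId` by language. The Python field names, the short language codes such as "en" and "zh", and the `pick_speaker_id` helper are assumptions made for this example. The actual response schema may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Speaker:
    speaker_id: str    # SpeakerId: unique identifier of the voice profile
    speaker_name: str  # SpeakerName: display name of the voice profile
    language: str      # Language: assumed here to be a short code such as "en"


def pick_speaker_id(speakers: Iterable[Speaker], language: str) -> str:
    """Return the id of the first speaker that supports the given language."""
    wanted = language.lower()
    for speaker in speakers:
        if speaker.language.lower() == wanted:
            return speaker.speaker_id
    raise LookupError(f"no speaker available for language {language!r}")
```

The returned value would then be passed as the speakerId parameter when creating an episode.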
The following diagram illustrates the ListenHub process:
sequenceDiagram
participant C as User
participant G as API Gateway
participant F as Generation Engine
participant S as Cloud Storage
Note over C, G: Episode Creation
C->>G: Create Episode API Call
G->>G: Validate Token/Quota/Rights
G->>F: Create Voice Task
F-->>G: Task Completed (episodeID)
G-->>C: Return Result (200 OK, episodeID)
F->>F: Process Text (<1min)
F->>F: Generate Voice (1-2min)
F->>S: Upload Audio File
S-->>F: Upload Confirmation
F->>C: Webhook Confirmation of Completion
Note over C, G: Episode Query - All Episode Information
C->>G: Query Episode Details API Call
G-->>C: Episode Related Information (audioStreamUrl + Episode Info)
C->>S: GET Audio URL (Play)
S-->>C: Return Audio Stream
Note over C, G: Episode Query - Text Streaming
C->>G: Query Episode Text Stream API Call
G-->>C: Return Episode Script/Outline Streaming Data
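For clients that poll instead of relying on the webhook, the stage durations in the diagram (text processing under 1 minute, voice generation 1 to 2 minutes) suggest a time budget for status queries. The sketch below derives a doubling poll schedule from those upper bounds. The safety margin, the first offset, and the doubling factor are illustrative choices, not values defined by ListenHub.

```python
# Upper bounds per stage, in seconds, read from the sequence diagram.
STAGE_UPPER_BOUNDS_S = {"process_text": 60, "generate_voice": 120}


def poll_offsets(first_s: int = 15, factor: int = 2, margin: float = 1.5) -> list:
    """Return poll times (seconds after creation) up to a padded deadline."""
    # Deadline = sum of the stage upper bounds, padded by a safety margin.
    deadline = sum(STAGE_UPPER_BOUNDS_S.values()) * margin
    offsets = []
    t = first_s
    while t <= deadline:
        offsets.append(t)
        t *= factor
    return offsets


print(poll_offsets())  # [15, 30, 60, 120, 240]
```

With the defaults, the deadline is 270 seconds, so the last query happens 240 seconds after creation. If the episode is still not complete at that point, the client could treat it as a timeout.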
II. OpenAPI Service Plans and Pricing
Feature Support:
ListenHub OpenAPI v1.0 provides comprehensive audio content generation capabilities with the following core features:
- Creation of episodes in Podcast mode and FlowSpeech mode
- Real-time episode query and status tracking
- Real-time retrieval of streaming text data (outlines and scripts)
Future versions will gradually support:
- Multi-format audio output (MP3/WAV/AAC, etc.)
- Professional script editing and optimization tools
- Enterprise-grade custom voice cloning authorization (up to 5 voices)
Pricing:
OpenAPI services are available only to Business and Enterprise plan users.