Chat with NPCs
Authentication and Real-Time Interactions
This document provides an in-depth explanation of the real-time communication flow. You will use:
- POST /token: For user authentication and acquiring a JWT access token.
- WebSocket /ws: For real-time, bidirectional communication supporting text- or audio-based interactions.
Overview
- Obtain an Access Token
  Call the POST /token endpoint with user credentials. The server responds with a Bearer token, which you use to authenticate other requests.
- Establish a WebSocket Connection
  - Open a WebSocket connection to wss://api.aarda.ai/ws.
  - On connection, immediately send a JSON message containing the api_token field with the token you obtained from /token.
- Initialize Session
  - Send an initialize message that can contain user_uuid (the unique identifier of the requesting user) and session_id (the unique identifier of a session belonging to that user_uuid).
  - The server creates or resumes a session and responds with an initialize_response that returns the server-confirmed user_uuid and session_id.
- Interact in Real Time
  - Send subsequent message or audio payloads, always including user_uuid and session_id.
  - Receive text (and optionally audio) responses in real time.
1. POST /token
The POST /token endpoint handles user authentication. It expects credentials (username, password) in the format of OAuth2PasswordRequestForm.
Endpoint
POST /token
Request Body
| Field | Type | Description |
|---|---|---|
| username | string | User's username |
| password | string | User's password (plaintext) |
Note: This is sent as form data, not JSON.
Example (cURL)
```bash
curl -X POST "http://localhost:8000/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "username=alice" \
  -d "password=verysecretpassword"
```
Example (JSON-like representation)
```json
{
  "username": "alice",
  "password": "verysecretpassword"
}
```
(In practice, you will submit this in x-www-form-urlencoded format.)
Response
A JSON object containing the token and token type:
| Field | Type | Description |
|---|---|---|
| access_token | string | The JWT used for future requests |
| token_type | string | Typically "bearer" |
Example Response
```json
{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5c...<snip>...T2_H0s",
  "token_type": "bearer"
}
```
Common Failure Scenarios
- 401 Unauthorized
  If the username or password is incorrect, you'll receive:
  ```json
  {
    "detail": "Incorrect username or password"
  }
  ```
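As a minimal sketch, the request above could be made from JavaScript with the standard fetch and URLSearchParams globals (modern browsers, Node.js 18+). The helper names buildTokenRequest and fetchToken are hypothetical; only the form fields, the response fields, and the 401 body come from this page.

```javascript
// Sketch: obtain a JWT from POST /token.
// buildTokenRequest and fetchToken are hypothetical helper names.

// Builds fetch() options for an OAuth2 password-form request.
function buildTokenRequest(username, password) {
  // The endpoint expects x-www-form-urlencoded fields, not JSON;
  // URLSearchParams takes care of percent-encoding.
  const body = new URLSearchParams({ username, password });
  return {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: body.toString(),
  };
}

// Calls POST /token and returns the token, or throws on failure.
async function fetchToken(baseUrl, username, password) {
  const res = await fetch(`${baseUrl}/token`, buildTokenRequest(username, password));
  if (res.status === 401) {
    // Documented failure body: {"detail": "Incorrect username or password"}
    const { detail } = await res.json();
    throw new Error(`Authentication failed: ${detail}`);
  }
  if (!res.ok) {
    throw new Error(`Unexpected status from /token: ${res.status}`);
  }
  const { access_token, token_type } = await res.json();
  return { accessToken: access_token, tokenType: token_type };
}
```

Usage: `const { accessToken } = await fetchToken("http://localhost:8000", "alice", "verysecretpassword");`. Letting URLSearchParams do the encoding avoids broken requests when a password contains characters such as `&` or `@`.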
2. WebSocket /ws
Once you have an access token, you can establish a WebSocket connection.
Important: This flow requires you to pass user_uuid and session_id with each message after initialization so the server can track session context.
Connection Steps
- Connect
  Open a WebSocket connection to wss://api.aarda.ai/ws.
- Send Token
  Immediately upon connection, send a JSON message containing your api_token (the JWT you got from /token):
  ```json
  {
    "api_token": "<YOUR_JWT_HERE>"
  }
  ```
  Example (JavaScript):
  ```javascript
  const socket = new WebSocket("wss://api.aarda.ai/ws");
  socket.onopen = () => {
    const initData = JSON.stringify({
      api_token: "eyJhbGciOiJIUzI1NiIsInR5c...",
    });
    socket.send(initData);
  };
  ```
- Authenticate
  The server verifies your token. If it is valid, the connection remains open. Otherwise, the server closes the connection with close code 1008 (policy violation).
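The three connection steps can be wrapped in one function. This is a sketch under assumptions: openAuthenticatedSocket, onAuthFailure, and onClose are hypothetical names, and the WebSocketImpl parameter exists only so the function can run outside a browser (for example with a test double); by default it uses the global WebSocket.

```javascript
// Sketch: connect, send the token first, and detect a 1008 rejection.
// openAuthenticatedSocket and its option names are hypothetical.
const POLICY_VIOLATION = 1008; // close code used when the token is rejected

function openAuthenticatedSocket(
  url,
  token,
  { onAuthFailure = () => {}, onClose = () => {}, WebSocketImpl = globalThis.WebSocket } = {},
) {
  const socket = new WebSocketImpl(url);

  // The token must be the very first message on the connection.
  socket.onopen = () => {
    socket.send(JSON.stringify({ api_token: token }));
  };

  // 1008 means the api_token was missing or invalid; any other
  // close code is passed to the caller's onClose handler.
  socket.onclose = (event) => {
    if (event.code === POLICY_VIOLATION) {
      onAuthFailure(event);
    } else {
      onClose(event);
    }
  };

  return socket;
}
```

Separating the 1008 case lets a client ask for fresh credentials instead of reconnecting with a token the server has already rejected.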
Subsequent Messages
After successful authentication, you can send/receive multiple types of JSON messages:
2.1 - initialize Message
The first significant message after the token is verified has type = "initialize". It sets up (or resumes) your session context.
Request:
```json
{
  "type": "initialize",
  "user_uuid": "abc123",
  "mood": "friendly",
  "characterId": "123",
  "playerId": "456",
  "sceneId": "789",
  "forcedKnowledgeBricks": ["Some knowledge..."],
  "overrideQSystemPrompt": null,
  "overrideNpcChatPrompt": null,
  "overridePlayerChatPrompt": null,
  "overrideNpcPrompt": null,
  "overridePlayerPrompt": null,
  "audioSupport": true,
  "language": "en-US",
  "audioFormat": "pcm_22050"
}
```
user_uuid is your unique user ID. audioFormat accepts either "pcm_22050" or "mp3_44100_32". To resume an existing session, also include its session_id.
Server Response (JSON):
```json
{
  "type": "text",
  "source": "initialize_response",
  "user_uuid": "abc123",
  "session_id": 42
}
```
- If no user_uuid is provided, the server generates a new one.
- Save both user_uuid and session_id, because subsequent messages must include them.
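Because every later payload must repeat the server-confirmed identifiers, it helps to keep them in one place. The NpcSession class below is a hypothetical client-side helper, not part of the API: it records user_uuid and session_id from the initialize_response frame and stamps them onto outgoing payloads, refusing to do so before initialization has completed.

```javascript
// Sketch: hypothetical client-side store for the identifiers returned
// in initialize_response.
class NpcSession {
  constructor() {
    this.userUuid = null;
    this.sessionId = null;
  }

  // Call with every parsed JSON frame; only initialize_response is used.
  handle(msg) {
    if (msg.type === "text" && msg.source === "initialize_response") {
      this.userUuid = msg.user_uuid;
      this.sessionId = msg.session_id;
    }
  }

  // True once the server has confirmed both identifiers.
  get ready() {
    return this.userUuid !== null && this.sessionId !== null;
  }

  // Returns a copy of a "message" or "audio" payload with the ids added.
  stamp(payload) {
    if (!this.ready) {
      throw new Error("Send initialize and wait for initialize_response first");
    }
    return { ...payload, user_uuid: this.userUuid, session_id: this.sessionId };
  }
}
```

Always taking the ids from the server response (rather than reusing what the client sent) also covers the case where the server generated a new user_uuid.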
2.2 - Subsequent Messages
After initialization, you can send messages of type = "message" or type = "audio".
Important: Each payload must contain the same user_uuid and session_id given (or returned) during the initialization phase.
Text-Based Message
Request:
```json
{
  "type": "message",
  "user_uuid": "abc123",
  "session_id": 42,
  "message": "Hello, how are you today?"
}
```
Server Response (Text - JSON):
```json
{
  "type": "text",
  "source": "text_response",
  "response": "Hello! I'm doing well, thank you for asking.",
  "flags_player": [],
  "flags_character": [],
  "immediate_emotion": "<emotion>",
  "accumulated_emotion": "<emotion>",
  "tokens_spent": 42
}
```
- response: The AI-generated text output.
- flags_player: Context flags triggered by the player's message.
- flags_character: Context flags triggered by the character's message.
- immediate_emotion: The NPC's emotion at the moment of this response.
- accumulated_emotion: The NPC's overall emotion, accumulated across all messages in the conversation.
- tokens_spent: How many tokens this message consumed.
If you have audioSupport enabled, the server will also deliver a separate binary frame containing the audio version of the above text.
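When audioSupport is enabled, a client that wants to show the text and play the matching audio together has to associate the two frames. A minimal sketch, assuming the binary frame arrives right after its text_response on the same socket (as in the example flow in this document); ResponsePairer is a hypothetical name.

```javascript
// Sketch: pair each text_response with the binary audio frame that follows it.
// Assumes audio frames arrive after their text frame, in order.
class ResponsePairer {
  constructor(audioSupport, onResponse) {
    this.audioSupport = audioSupport; // mirrors the initialize flag
    this.onResponse = onResponse; // called with { text, audio }
    this.pending = null; // text_response still waiting for its audio
  }

  // Call with each parsed JSON frame; other sources are ignored.
  handleText(msg) {
    if (msg.source !== "text_response") return;
    if (!this.audioSupport) {
      this.onResponse({ text: msg, audio: null });
      return;
    }
    if (this.pending !== null) {
      // The previous reply never got audio; deliver it without.
      this.onResponse({ text: this.pending, audio: null });
    }
    this.pending = msg;
  }

  // Call with each binary frame (Blob or ArrayBuffer).
  handleBinary(data) {
    if (this.pending === null) return; // no text to attach it to
    this.onResponse({ text: this.pending, audio: data });
    this.pending = null;
  }
}
```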
Audio-Based Message
Request (JSON):
```json
{
  "type": "audio",
  "user_uuid": "abc123",
  "session_id": 42,
  "encoding": "audio/webm",
  "data": "<base64_or_binary_payload>"
}
```
- encoding: The format of your audio data, either "audio/webm" or "audio/pcm". If set to "audio/webm", the server expects raw binary WEBM_OPUS data. Otherwise, it expects LINEAR16 data.
- data: The actual audio payload (base64 string or raw bytes).
- Sample rate: Always 48000 Hz.
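Browsers typically expose microphone audio as 32-bit float samples (for example, Web Audio buffers are Float32Array), so sending "audio/pcm" means converting them to 16-bit signed little-endian PCM (LINEAR16) and base64-encoding the bytes for the JSON data field. A sketch, assuming mono input already sampled at 48000 Hz (resampling is not shown); floatTo16BitPCM, bytesToBase64, and buildPcmAudioMessage are hypothetical names.

```javascript
// Sketch: build an "audio/pcm" message from Float32 samples in [-1, 1].
// Assumes mono audio already captured at 48000 Hz.

// Converts float samples to 16-bit signed little-endian PCM (LINEAR16).
function floatTo16BitPCM(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clip out-of-range values
    // Scale to the int16 range; `true` selects little-endian byte order.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(view.buffer);
}

// Base64-encodes raw bytes with the global btoa (browsers, Node.js 16+).
function bytesToBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Assembles the documented audio payload.
function buildPcmAudioMessage(userUuid, sessionId, samples) {
  return {
    type: "audio",
    user_uuid: userUuid,
    session_id: sessionId,
    encoding: "audio/pcm",
    data: bytesToBase64(floatTo16BitPCM(samples)),
  };
}
```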
Server Steps:
- Decodes the audio data.
- Runs speech-to-text to convert it into text.
- Returns the recognized text as:
  ```json
  {
    "type": "text",
    "source": "transcription",
    "response": "Hi there!"
  }
  ```
- Processes that recognized text via the chat logic, returning the standard chat response (text plus optional audio).
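On the client, the frames produced by these steps can be told apart by their JSON source field, while the synthesized audio arrives as a binary frame. A minimal dispatcher sketch; classifyFrame and the returned kind labels are hypothetical, and only the source values and field names come from this page.

```javascript
// Sketch: classify an incoming WebSocket frame.
// Text frames are JSON strings; audio arrives as a binary frame.
function classifyFrame(data) {
  if (typeof data !== "string") {
    // A Blob in browsers by default, or an ArrayBuffer if binaryType = "arraybuffer".
    return { kind: "audio", payload: data };
  }
  const msg = JSON.parse(data);
  switch (msg.source) {
    case "initialize_response":
      return { kind: "session", payload: { userUuid: msg.user_uuid, sessionId: msg.session_id } };
    case "transcription":
      // What the server recognized in your audio message.
      return { kind: "transcription", payload: msg.response };
    case "text_response":
      // The NPC's reply, including flags and emotions.
      return { kind: "reply", payload: msg };
    default:
      return { kind: "unknown", payload: msg };
  }
}
```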
Example Flow:
```mermaid
sequenceDiagram
    participant Client
    participant Server
    Client->>Server: Connect to wss://.../ws
    Note over Server: Wait for first message
    Client->>Server: {"api_token": "..."}
    Server->>Server: Validates token
    Server->>Client: Connection accepted
    Client->>Server: {"type": "initialize", ...}
    Note over Server: Setup session with given parameters
    Server->>Client: {"type":"text", "source":"initialize_response", "user_uuid":"abc123", "session_id":42}
    Client->>Server: {"type": "audio", "encoding": "audio/webm", "data": "...", "user_uuid":"abc123", "session_id":42}
    Server->>Server: Transcribe audio
    Server->>Client: {"type":"text", "source":"transcription", "response":"Hi there!"}
    Server->>Server: Process "Hi there!" in the conversation
    Server->>Client: {"type":"text", "source":"text_response", "response":"Hello back!"}
    Server->>Client: <binary audio data>
```
Error and Disconnection Handling
- If the api_token is missing or invalid, the server closes the WebSocket with code 1008.
- If any uncaught exception occurs, the server tries to close the connection gracefully.
- Handle onclose or onerror events on the client side to know when the connection has been dropped.
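When the connection drops, a client may want to reconnect automatically, but retrying after a 1008 close only repeats the same rejected token. A sketch of one possible policy, not a server requirement: reconnect with capped exponential backoff unless the close was normal (1000) or a policy violation (1008). The function names, the 500 ms base delay, and the 30 s cap are illustrative assumptions. Since initialize can carry a session_id, the new connection can resume the previous session by resending api_token and then initialize with the saved identifiers.

```javascript
// Sketch of a client-side reconnect policy; the numbers are illustrative.
// 1000 = normal closure, 1008 = policy violation (missing or invalid api_token).
const NO_RETRY_CODES = new Set([1000, 1008]);

// Reconnecting after 1008 with the same token would be rejected again.
function shouldReconnect(closeCode) {
  return !NO_RETRY_CODES.has(closeCode);
}

// Delay before reconnect attempt n (0-based): baseMs * 2^n, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// First two messages for the new connection: token, then a resuming initialize.
// initOptions should repeat the original initialize fields (characterId, etc.).
function resumeMessages(token, userUuid, sessionId, initOptions = {}) {
  return [
    { api_token: token },
    { ...initOptions, type: "initialize", user_uuid: userUuid, session_id: sessionId },
  ];
}
```

In an onclose handler, a client would check shouldReconnect(event.code) and, if it returns true, schedule the next connection with setTimeout and backoffDelay(attempt), resetting attempt once initialize_response arrives.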
FAQ / Common Pitfalls
- Where do I pass the JWT for normal HTTP endpoints?
  You normally include it in the Authorization header as Bearer <ACCESS_TOKEN>.
- Does the server automatically store conversation state?
  Yes. The server uses user_uuid, characterId, and playerId as keys to store the conversation state. When the same user_uuid, characterId, and playerId are used again, the server picks up the existing conversation state.
- How do I change the voice or audio settings?
  Send them in the initialize message:
  ```json
  {
    "audioSupport": true,
    "language": "en-US",
    "audioFormat": "pcm_22050"
  }
  ```
  The server picks up your preferences for text-to-speech generation.
- What if I only want text responses (no audio)?
  Set "audioSupport": false in the initialize message.
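The first FAQ answer can be sketched as a small helper. withBearer is a hypothetical name, and the global Headers class is assumed (modern browsers, Node.js 18+); it merges the Authorization header into existing fetch options without dropping other headers.

```javascript
// Sketch: add "Authorization: Bearer <token>" to fetch() options.
function withBearer(token, init = {}) {
  // Headers accepts undefined, a plain object, or another Headers instance.
  const headers = new Headers(init.headers);
  headers.set("Authorization", `Bearer ${token}`);
  return { ...init, headers };
}
```

Usage: `fetch(url, withBearer(accessToken, { method: "GET" }))`, where url is whichever HTTP endpoint you are calling.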
Conclusion
The /token endpoint and the /ws WebSocket endpoint together cover authentication and real-time conversation with the FastAPI backend. Use /token to retrieve a JWT, then open a WebSocket connection to /ws for an interactive session that supports both text and audio messages.
Example: Full WebSocket Client Flow (Pseudocode)
Below is a simple example (in pseudocode/JavaScript) illustrating how you might connect, initialize, send messages, and receive text+audio:
```javascript
// 1. Get a token from POST /token (assume you already have `accessToken`).
// 2. Connect the WebSocket.
const socket = new WebSocket("wss://api.aarda.ai/ws");

// Filled in from the server's initialize_response.
let userUuid = null;
let sessionId = null;

socket.onopen = () => {
  // Immediately send api_token; it must be the first message.
  socket.send(JSON.stringify({ api_token: accessToken }));

  // 3. Initialize the session. send() only works once the socket is open,
  // so this runs here rather than at the top level.
  socket.send(
    JSON.stringify({
      type: "initialize",
      user_uuid: "abc123",
      mood: "excited",
      characterId: 12,
      playerId: 43,
      sceneId: 0,
      audioSupport: true,
      language: "en-US",
      audioFormat: "mp3_44100_32",
    }),
  );
};

socket.onmessage = (event) => {
  if (typeof event.data === "string") {
    // JSON-encoded text message
    const jsonMessage = JSON.parse(event.data);
    if (jsonMessage.source === "initialize_response") {
      // Save these for all future messages
      userUuid = jsonMessage.user_uuid;
      sessionId = jsonMessage.session_id;
      console.log("Session initialized:", userUuid, sessionId);
      sendText("Hello, what's happening?");
    } else if (jsonMessage.source === "text_response") {
      console.log("AI responded:", jsonMessage.response);
    } else if (jsonMessage.source === "transcription") {
      console.log("Audio transcribed:", jsonMessage.response);
    }
  } else {
    // Binary frame -> audio; "audio/mpeg" matches audioFormat "mp3_44100_32".
    const audioBlob = new Blob([event.data], { type: "audio/mpeg" });
    const audioURL = URL.createObjectURL(audioBlob);
    const audioElement = new Audio(audioURL);
    audioElement.onended = () => URL.revokeObjectURL(audioURL);
    audioElement.play().catch(console.error); // autoplay may be blocked
  }
};

// 4. Send a text message (only after initialize_response has arrived).
function sendText(text) {
  socket.send(
    JSON.stringify({
      type: "message",
      user_uuid: userUuid,
      session_id: sessionId,
      message: text,
    }),
  );
}

// 5. Or send an audio message: base64 data, encoding "audio/pcm" or "audio/webm".
function sendAudio(base64Data, encoding = "audio/pcm") {
  socket.send(
    JSON.stringify({
      type: "audio",
      user_uuid: userUuid,
      session_id: sessionId,
      encoding,
      data: base64Data,
    }),
  );
}

// Example call with a short placeholder payload:
// sendAudio("UklGRnQAAABXQVZFZm10IBIAAAABAAEAQB8AAIA+AAACABAAZGF0YQAAAAA=");
```
Conclusion
With user_uuid and session_id included in each message, the application can maintain conversations per user and session in real time. Always:
- Obtain your JWT via /token.
- Open a WebSocket connection to /ws.
- Immediately provide your api_token.
- Use type = "initialize" to start or resume a session (providing user_uuid and any known session_id).
- For each text or audio message, include the same user_uuid and session_id.
- Receive text (and optional audio) responses from the server, maintaining context across messages.