
Chat with NPCs

Authentication and Real-Time Interactions

This document provides an in-depth explanation of the real-time communication flow. You will use:

  1. POST /token – For user authentication and acquiring a JWT access token.
  2. WebSocket /ws – For real-time, bi-directional communication supporting text or audio-based interactions.

Overview

  1. Obtain an Access Token
    Call the POST /token endpoint with user credentials. The server will respond with a Bearer token, which you can use to authenticate other requests.

  2. Establish a WebSocket Connection

    • Open a WebSocket connection to wss://api.aarda.ai/ws.
    • On connection, immediately send a JSON message containing the api_token field with the token you obtained from /token.
  3. Initialize Session

    • Send an initialize message that can contain user_uuid (the unique identifier of the requesting user) and session_id (the unique identifier of a session belonging to the user_uuid).
    • The server will create or resume a session and respond with an initialize_response that returns the server-confirmed user_uuid and session_id.
  4. Interact in Real-Time

    • Send subsequent message or audio payloads, always including user_uuid and session_id.
    • Receive text (and optionally audio) responses in real-time.

1. POST /token

The POST /token endpoint handles user authentication. It expects credentials (username and password) as OAuth2PasswordRequestForm data, i.e. submitted as application/x-www-form-urlencoded.

Endpoint

POST /token

Request Body

Field | Type | Description
username | string | User's username
password | string | User's password (plaintext)

Note: This is sent as form data, not JSON.

Example (cURL)

curl -X POST "http://localhost:8000/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "username=alice" \
-d "password=verysecretpassword"

Example (JSON-like representation)

{
  "username": "alice",
  "password": "verysecretpassword"
}

(In practice, you will submit this in x-www-form-urlencoded format.)

Response

A JSON object containing the token and token type:

Field | Type | Description
access_token | string | The JWT used for future requests
token_type | string | Typically "bearer"

Example Response

{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5c...<snip>...T2_H0s",
  "token_type": "bearer"
}

Common Failure Scenarios

  • 401 Unauthorized
    If the username or password is incorrect, you'll receive:
    {
      "detail": "Incorrect username or password"
    }
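
Putting this together in client code, a minimal JavaScript sketch might look like the following (the local URL mirrors the cURL example above; adapt it to your deployment):

// Hedged sketch: request a token with form-encoded credentials and surface a 401.
async function getAccessToken(username, password) {
  const form = new URLSearchParams({ username, password });
  const res = await fetch("http://localhost:8000/token", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: form,
  });
  if (res.status === 401) {
    const { detail } = await res.json();
    throw new Error(detail); // "Incorrect username or password"
  }
  return res.json(); // { access_token, token_type }
}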

2. WebSocket /ws

Once you have an access token, you can establish a WebSocket connection.
Important: You must pass user_uuid and session_id with each message so the server can track session context.

Connection Steps

  1. Connect
    Open a WebSocket connection to:

    wss://api.aarda.ai/ws
  2. Send Token
    Immediately upon connection, send a JSON message containing your api_token (the JWT you got from /token):

    {
      "api_token": "<YOUR_JWT_HERE>"
    }

    Example (JavaScript):

    const socket = new WebSocket("wss://api.aarda.ai/ws");

    socket.onopen = () => {
      const initData = JSON.stringify({
        api_token: "eyJhbGciOiJIUzI1NiIsInR5c...",
      });
      socket.send(initData);
    };
  3. Authenticate
    The server verifies your token. If valid, the connection remains open. Otherwise, it closes the connection with a 1008 policy violation code.

Subsequent Messages

After successful authentication, you can send/receive multiple types of JSON messages:


2.1 - initialize Message

The first significant message after the token is verified has type = "initialize". It sets up (or resumes) your session context.

Request:

{
  "type": "initialize",
  "user_uuid": "abc123", // Your unique user ID
  "mood": "friendly",
  "characterId": "123",
  "playerId": "456",
  "sceneId": "789",
  "forcedKnowledgeBricks": ["Some knowledge..."],
  "overrideQSystemPrompt": null,
  "overrideNpcChatPrompt": null,
  "overridePlayerChatPrompt": null,
  "overrideNpcPrompt": null,
  "overridePlayerPrompt": null,
  "audioSupport": true,
  "language": "en-US",
  "audioFormat": "pcm_22050" // or "mp3_44100_32"
}

Server Response (JSON):

{
  "type": "text",
  "source": "initialize_response",
  "user_uuid": "abc123",
  "session_id": 42
}
  • If no user_uuid is provided, the server will generate a new one.
  • Save both user_uuid and session_id because subsequent messages must include them.
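
For example, a client that wants to resume an earlier conversation can echo the saved identifiers back in the initialize message. The sketch below is illustrative only: it assumes the authenticated socket from the connection steps, uses sessionStorage as one possible place to keep the IDs, and omits the character/player/audio fields shown in the example above.

// Hedged sketch: persist the server-confirmed IDs and reuse them to resume a session.
socket.addEventListener("message", (event) => {
  if (typeof event.data !== "string") return;
  const msg = JSON.parse(event.data);
  if (msg.source === "initialize_response") {
    sessionStorage.setItem("user_uuid", msg.user_uuid);
    sessionStorage.setItem("session_id", String(msg.session_id));
  }
});

// On a later visit, send back whatever was saved (omit user_uuid to get a fresh one).
const savedUuid = sessionStorage.getItem("user_uuid");
const savedSession = sessionStorage.getItem("session_id");
socket.send(JSON.stringify({
  type: "initialize",
  ...(savedUuid ? { user_uuid: savedUuid } : {}),
  ...(savedSession ? { session_id: Number(savedSession) } : {}),
  audioSupport: false,
  language: "en-US",
}));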

2.2 - Subsequent Messages

After initialization, you can send messages of type = "message" or type = "audio".
Important: Each payload must contain the same user_uuid and session_id given (or returned) during the initialization phase.

Text-Based Message

Request:

{
  "type": "message",
  "user_uuid": "abc123",
  "session_id": 42,
  "message": "Hello, how are you today?"
}

Server Response (Text - JSON):

{
  "type": "text",
  "source": "text_response",
  "response": "Hello! I'm doing well, thank you for asking.",
  "flags_player": [],
  "flags_character": [],
  "immediate_emotion": "<emotion>",
  "accumulated_emotion": "<emotion>",
  "tokens_spent": 42
}
  • response: The AI's text output.
  • flags_player: Flags for context triggered by the player's message.
  • flags_character: Flags for context triggered by the character's message.
  • immediate_emotion: The NPC's emotion at the moment of this response.
  • accumulated_emotion: The NPC's overall emotion, accumulated across all messages in the session.
  • tokens_spent: How many tokens this message consumed.

If you have audioSupport enabled, the server will also deliver a separate binary frame containing the audio version of the above text.
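
As a concrete illustration, a small helper for sending a text turn plus a handler that reads the fields above might look like this (a sketch only; it assumes the authenticated socket and the IDs returned by initialize_response):

// Hedged sketch: send one text turn and inspect the text_response fields.
function sendChatMessage(socket, userUuid, sessionId, text) {
  socket.send(JSON.stringify({
    type: "message",
    user_uuid: userUuid,
    session_id: sessionId,
    message: text,
  }));
}

socket.addEventListener("message", (event) => {
  if (typeof event.data !== "string") return; // binary frames carry audio
  const msg = JSON.parse(event.data);
  if (msg.source === "text_response") {
    console.log("NPC:", msg.response);
    console.log("emotions:", msg.immediate_emotion, "/", msg.accumulated_emotion);
    console.log("flags:", msg.flags_player, msg.flags_character);
    console.log("tokens spent:", msg.tokens_spent);
  }
});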

Audio-Based Message

Request (JSON):

{
  "type": "audio",
  "user_uuid": "abc123",
  "session_id": 42,
  "encoding": "audio/webm", // or "audio/pcm"
  "data": "<base64_or_binary_payload>"
}
  • encoding – The format of your audio data. If set to "audio/webm", the server expects raw binary WEBM_OPUS data. Otherwise, it expects LINEAR16 data.
  • data – The actual audio payload (base64 string or raw bytes).
  • Sample rate – Always 48000 Hz.

Server Steps:

  1. Decodes the audio data.
  2. Runs speech-to-text to convert it into text.
  3. Returns the recognized text as:
    {
      "type": "text",
      "source": "transcription",
      "response": "Hi there!"
    }
  4. Processes that recognized text via the chat logic, returning the standard chat response (text + optional audio).
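
In a browser, one way to produce a compatible payload is to capture microphone audio with MediaRecorder as WEBM_OPUS and send it base64-encoded. This is a hedged sketch, not the only valid approach; the 48000 Hz constraint follows the sample-rate note above, and the three-second recording window is arbitrary:

// Hedged sketch: capture ~3 seconds of microphone audio as WEBM_OPUS,
// base64-encode it, and send it as an "audio" message.
async function sendAudioTurn(socket, userUuid, sessionId) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { sampleRate: 48000 }, // the server expects 48000 Hz
  });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = async () => {
    const blob = new Blob(chunks, { type: "audio/webm" });
    const buffer = await blob.arrayBuffer();
    // base64-encode the raw bytes (fine for short clips)
    const base64 = btoa(String.fromCharCode(...new Uint8Array(buffer)));
    socket.send(JSON.stringify({
      type: "audio",
      user_uuid: userUuid,
      session_id: sessionId,
      encoding: "audio/webm",
      data: base64,
    }));
    stream.getTracks().forEach((t) => t.stop());
  };
  recorder.start();
  setTimeout(() => recorder.stop(), 3000); // record for ~3 seconds
}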

Example Flow:

sequenceDiagram
  participant Client
  participant Server

  Client->>Server: Connect to wss://.../ws
  Note over Server: Wait for first message
  Client->>Server: {"api_token": "..."}
  Server->>Server: Validates token
  Server->>Client: Connection accepted
  Client->>Server: {"type": "initialize", ...}
  Note over Server: Setup session with given parameters
  Server->>Client: {"type":"text", "source":"initialize_response", "user_uuid":"abc123", "session_id":42}
  Client->>Server: {"type": "audio", "encoding": "audio/webm", "data": "...", "user_uuid":"abc123", "session_id":42}
  Server->>Server: Transcribe audio
  Server->>Client: {"type":"text", "source":"transcription", "response":"Hi there!"}
  Server->>Server: Process "Hi there!" in the conversation
  Server->>Client: {"type":"text", "source":"text_response", "response":"Hello back!"}
  Server->>Client: <binary audio data>

Error and Disconnection Handling

  • If the api_token is missing or invalid, the server closes the WebSocket with code 1008.
  • If any uncaught exception occurs, the server tries to close the connection gracefully.
  • Handle onclose or onerror events on the client side to know when the connection has been dropped.
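
A hedged client-side sketch for these cases (connectAndInitialize is a hypothetical helper that re-runs the connect, authenticate, and initialize steps; the retry policy is illustrative, not prescribed by the API):

// Hedged sketch: react to dropped or rejected connections.
socket.onclose = (event) => {
  if (event.code === 1008) {
    // Policy violation: the api_token was missing or invalid; fetch a new JWT before retrying.
    console.warn("Authentication rejected, refresh the token via POST /token");
  } else {
    // Any other closure: an illustrative fixed back-off before reconnecting.
    console.warn(`Connection closed (code ${event.code}), retrying in 5 s`);
    setTimeout(connectAndInitialize, 5000); // hypothetical reconnect helper
  }
};

socket.onerror = (err) => {
  console.error("WebSocket error:", err); // onclose will usually follow
};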

FAQ / Common Pitfalls

  1. Where do I pass the JWT token for normal HTTP endpoints?
    You normally include it in the Authorization header as Bearer <ACCESS_TOKEN>.
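
    For instance, with fetch (the endpoint path below is purely illustrative):

    const res = await fetch("https://api.aarda.ai/some-protected-endpoint", {
      headers: { Authorization: `Bearer ${accessToken}` }, // accessToken from POST /token
    });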

  2. Does the server automatically store conversation state?
    Yes. The server uses user_uuid, characterId, and playerId as the index for conversation state. When the same combination is used again, the server picks up the existing conversation.

  3. How do I change the voice or audio settings?
    Send them in the initialize message:

    {
      "audioSupport": true,
      "language": "en-US",
      "audioFormat": "pcm_22050"
    }

    The server picks up your preferences for text-to-speech generation.

  4. What if I only want text responses (no audio)?
    Set "audioSupport": false in the initialize message.


Summary

The /token endpoint and /ws WebSocket endpoint form a powerful duo for authentication and real-time conversational workflows in your FastAPI application. Use /token to retrieve a secure JWT, and establish a WebSocket connection to /ws for an interactive session supporting both text and audio messages.


Example: Full WebSocket Client Flow (Pseudocode)

Below is a simple example (in pseudocode/JavaScript) illustrating how you might connect, initialize, send messages, and receive text+audio:

// 1. Get token from /token (assume you already have 'accessToken')

// 2. Connect WebSocket
const socket = new WebSocket("wss://api.aarda.ai/ws");

socket.onopen = () => {
  // Immediately send api_token
  socket.send(JSON.stringify({ api_token: accessToken }));
};

socket.onmessage = (event) => {
  if (typeof event.data === "string") {
    // JSON-encoded text message
    const jsonMessage = JSON.parse(event.data);
    if (jsonMessage.source === "initialize_response") {
      // Save these for future messages
      window.myUserUuid = jsonMessage.user_uuid;
      window.mySessionId = jsonMessage.session_id;
      console.log("Session initialized. User UUID & Session ID set.");
    } else if (jsonMessage.source === "text_response") {
      console.log("AI responded:", jsonMessage.response);
    } else if (jsonMessage.source === "transcription") {
      console.log("Audio transcribed:", jsonMessage.response);
    }
  } else {
    // Binary data -> audio (this example requests "mp3_44100_32", so treat frames as MP3)
    const audioBlob = new Blob([event.data], { type: "audio/mpeg" });
    const audioURL = URL.createObjectURL(audioBlob);
    const audioElement = new Audio(audioURL);
    audioElement.play();
  }
};

// 3. Initialize Session
socket.send(
  JSON.stringify({
    type: "initialize",
    user_uuid: "abc123",
    mood: "excited",
    characterId: 12,
    playerId: 43,
    sceneId: 0,
    audioSupport: true,
    language: "en-US",
    audioFormat: "mp3_44100_32",
  }),
);

// 4. Send a text message
socket.send(
  JSON.stringify({
    type: "message",
    user_uuid: "abc123",
    session_id: 42, // Use the sessionId from the server
    message: "Hello, what's happening?",
  }),
);

// 5. Or send an audio message (base64 example)
socket.send(
  JSON.stringify({
    type: "audio",
    user_uuid: "abc123",
    session_id: 42,
    encoding: "audio/pcm", // or "audio/webm"; the base64 payload goes in "data"
    data: "UklGRnQAAABXQVZFZm10IBIAAAABAAEAQB8AAIA+AAACABAAZGF0YQAAAAA=",
  }),
);

Conclusion

With user_uuid and session_id included in each message, the application can maintain conversations per user and session in real time. Always:

  1. Obtain your JWT via /token.
  2. Open a WebSocket connection to /ws.
  3. Immediately provide your api_token.
  4. Use type = "initialize" to start or resume a session (providing user_uuid and any known session_id).
  5. For each text or audio message, include the same user_uuid and session_id.
  6. Receive text (and optional audio) responses from the server, maintaining context across messages.