---
name: xona.x402.fi
description: xona.x402.fi provides two AI media generation endpoints: text-to-speech audio synthesis and text-to-image generation via the Qwen multimodal model. Both are single-shot, pay-per-request services billed at $0.05 per call on Solana via x402 payment gating.
host: xona.x402.fi
---

# xona.x402.fi

Xona is a small-footprint AI media generation host offering two independent capabilities: converting text to spoken audio and generating images from text or image prompts. It serves agents that need on-demand media asset creation without subscriptions, using Solana micropayments per request. It does not offer streaming, voice cloning, image analysis, or model fine-tuning.
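
The per-request payment gate can be pictured as a two-step HTTP exchange. The sketch below assumes the usual x402 pattern, in which an unpaid request is answered with HTTP 402 and the client retries once with a payment proof in an `X-PAYMENT` header. The header name, the `Response` type, and the `send` and `sign_payment` callables are illustrative assumptions, not details confirmed by this host's documentation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: bytes = b""
    headers: Dict[str, str] = field(default_factory=dict)


# send(url, body, headers) -> Response; injected so any HTTP client can be used.
Send = Callable[[str, bytes, Dict[str, str]], Response]


def call_with_payment(
    send: Send,
    url: str,
    body: bytes,
    sign_payment: Callable[[Response], str],
) -> Response:
    """POST once; if the server answers 402, attach a payment proof and retry once."""
    headers = {"Content-Type": "application/json"}
    first = send(url, body, headers)
    if first.status != 402:
        return first  # accepted already, or failed for a non-payment reason

    # Assumption: the 402 response carries the payment terms, and
    # sign_payment turns them into a proof string for the retry header.
    paid_headers = dict(headers)
    paid_headers["X-PAYMENT"] = sign_payment(first)
    return send(url, body, paid_headers)
```

Injecting `send` keeps the sketch independent of any particular HTTP library and lets the retry logic be exercised without a network or a wallet.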

## When to use this host

Use xona.x402.fi when an agent needs single-shot text-to-speech audio or text-to-image generation, paid per request in Solana micropayments with no subscription. Do not use it for real-time or streaming audio synthesis, voice cloning, image classification, object detection, or image editing; the host offers none of these. For image analysis or vision tasks, route to a multimodal vision API. For high-volume or streaming TTS, consider a dedicated TTS platform with streaming support. The two skills are independent and each call costs $0.05, which suits lightweight, infrequent media generation tasks.
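
The routing guidance above can be expressed as a small lookup. This is a minimal sketch: the skill names come from this document, while the task-kind labels (`"text-to-speech"`, `"voice-cloning"`, and so on) and the `route_task` function are hypothetical names chosen for illustration.

```python
from typing import Optional

# Task kinds this host covers, mapped to the skill names listed in this document.
SUPPORTED = {
    "text-to-speech": "synthesize-text-to-speech",
    "text-to-image": "generate-qwen-image",
}

# Task kinds the guidance above explicitly rules out for this host.
UNSUPPORTED = {
    "streaming-tts",
    "voice-cloning",
    "image-classification",
    "object-detection",
    "image-editing",
}


def route_task(task_kind: str) -> Optional[str]:
    """Return the skill to call, or None if the task should go to another host."""
    if task_kind in UNSUPPORTED:
        return None
    # Unknown task kinds also fall through to None rather than guessing.
    return SUPPORTED.get(task_kind)
```

Returning `None` for anything not explicitly supported keeps an agent from spending a paid request on a task this host cannot perform.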

## Capabilities

### Audio Synthesis

Converts a text string into a natural-sounding audio file in a single API call, suitable for narration, accessibility output, or spoken responses.

- **`synthesize-text-to-speech`** — Converts a text string into natural-sounding speech audio via Xona's TTS endpoint, billed at $0.05 per request on Solana.

### Image Generation

Generates images from text prompts or combined text-and-image inputs using the Qwen multimodal model, producing a single image per request.

- **`generate-qwen-image`** — Generates images from text and/or image inputs using the Qwen image model via Xona's x402-gated API at $0.05 per request.

## Workflows

### Multimedia Asset Creation

*Use when an agent needs to produce both a visual image and a spoken audio description or narration for the same subject in one pipeline.*

1. **`generate-qwen-image`** — Generate an image from a text prompt describing the subject or scene.
2. **`synthesize-text-to-speech`** — Convert a descriptive or narration text about the same subject into spoken audio to accompany the generated image.
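
The two steps above can be sketched as an ordered plan with its total price. The skill names, input fields, and the $0.05 price are taken from this document; `plan_multimedia_asset` is a hypothetical helper, and actually sending the requests is left out.

```python
from decimal import Decimal
from typing import List, Tuple

PRICE_PER_CALL_USD = Decimal("0.05")  # per-request price stated in this document


def plan_multimedia_asset(
    image_prompt: str, narration: str
) -> Tuple[List[Tuple[str, dict]], Decimal]:
    """Return the ordered (skill, payload) steps and the total price in USD."""
    steps = [
        ("generate-qwen-image", {"prompt": image_prompt}),
        ("synthesize-text-to-speech", {"text": narration}),
    ]
    # Each step is one paid request, so the total is price times step count.
    return steps, PRICE_PER_CALL_USD * len(steps)
```

Because the two skills are independent, the order only reflects the workflow as written; nothing in the image result is fed into the speech request.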

## Skill reference

### `synthesize-text-to-speech`

**Xona Text-to-Speech** — Converts a text string into natural-sounding speech audio via Xona's TTS endpoint, billed at $0.05 per request on Solana.

*Use when:* An agent needs to convert a text string into spoken audio, such as voice narration, accessibility audio, or spoken responses.

*Not for:* Real-time streaming audio synthesis or voice cloning; this is a single-shot text-to-audio conversion endpoint.

**Inputs:**

- `text` (string, required) — The text content to be converted into speech audio.

**Returns:** Synthesized speech audio for the provided text, delivered as audio content from the Xona TTS endpoint.

**Example:** `{"text": "Hello, welcome to Xona's text-to-speech service."}`
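
A request body like the example above can be validated before any payment is attempted. This is a minimal sketch: the `text` field is the only documented input, while the non-empty check, the UTF-8 JSON encoding, and the `build_tts_body` name are assumptions made for illustration.

```python
import json


def build_tts_body(text: str) -> bytes:
    """Validate the single documented input and serialize it as a JSON body."""
    # Assumption: an empty or whitespace-only string is not worth a paid call.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    return json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
```

Rejecting bad input locally avoids paying $0.05 for a request that is likely to fail or return nothing useful.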

---

### `generate-qwen-image`

**Qwen Image Studio** — Generates images from text and/or image inputs using the Qwen image model via Xona's x402-gated API at $0.05 per request.

*Use when:* An agent needs to generate an image from a text prompt, or from a combination of text and image inputs, using the Qwen multimodal image model.

*Not for:* Image classification, object detection, or image analysis tasks; this endpoint is for image generation only.

**Inputs:**

- `prompt` (string, required) — Text description of the image to generate.

**Returns:** Generated image output from the Qwen image model, based on the provided text and/or image inputs.

**Example:** `{"prompt": "A futuristic cityscape at sunset with flying cars"}`
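
Since each request produces a single image at $0.05, generating several images means several paid requests. The sketch below caps a batch of prompts to a budget; `plan_image_batch` and the budget parameter are hypothetical, and each payload sends only `prompt` because that is the only input documented above.

```python
from decimal import Decimal
from typing import List

PRICE_PER_IMAGE_USD = Decimal("0.05")  # per-request price stated in this document


def plan_image_batch(prompts: List[str], budget_usd: Decimal) -> List[dict]:
    """Return one payload per prompt, truncated to what the budget affords."""
    # Floor division gives the number of whole requests the budget covers.
    affordable = max(int(budget_usd // PRICE_PER_IMAGE_USD), 0)
    return [{"prompt": p} for p in prompts[:affordable]]
```

Using `Decimal` rather than floats keeps the price arithmetic exact, which matters when counting how many whole requests fit in a small budget.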

---
