Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
- Paper: arxiv.org
- Blog post & Audio samples: kyutai.org
Overview
This model is a full-duplex spoken dialogue model post-trained with reinforcement learning (RL) to improve interactivity. Starting from Moshi (Défossez et al., 2024) or PersonaPlex (Roy et al., 2026), our post-training targets the four canonical axes of full-duplex interactivity: pause handling, turn-taking, backchanneling, and user interruption. Training uses GRPO with axis-specific rewards, plus an LLM Judge reward to preserve response quality.
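To make the GRPO step more concrete, here is a minimal sketch of group-relative advantage computation: several responses are sampled for the same input, each is scored, and the scores are normalized within that group. The weighted-sum combination of the axis-specific reward and the LLM Judge reward, the `judge_weight` parameter, and all function names are assumptions made for illustration, not the implementation used in the paper.

```python
from statistics import mean, pstdev

def combine_rewards(axis_reward, judge_reward, judge_weight=0.5):
    """Hypothetical combination: weighted sum of the two reward signals."""
    return axis_reward + judge_weight * judge_reward

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to the same segment (made-up scores).
axis_rewards = [1.0, 0.0, 1.0, 0.0]
judge_rewards = [0.8, 0.6, 0.2, 0.4]
rewards = [combine_rewards(a, j) for a, j in zip(axis_rewards, judge_rewards)]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and the others a negative one, so no separate value model is needed to provide a baseline.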
Compared to the base models, the post-trained models reduce cases where the model inappropriately barges in on the user, substantially improve turn-taking response latency, and promote well-timed backchanneling, as evaluated on both Full-Duplex-Bench v1 (static, using pre-recorded audio input) and Full-Duplex-Bench v2 (dynamic, using real-time multi-turn dialogue).
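As a rough illustration of two of the quantities mentioned above, the sketch below computes a turn-taking response latency and a simple barge-in flag from speech intervals. The interval representation, the function names, and the exact definitions are assumptions for illustration; Full-Duplex-Bench defines its own metrics, which may differ.

```python
def response_latency(user_turn_end, model_segments):
    """Seconds between the end of the user's turn and the model's next speech onset.

    Returns None if the model never starts speaking after the user's turn ends.
    """
    onsets = [start for start, _ in model_segments if start >= user_turn_end]
    return min(onsets) - user_turn_end if onsets else None

def barges_in(user_segments, model_segments):
    """True if any model speech onset falls strictly inside a user speech segment."""
    return any(
        u_start < m_start < u_end
        for m_start, _ in model_segments
        for u_start, u_end in user_segments
    )

# Toy timeline in seconds: the user speaks first, then the model answers.
user = [(0.0, 2.5)]
model = [(3.1, 5.0)]
latency = response_latency(2.5, model)
overlap = barges_in(user, model)
```

This simple flag would also fire on a well-timed backchannel, so a real evaluation has to distinguish short acknowledgements from genuine interruptions.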
Training Data
We construct RL training data from Seamless Interaction (Agrawal et al., 2025), a 4,000-hour two-party human conversation corpus in which each speaker is recorded on a separate channel. For each of the four interactivity axes, we use voice activity detection to automatically extract up to 2,000 relevant segments from this corpus.
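The extraction step is described only at a high level, so the following is a sketch of one way turn-taking candidates could be located from frame-level voice activity on the two channels. It assumes a VAD has already produced one boolean per frame; the 20 ms frame size, the `max_gap` threshold, and the function names are hypothetical, and only the cap of 2,000 segments comes from the description above.

```python
def speech_segments(vad_frames, frame_sec=0.02):
    """Convert per-frame booleans into (start_sec, end_sec) speech segments."""
    segments, start = [], None
    for i, active in enumerate(vad_frames):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        # Speech runs until the end of the recording.
        segments.append((start * frame_sec, len(vad_frames) * frame_sec))
    return segments

def turn_taking_candidates(speaker_a, speaker_b, max_gap=1.0, limit=2000):
    """Find points where A stops and B starts speaking within max_gap seconds."""
    candidates = []
    for _, a_end in speaker_a:
        for b_start, _ in speaker_b:
            if 0.0 <= b_start - a_end <= max_gap:
                candidates.append((a_end, b_start))
                break
        if len(candidates) >= limit:
            break
    return candidates

# Toy example: speaker A talks for 1 s, speaker B answers 0.2 s later.
vad_a = [True] * 50 + [False] * 100
vad_b = [False] * 60 + [True] * 90
seg_a = speech_segments(vad_a)
seg_b = speech_segments(vad_b)
events = turn_taking_candidates(seg_a, seg_b)
```

Because each speaker is on a separate channel, the VAD can be run per channel without any speaker diarization, which is what makes this kind of rule-based extraction practical.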
Models
We release two RL-trained models, one for each base model.
- 🤗 kyutai/moshika-rl-seamless: based on kyutai/moshika-pytorch-bf16
- 🤗 kyutai/personaplex-rl-seamless: based on nvidia/personaplex-7b-v1
Usage
This model uses the PersonaPlex architecture and is compatible with the official PersonaPlex inference code. PersonaPlex extends Moshi with support for dialogue control via a text prompt and voice cloning via an audio prompt.
```bash
git clone https://github.com/NVIDIA/personaplex
cd personaplex
pip install moshi/.
python -m moshi.server --hf-repo kyutai/personaplex-rl-seamless
```
Before the conversation begins, you provide a text prompt and a voice prompt. See the official PersonaPlex README for further details on installation and usage.
In our experiments, we used one of the official text prompts, `You enjoy having a good conversation.`, throughout, except for the User Interruption task of Full-Duplex-Bench v1, where we used `You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way.` as suggested in the official repository. For the voice prompt, we used a 3-second clip, SC0001.wav. Note that the web UI only lets you select from the 18 fixed voice prompts provided by NVIDIA.
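If you want to prepare a short voice prompt of your own, similar in length to the 3-second clip mentioned above, a small helper along these lines may be useful. It is only a sketch using Python's standard `wave` module: the function name is hypothetical, it handles uncompressed PCM WAV files only, and it keeps the input's sample rate and channel layout unchanged. Which audio format the PersonaPlex inference code expects, and how a custom voice prompt is passed to it, is not covered here; see the official PersonaPlex README.

```python
import wave

def trim_wav(src_path, dst_path, seconds=3.0):
    """Write the first `seconds` of a PCM WAV file to dst_path, keeping its format.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The header's frame count is corrected automatically on close.
        dst.writeframes(frames)
    return n_frames
```

For example, `trim_wav("my_voice.wav", "my_voice_3s.wav")` (hypothetical file names) would keep the first three seconds of the recording.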
Bias, Risks, and Limitations
This model is intended for research use only and is not recommended for providing advice or performing any professional duty. It should not be used to impersonate other people or for any malicious purpose.
The rule-based reward design for each axis requires manual engineering and becomes increasingly difficult to scale as the number of axes grows. We have also observed that the conversational style of the training data can affect the model's safety behavior, making the incorporation of safety-aware rewards or constraints into the RL process an important direction for future work.
License
This model's weights have the following lineage, each layer carrying its own license:
| Model | Weights | License |
|---|---|---|
| Moshi | kyutai/moshiko-pytorch-bf16 | CC BY 4.0 |
| PersonaPlex | nvidia/personaplex-7b-v1 delta from Moshi | NVIDIA Open Model License |
| This model | kyutai/personaplex-rl-seamless delta from PersonaPlex | CC BY-NC 4.0 |
For practical purposes, we distribute the final merged weights directly. Users must comply with both the CC BY-NC 4.0 license and the NVIDIA Open Model License. Please refer to each license for the full terms and conditions.
Citation
@article{ohashi2026multifaceted,
title={Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models},
author={Ohashi, Atsumoto and Zeghidour, Neil and D{\'e}fossez, Alexandre and Kharitonov, Eugene},
journal={arXiv preprint arXiv:2606.11167},
year={2026}
}