RP-Bench
Help calibrate the roleplay benchmark by rating AI responses. Your votes validate whether our LLM judges agree with real humans.
Multi-Turn Arena
Active
Compare full 12-turn RP sessions side by side. Tests consistency, degradation, and narrative momentum across a whole scene. ~5 min per vote.
Single-Turn Arena
Complete (2,000+ votes)
Two responses side by side, models hidden. Voting is closed; see results below.
Rubric Score
Score a single response on 12 dimensions. This mode is more detailed and helps us understand what makes RP good or bad, not just which response is better.
Results
See aggregated votes and how they compare to the LLM judges.