RP-Bench

Help calibrate the roleplay benchmark by rating AI responses. Your votes let us check whether our LLM judges agree with real humans.

Multi-Turn Arena

Compare full 12-turn RP sessions side by side. Tests consistency, degradation, and narrative momentum across a whole scene. ~5 min per vote.

Two responses side by side, models hidden. Voting is closed; see results below.

Score a single response on 12 dimensions. This is more detailed and helps us understand what makes RP good or bad, not just which response is better.

See aggregated votes and how they compare to the LLM judges.