RP-Bench
Help calibrate the roleplay benchmark by rating AI responses. Your votes let us check whether our LLM judges agree with real humans.
Arena
Two responses, side by side, with the models hidden. Pick the better one. Fast and fun: each vote takes about 30 seconds.
Rubric Score
Score a single response on 12 dimensions. This is more detailed and helps us understand what makes a roleplay response good or bad, not just which one is better.
Results
See aggregated votes and how they compare to the LLM judges.