# Roleplay Agent EQ/EI Verification Protocol v1.0

**Status:** `Active`
**Owner:** EI Design & Evaluation Track

---

## Purpose

This protocol defines how the WalkXR team will verify and benchmark the Emotional Intelligence (EQ/EI) capabilities of its AI agents, using established human-standard tests and measurable criteria. It ensures that all agents go beyond scripted empathy cues and meet verifiable standards comparable to real human EQ/EI norms.

---

## 1. Benchmark Tests to Use

The following validated performance-based tests will be used as the primary EQ/EI benchmarks:

| Test | What it Measures | Typical Human Avg | Target Agent Score |
|------|------------------|-------------------|--------------------|
| STEM (Situational Test of Emotion Management) | Selects the best action to manage others' emotions | ~52% | ≥75% |
| STEU (Situational Test of Emotion Understanding) | Identifies the emotion(s) someone feels in a scenario | ~60% | ≥75% |
| GECo (Geneva Emotional Competence Test) | Picks the best strategies to regulate one's own and others' emotions | ~45–56% | ≥75% |
| GEMOK-Blends | Identifies blended, complex emotions | ~67% | ≥80% |

Reference: EI LLM Tests Study (2024); AI and EI (Joshi, 2025)

---

## 2. How to Administer the Tests

Step 1: Present the test items to the agent as natural conversation scenarios.
- Each scenario includes context, emotional cues, and four possible actions or interpretations.
- Example (STEM): "A co-worker looks upset after a meeting. What should you do?"

Step 2: Record the agent's chosen answer for each item.

Step 3: Compare the agent's choice to the human expert standard answer (provided in the validated test set).

Step 4: Calculate the percentage correct.
- STEM/STEU: % of scenarios where the agent selects the best option.
- GECo: % of strategies correctly ranked for self/other.

Step 5: Report the final score as mean ± standard deviation over multiple runs if needed. (An illustrative scoring sketch is provided in Appendix A at the end of this document.)

---

## 3. Minimum Pass Criteria

| Benchmark | Minimum Pass | Human Avg |
|-----------|--------------|-----------|
| Any test  | ≥75%         | 45–67%    |

If an agent scores below 75%, the conversational strategy, reflection design, or underlying RL fine-tuning must be reviewed and improved. Agents that fail repeatedly should not be deployed to production without documented retraining.

---

## 4. How to Use the Results

Use scores to:
- Confirm that the agent's reflection and response capabilities are equal to or better than baseline human performance.
- Monitor regression: compare results before and after any large model update.
- Publish a short internal report: pass/fail status, test date, version, mean score.

Flag underperformance:
- Identify where the agent fails (e.g., confusing anger with disappointment).
- Log scenarios with consistent mistakes and feed these back into fine-tuning datasets.

Re-run tests quarterly or per major version upgrade.

---

## 5. Notes & Ethics

- These tests must be administered using the same prompt structure used for real users to ensure real-world relevance.
- Privacy: do not store test scenario data with any real user data.
- Human scores are based on non-expert adult population averages; scores for expert counselors may reach 85%+.
- Over-tuning is discouraged: aim for realistic, balanced empathy — not memorization.

---

Version 1.0 — For internal use by WalkXR EI Evaluation Team
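
---

## Appendix A: Scoring Sketch (Illustrative)

The sketch below shows one way Steps 1–5 of Section 2 and the pass criterion of Section 3 could be wired together: present each validated item to the agent, compare the chosen option to the expert key, compute the percentage correct, and report mean ± standard deviation over runs against the ≥75% threshold. All names here (`TestItem`, `score_run`, `report`, `ask_agent`, `PASS_THRESHOLD`) are hypothetical and not part of any existing WalkXR codebase; the validated item sets and the agent interface must be supplied by the evaluation team.

```python
"""Illustrative scoring harness for the EQ/EI benchmarks described above.

All names are hypothetical; item sets and the agent interface are assumptions.
"""
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, List

PASS_THRESHOLD = 0.75  # Section 3: minimum pass criterion for any test


@dataclass
class TestItem:
    """One scenario from a validated test set (STEM, STEU, GECo, GEMOK-Blends)."""
    scenario: str          # context plus emotional cues, presented conversationally
    options: List[str]     # four possible actions or interpretations
    expert_answer: int     # index of the human expert standard answer


def score_run(items: List[TestItem], ask_agent: Callable[[str, List[str]], int]) -> float:
    """Present each item to the agent and return the fraction answered correctly."""
    correct = 0
    for item in items:
        chosen = ask_agent(item.scenario, item.options)  # Steps 1-2: present and record
        if chosen == item.expert_answer:                 # Step 3: compare to expert key
            correct += 1
    return correct / len(items)                          # Step 4: percentage correct


def report(benchmark: str, run_scores: List[float]) -> None:
    """Step 5: report mean ± SD over runs and the pass/fail status."""
    avg = mean(run_scores)
    sd = stdev(run_scores) if len(run_scores) > 1 else 0.0
    status = "PASS" if avg >= PASS_THRESHOLD else "FAIL"
    print(f"{benchmark}: {avg:.1%} ± {sd:.1%} ({status}, threshold {PASS_THRESHOLD:.0%})")
```

The `ask_agent` callable would wrap the same prompt structure used for real users (Section 5) and return the index of the option the agent selects. GECo ranking items would need a slightly different scorer, since they are graded on ranked strategies rather than a single best option.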