Status: Active
Owner: EI Design & Evaluation Track
This protocol defines how the WalkXR team will verify and benchmark the Emotional Intelligence (EQ/EI) capabilities of its AI agents, using established human-standard tests and measurable criteria.
This ensures that all agents go beyond scripted empathy cues and meet verifiable standards comparable to real human EQ/EI norms.
The following validated performance-based tests will be used as the primary EQ/EI benchmarks:
| Test | What it Measures | Typical Human Avg | Target Agent Score |
|---|---|---|---|
| STEM (Situational Test of Emotion Management) | Selects best action to manage others’ emotions | ~52% | ≥75% |
| STEU (Situational Test of Emotion Understanding) | Identifies emotion(s) someone feels in a scenario | ~60% | ≥75% |
| GECo (Geneva Emotion Competence Test) | Picks best strategies to regulate own and others’ emotions | ~45–56% | ≥75% |
| GEMOK-Blends | Identifies blended, complex emotions | ~67% | ≥80% |
References: EI LLM Tests Study (2024); AI and EI (Joshi, 2025)
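To support automated runs, the targets above could be encoded as a small configuration. The snippet below is a hypothetical sketch in Python that mirrors the table; `EI_BENCHMARKS` is not an existing WalkXR module or file.

```python
# Hypothetical encoding of the benchmark targets above.
# Names and values mirror the table; this is not an existing WalkXR config.
EI_BENCHMARKS = {
    "STEM":         {"human_avg": 0.52, "target": 0.75},
    "STEU":         {"human_avg": 0.60, "target": 0.75},
    "GECo":         {"human_avg": 0.50, "target": 0.75},  # human avg reported as ~45-56%
    "GEMOK-Blends": {"human_avg": 0.67, "target": 0.80},
}
```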
Step 1: Present the test items as natural conversation scenarios to the agent.
Step 2: Record the agent’s chosen answer for each item.
Step 3: Compare the agent’s choice to the human expert standard answer (provided in the validated test set).
Step 4: Calculate % correct.
Step 5: Report the final score as the mean ± standard deviation over multiple runs if needed (see the sketch below).
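A minimal sketch of Steps 1–5, assuming the validated test set is available as a list of items with a scenario, answer options, and an expert-keyed correct index, and that the agent exposes a hypothetical `answer(scenario, options)` call returning the index of its chosen option. This is an illustration of the scoring procedure, not the team's existing harness.

```python
import statistics

def run_ei_benchmark(agent, items, n_runs=3):
    """Score an agent on a validated EI test set (Steps 1-5).

    `items`: list of dicts with keys "scenario", "options", and
    "correct_index" (the expert-keyed answer) -- assumed structure.
    `agent.answer(scenario, options)`: hypothetical interface returning
    the index of the option the agent selects.
    """
    run_scores = []
    for _ in range(n_runs):
        correct = 0
        for item in items:
            # Step 1: present the item as a natural conversation scenario.
            choice = agent.answer(item["scenario"], item["options"])
            # Steps 2-3: record the choice and compare to the expert key.
            if choice == item["correct_index"]:
                correct += 1
        # Step 4: calculate % correct for this run.
        run_scores.append(100.0 * correct / len(items))
    # Step 5: report mean +/- standard deviation across runs.
    mean = statistics.mean(run_scores)
    sd = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean, sd
```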
| Benchmark | Minimum Pass Score | Human Avg (range across tests) |
|---|---|---|
| Any test | ≥75% | ~45–67% |
If an agent scores below 75%, the conversational strategy, reflection design, or underlying RL fine-tuning must be reviewed and improved.
Agents that fail repeatedly should not be deployed to production without documented retraining.
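One way to automate this pass/fail gate, assuming scores are expressed as percent correct and reusing the hypothetical `EI_BENCHMARKS` mapping sketched earlier:

```python
def flag_underperformance(scores, benchmarks):
    """Return the tests whose mean score falls below the per-test target.

    `scores`: mapping of test name -> mean % correct, e.g. {"STEM": 78.0, ...}.
    `benchmarks`: the hypothetical EI_BENCHMARKS mapping sketched above.
    Tests with no recorded score are treated as failing.
    """
    return [
        name for name, spec in benchmarks.items()
        if scores.get(name, 0.0) < spec["target"] * 100
    ]

# Example: an agent below target on GEMOK-Blends would be flagged for review
# of its conversational strategy, reflection design, or RL fine-tuning.
```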
Use scores to:
- Flag underperformance and trigger the review and retraining process described above.

Re-run tests quarterly or with each major version upgrade.
Version 1.0 — For internal use by WalkXR EI Evaluation Team