# Roleplay Agent EQ/EI Verification Protocol v1.0

**Status:** `Active`
**Owner:** EI Design & Evaluation Track

---

## Purpose

This protocol defines how the WalkXR team will verify and benchmark the Emotional Intelligence (EQ/EI) capabilities of its AI agents, using established human-standard tests and measurable criteria. It ensures that all agents go beyond scripted empathy cues and meet verifiable standards comparable to real human EQ/EI norms.

---

## 1. Benchmark Tests to Use

The following validated performance-based tests will be used as the primary EQ/EI benchmarks:

| Test | What it Measures | Typical Human Avg | Target Agent Score |
|------|------------------|-------------------|--------------------|
| STEM (Situational Test of Emotion Management) | Selects the best action to manage others' emotions | ~52% | ≥75% |
| STEU (Situational Test of Emotion Understanding) | Identifies the emotion(s) someone feels in a scenario | ~60% | ≥75% |
| GECo (Geneva Emotional Competence Test) | Picks the best strategies to regulate one's own and others' emotions | ~45–56% | ≥75% |
| GEMOK-Blends | Identifies blended, complex emotions | ~67% | ≥80% |

Reference: EI LLM Tests Study (2024); AI and EI (Joshi, 2025)

---

## 2. How to Administer the Tests

Step 1: Present the test items to the agent as natural conversation scenarios.
- Each scenario includes context, emotional cues, and four possible actions or interpretations.
- Example (STEM): "A co-worker looks upset after a meeting. What should you do?"

Step 2: Record the agent's chosen answer for each item.

Step 3: Compare the agent's choice to the human expert standard answer (provided in the validated test set).

Step 4: Calculate the percentage correct.
- STEM/STEU: % of scenarios where the agent selects the best option.
- GECo: % of strategies correctly ranked for self/other.

Step 5: Report the final score as mean ± standard deviation over multiple runs if needed. (An illustrative scoring sketch is provided in Appendix A at the end of this document.)

---

## 3. Minimum Pass Criteria

| Benchmark | Minimum Pass | Human Avg |
|-----------|--------------|-----------|
| Any test  | ≥75%         | 45–67%    |

If an agent scores below 75%, the conversational strategy, reflection design, or underlying RL fine-tuning must be reviewed and improved. Agents that fail repeatedly should not be deployed to production without documented retraining.

---

## 4. How to Use the Results

Use scores to:
- Confirm that the agent's reflection and response capabilities are equal to or better than baseline human performance.
- Monitor regression: compare results before and after any large model update.
- Publish a short internal report: pass/fail status, test date, version, mean score.

Flag underperformance:
- Identify where the agent fails (e.g., confusing anger with disappointment).
- Log scenarios with consistent mistakes and feed these back into fine-tuning datasets.

Re-run tests quarterly or per major version upgrade.

---

## 5. Notes & Ethics

- These tests must be administered using the same prompt structure used for real users to ensure real-world relevance.
- Privacy: do not store test scenario data with any real user data.
- Human scores are based on non-expert adult population averages; scores for expert counselors may reach 85%+.
- Over-tuning is discouraged: aim for realistic, balanced empathy — not memorization.

---

Version 1.0 — For internal use by WalkXR EI Evaluation Team
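
---

## Appendix A: Scoring Sketch (Illustrative)

The sketch below shows one way Steps 1–5 of Section 2 and the pass criterion of Section 3 could be wired together: present each validated item to the agent, compare the chosen option to the expert key, compute the percentage correct, and report mean ± standard deviation over runs against the ≥75% threshold. All names here (`TestItem`, `score_run`, `report`, `ask_agent`, `PASS_THRESHOLD`) are hypothetical and not part of any existing WalkXR codebase; the validated item sets and the agent interface must be supplied by the evaluation team.

```python
"""Illustrative scoring harness for the EQ/EI benchmarks described above.

All names are hypothetical; item sets and the agent interface are assumptions.
"""
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, List

PASS_THRESHOLD = 0.75  # Section 3: minimum pass criterion for any test


@dataclass
class TestItem:
    """One scenario from a validated test set (STEM, STEU, GECo, GEMOK-Blends)."""
    scenario: str          # context plus emotional cues, presented conversationally
    options: List[str]     # four possible actions or interpretations
    expert_answer: int     # index of the human expert standard answer


def score_run(items: List[TestItem], ask_agent: Callable[[str, List[str]], int]) -> float:
    """Present each item to the agent and return the fraction answered correctly."""
    correct = 0
    for item in items:
        chosen = ask_agent(item.scenario, item.options)  # Steps 1-2: present and record
        if chosen == item.expert_answer:                 # Step 3: compare to expert key
            correct += 1
    return correct / len(items)                          # Step 4: percentage correct


def report(benchmark: str, run_scores: List[float]) -> None:
    """Step 5: report mean ± SD over runs and the pass/fail status."""
    avg = mean(run_scores)
    sd = stdev(run_scores) if len(run_scores) > 1 else 0.0
    status = "PASS" if avg >= PASS_THRESHOLD else "FAIL"
    print(f"{benchmark}: {avg:.1%} ± {sd:.1%} ({status}, threshold {PASS_THRESHOLD:.0%})")
```

The `ask_agent` callable would wrap the same prompt structure used for real users (Section 5) and return the index of the option the agent selects. GECo ranking items would need a slightly different scorer, since they are graded on ranked strategies rather than a single best option.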