
Roleplay Agent EQ/EI Verification Protocol v1.0

Status: Active
Owner: EI Design & Evaluation Track


Purpose

This protocol defines how the WalkXR team will verify and benchmark the Emotional Intelligence (EQ/EI) capabilities of its AI agents, using established, human-validated tests and measurable criteria.

It ensures that all agents go beyond scripted empathy cues and meet verifiable standards benchmarked against real human EQ/EI norms.


1. Benchmark Tests to Use

The following validated performance-based tests will be used as the primary EQ/EI benchmarks:

| Test | What it Measures | Typical Human Avg | Target Agent Score |
| --- | --- | --- | --- |
| STEM (Situational Test of Emotion Management) | Selects the best action to manage others’ emotions | ~52% | ≥75% |
| STEU (Situational Test of Emotion Understanding) | Identifies the emotion(s) someone feels in a scenario | ~60% | ≥75% |
| GECo (Geneva Emotion Competence Test) | Picks the best strategies to regulate one’s own and others’ emotions | ~45–56% | ≥75% |
| GEMOK-Blends | Identifies blended, complex emotions | ~67% | ≥80% |

Reference: EI LLM Tests Study (2024); AI and EI (Joshi, 2025)
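
To make the thresholds above machine-checkable, they can be captured in one shared configuration that the administration and pass/fail steps below reference. A minimal sketch; the `BENCHMARKS` name and structure are illustrative, not part of an existing WalkXR module:

```python
# Illustrative benchmark configuration mirroring the table above.
# Names and structure are assumptions, not an existing WalkXR API.
BENCHMARKS = {
    "STEM": {"human_avg": 0.52, "target": 0.75},
    "STEU": {"human_avg": 0.60, "target": 0.75},
    "GECo": {"human_avg": (0.45, 0.56), "target": 0.75},  # range across subscales
    "GEMOK-Blends": {"human_avg": 0.67, "target": 0.80},
}
```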


2. How to Administer the Tests

Step 1: Present the test items as natural conversation scenarios to the agent.

  • Each scenario includes: context, emotional cues, and four candidate actions or interpretations.
  • Example (STEM): "A co-worker looks upset after a meeting. What should you do?"
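
A minimal sketch of Step 1, assuming a simple item schema; the `TestItem` fields and `format_item` helper are hypothetical, and the real template should match the prompt structure used with live users (see Section 5):

```python
from dataclasses import dataclass

@dataclass
class TestItem:
    """One validated test item (e.g., a STEM scenario)."""
    context: str        # situation description, including emotional cues
    options: list[str]  # four candidate actions or interpretations
    answer_index: int   # index of the expert-standard answer

def format_item(item: TestItem) -> str:
    """Render a test item as a natural conversational prompt, not a quiz form."""
    lines = [item.context, "", "What would you do?"]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)
```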

Step 2: Record the agent’s chosen answer for each item.

Step 3: Compare the agent’s choice to the expert-standard answer provided in the validated test set.

Step 4: Calculate % correct.

  • STEM/STEU: % of scenarios where the agent selects the best option.
  • GECo: % of strategies correctly ranked for self/other.

Step 5: Report the final score as mean ± standard deviation across multiple runs when agent outputs vary between runs.
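
Steps 2–5 can be condensed into a small scoring loop. A sketch under the assumption that `agent_answer(prompt)` returns the index of the agent’s chosen option (parsing that index from the agent’s free-text reply is left open), reusing the hypothetical `format_item` helper above:

```python
import statistics

def run_benchmark(items, agent_answer, n_runs=3):
    """Administer all items n_runs times; return (mean, SD) of fraction correct.

    agent_answer: callable(prompt: str) -> int, the agent's chosen option index.
    """
    scores = []
    for _ in range(n_runs):
        correct = sum(
            agent_answer(format_item(item)) == item.answer_index
            for item in items
        )
        scores.append(correct / len(items))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, sd
```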


3. Minimum Pass Criteria

| Benchmark | Minimum Pass | Human Avg |
| --- | --- | --- |
| Any test | ≥75% | 45–67% |

If an agent scores below 75%, the conversational strategy, reflection design, or underlying RL fine-tuning must be reviewed and improved.

Agents that fail repeatedly should not be deployed to production without documented retraining.
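
A sketch of the pass gate, reusing the illustrative `BENCHMARKS` mapping from Section 1:

```python
def passes(test_name: str, mean_score: float) -> bool:
    """Apply the minimum pass criterion for one benchmark."""
    return mean_score >= BENCHMARKS[test_name]["target"]

# Example: a STEM mean of 0.71 fails the 0.75 gate and triggers review.
assert not passes("STEM", 0.71)
```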


4. How to Use the Results

Use scores to:

  • Confirm that the agent’s reflection and response capabilities equal or exceed baseline human performance.
  • Monitor regression: compare results before/after any large model update.
  • Publish a short internal report: pass/fail status, test date, version, mean score.

Flag underperformance:

  • Identify where the agent fails (e.g., confusing anger with disappointment).
  • Log scenarios with consistent mistakes and feed them back into fine-tuning datasets.

Re-run tests quarterly or after any major version upgrade.
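
For the regression monitoring described above, one illustrative approach is to compare per-benchmark means from two evaluation snapshots; the snapshot format (a simple name-to-score mapping) is an assumption:

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Return benchmarks whose mean score dropped by more than `tolerance`
    between two snapshots (e.g., before and after a large model update)."""
    return [
        name for name, prev in before.items()
        if name in after and after[name] < prev - tolerance
    ]

# Example: flags GECo falling from 0.80 to 0.74 after an update.
print(regressions({"STEM": 0.78, "GECo": 0.80}, {"STEM": 0.79, "GECo": 0.74}))
```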


5. Notes & Ethics

  • These tests must be administered using the same prompt structure as real user sessions to ensure real-world relevance.
  • Privacy: Do not store test scenario data with any real user data.
  • Human scores are based on non-expert adult population averages; scores for expert counselors may reach 85%+.
  • Over-tuning is discouraged: aim for realistic, balanced empathy — not memorization.

Version 1.0 — For internal use by WalkXR EI Evaluation Team