
Roleplay Agent EQ/EI Verification Protocol v1.0

Status: Active
Owner: EI Design & Evaluation Track


Purpose

This protocol defines how the WalkXR team will verify and benchmark the Emotional Intelligence (EQ/EI) capabilities of its AI agents, using established, human-validated tests and measurable criteria.

It ensures that all agents go beyond scripted empathy cues and meet verifiable standards benchmarked against real human EQ/EI norms.


1. Benchmark Tests to Use

The following validated performance-based tests will be used as the primary EQ/EI benchmarks:

| Test | What it Measures | Typical Human Avg | Target Agent Score |
| --- | --- | --- | --- |
| STEM (Situational Test of Emotion Management) | Selects the best action to manage others’ emotions | ~52% | ≥75% |
| STEU (Situational Test of Emotion Understanding) | Identifies the emotion(s) someone feels in a scenario | ~60% | ≥75% |
| GECo (Geneva Emotion Competence Test) | Picks the best strategies to regulate one’s own and others’ emotions | ~45–56% | ≥75% |
| GEMOK-Blends | Identifies blended, complex emotions | ~67% | ≥80% |

Reference: EI LLM Tests Study (2024); AI and EI (Joshi, 2025)
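
To make the thresholds above machine-checkable, they can be captured in one shared configuration that the administration and pass/fail steps below reference. A minimal sketch; the `BENCHMARKS` name and structure are illustrative, not part of an existing WalkXR module:

```python
# Illustrative benchmark configuration mirroring the table above.
# Names and structure are assumptions, not an existing WalkXR API.
BENCHMARKS = {
    "STEM": {"human_avg": 0.52, "target": 0.75},
    "STEU": {"human_avg": 0.60, "target": 0.75},
    "GECo": {"human_avg": (0.45, 0.56), "target": 0.75},  # range across subscales
    "GEMOK-Blends": {"human_avg": 0.67, "target": 0.80},
}
```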


2. How to Administer the Tests

Step 1: Present the test items as natural conversation scenarios to the agent.

  • Each scenario includes: context, emotional cues, and four candidate actions or interpretations.
  • Example (STEM): "A co-worker looks upset after a meeting. What should you do?"
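
A minimal sketch of Step 1, assuming a simple item schema; the `TestItem` fields and `format_item` helper are hypothetical, and the real template should match the prompt structure used with live users (see Section 5):

```python
from dataclasses import dataclass

@dataclass
class TestItem:
    """One validated test item (e.g., a STEM scenario)."""
    context: str        # situation description, including emotional cues
    options: list[str]  # four candidate actions or interpretations
    answer_index: int   # index of the expert-standard answer

def format_item(item: TestItem) -> str:
    """Render a test item as a natural conversational prompt, not a quiz form."""
    lines = [item.context, "", "What would you do?"]
    lines += [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item.options)]
    return "\n".join(lines)
```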

Step 2: Record the agent’s chosen answer for each item.

Step 3: Compare the agent’s choice to the expert-standard answer provided in the validated test set.

Step 4: Calculate % correct.

  • STEM/STEU: % of scenarios where the agent selects the best option.
  • GECo: % of strategies correctly ranked for self/other.

Step 5: Report the final score as mean ± standard deviation across multiple runs when agent outputs vary between runs.
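
Steps 2–5 can be condensed into a small scoring loop. A sketch under the assumption that `agent_answer(prompt)` returns the index of the agent’s chosen option (parsing that index from the agent’s free-text reply is left open), reusing the hypothetical `format_item` helper above:

```python
import statistics

def run_benchmark(items, agent_answer, n_runs=3):
    """Administer all items n_runs times; return (mean, SD) of fraction correct.

    agent_answer: callable(prompt: str) -> int, the agent's chosen option index.
    """
    scores = []
    for _ in range(n_runs):
        correct = sum(
            agent_answer(format_item(item)) == item.answer_index
            for item in items
        )
        scores.append(correct / len(items))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, sd
```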


3. Minimum Pass Criteria

| Benchmark | Minimum Pass | Human Avg |
| --- | --- | --- |
| Any test | ≥75% | 45–67% |

If an agent scores below 75%, the conversational strategy, reflection design, or underlying RL fine-tuning must be reviewed and improved.

Agents that fail repeatedly should not be deployed to production without documented retraining.
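
A sketch of the pass gate, reusing the illustrative `BENCHMARKS` mapping from Section 1:

```python
def passes(test_name: str, mean_score: float) -> bool:
    """Apply the minimum pass criterion for one benchmark."""
    return mean_score >= BENCHMARKS[test_name]["target"]

# Example: a STEM mean of 0.71 fails the 0.75 gate and triggers review.
assert not passes("STEM", 0.71)
```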


4. How to Use the Results

Use scores to:

  • Confirm that the agent’s reflection and response capabilities equal or exceed baseline human performance.
  • Monitor regression: compare results before/after any large model update.
  • Publish a short internal report: pass/fail status, test date, version, mean score.

Flag underperformance:

  • Identify where the agent fails (e.g., confusing anger with disappointment).
  • Log scenarios with consistent mistakes and feed them back into fine-tuning datasets.

Re-run tests quarterly or after any major version upgrade.
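
For the regression monitoring described above, one illustrative approach is to compare per-benchmark means from two evaluation snapshots; the snapshot format (a simple name-to-score mapping) is an assumption:

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Return benchmarks whose mean score dropped by more than `tolerance`
    between two snapshots (e.g., before and after a large model update)."""
    return [
        name for name, prev in before.items()
        if name in after and after[name] < prev - tolerance
    ]

# Example: flags GECo falling from 0.80 to 0.74 after an update.
print(regressions({"STEM": 0.78, "GECo": 0.80}, {"STEM": 0.79, "GECo": 0.74}))
```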


5. Notes & Ethics

  • These tests must be administered using the same prompt structure as real user sessions to ensure real-world relevance.
  • Privacy: Do not store test scenario data with any real user data.
  • Human scores are based on non-expert adult population averages; scores for expert counselors may reach 85%+.
  • Over-tuning is discouraged: aim for realistic, balanced empathy — not memorization.

Version 1.0 — For internal use by WalkXR EI Evaluation Team