Real-Time Voice AI Hears but Does Not Listen

When a caller's words and delivery disagree, real-time voice systems go with the words. In one example, a 911 caller is crying but insists everything is fine, and GPT Realtime 2 ends the call.

Abstract

Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems—OpenAI’s GPT Realtime 2, Google’s Gemini 3.1 Flash Live, and Alibaba’s Qwen3.5 Omni Plus and Omni Flash—on tasks where the words and the delivery patterns both convey meaningful information. Across three consequential scenarios, all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic. Surprisingly, this is often not a failure of perception. When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions. We observe a similar pattern when these realtime voice systems estimate accent and age, as their responses frequently follow the biases of the words rather than the acoustic properties of the speaker. We term this disconnect between perception and action the emotional intelligence gap of voice AI. Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. Our findings show that current realtime voice AI systems often behave as if speech had been reduced to a transcript, suggesting that they should be used with caution in settings where the tone and emotion of delivery convey important information.

Figure 1. Overview of the scenarios, the expected action, and what the systems did. In each scenario the caller's wording and delivery point to opposite actions, so the expected action (third column) turns on the delivery. The realtime voice systems tend to do the opposite (fourth column), acting on the wording and against the delivery.

Experimental setup

We test four production real-time voice systems on tasks where the words and the voice point to different conclusions.

OpenAI · GPT Realtime 2 (gpt-realtime-2)

Google · Gemini 3.1 Flash Live (gemini-3.1-flash-live-preview)

Alibaba · Qwen3.5 Omni Plus Realtime (qwen3.5-omni-plus-realtime)

Alibaba · Qwen3.5 Omni Flash Realtime (qwen3.5-omni-flash-realtime)

We run two kinds of experiments. Multi-turn scenario calls measure the action a system takes in a consequential decision. Single-turn diagnostics measure what a system reports from the voice in isolation. The caller and the diagnostic stimuli are synthesized with ElevenLabs (eleven_v3). The systems produce their own spoken replies. Each condition is run five times.
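The size of the scenario experiment follows from this design. Below is a minimal sketch, assuming the scenario grid crosses the four systems with the three scenarios, the two deliveries (calm and marked), and five runs per condition. The scenario labels are illustrative names chosen here, not identifiers from the study.

```python
from itertools import product

# Model identifiers as listed above.
SYSTEMS = [
    "gpt-realtime-2",
    "gemini-3.1-flash-live-preview",
    "qwen3.5-omni-plus-realtime",
    "qwen3.5-omni-flash-realtime",
]
# Hypothetical labels for the three scenarios described in the abstract.
SCENARIOS = ["emergency_call", "wire_transfer", "enrollment"]
# Assumption: each scenario is run under a calm and a marked delivery.
DELIVERIES = ["calm", "marked"]
RUNS_PER_CONDITION = 5


def build_run_grid():
    """Return one (system, scenario, delivery, run_index) tuple per call."""
    return list(
        product(SYSTEMS, SCENARIOS, DELIVERIES, range(RUNS_PER_CONDITION))
    )


grid = build_run_grid()
print(len(grid))  # 4 * 3 * 2 * 5 = 120
```

Under these assumptions the grid contains 120 calls, which is consistent with the 120 runs mentioned in the Discussion.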

Multi-turn scenario calls

Words determine the consequential decisions

In each scenario the system plays the agent who decides what to do. Each scenario opens with a fixed clip recorded in two deliveries that share identical wording and differ only in how the words are spoken.

Figure 2. Scenario outcomes under the marked delivery. Each dot is one of five runs. A red dot marks a run that followed the script and a green dot one that followed the delivery.

Single-turn diagnostics

Emotional delivery is perceived but not acted on

Each diagnostic reuses a scenario's opening clip, in both its calm and marked versions, and asks the system over 20 runs whether the speaker sounds distressed, frightened, or sarcastic.

Three of the four systems report the target delivery far more often for the marked clip than for the matched neutral one, and well above a text-only baseline. Qwen3.5 Omni Flash is the exception.

Figure 3. Delivery labeling across audio and text-only conditions. Bars show the number of runs, out of 20, in which each model assigned the target delivery label. Green bars correspond to clips where the delivery was present, orange bars to calm or sincere control clips, and gray bars to the text-only condition.
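The bars in this figure amount to counting, for each condition, the runs in which a system assigned the target label. Below is a minimal sketch of that tally. The condition names, the boolean response format, and the sample values are placeholders invented for illustration, not data from the study.

```python
from collections import Counter


def count_target_labels(responses):
    """Count runs per condition in which the target label was assigned.

    `responses` is an iterable of (condition, assigned_target) pairs.
    The condition names used below ("marked", "control", "text_only")
    are hypothetical.
    """
    counts = Counter()
    for condition, assigned_target in responses:
        # Adding 0 still registers the condition, so it appears with count 0.
        counts[condition] += int(assigned_target)
    return counts


# Placeholder responses (four runs per condition), NOT results from the study.
responses = (
    [("marked", True)] * 3 + [("marked", False)]
    + [("control", False)] * 4
    + [("text_only", True)] + [("text_only", False)] * 3
)

counts = count_target_labels(responses)
for condition in ("marked", "control", "text_only"):
    print(condition, counts[condition])
```

A system that perceives the delivery should show a much higher count for the marked clips than for either the control clips or the text-only condition.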

Single-turn diagnostics

Accent and age are only partially perceived

The same conflict extends beyond emotional delivery. The accent and age diagnostics ask the model to identify a speaker's accent or age from a recording whose wording points to a different answer.

Accent

Five voices, each with a different English accent (Indian, Australian, Nigerian, French, and Mandarin), read passages about Italy, Japan, and the Netherlands, so the words point to one place while the accent points to another. Asked to identify the accent, three of the four systems mostly name the country in the script. Qwen3.5 Omni Plus is the partial exception.

Figure 4. Distribution of perceived-accent labels by voice. Most realtime systems cluster on the script-coded country. Qwen3.5 Omni Plus recovers the voice's true accent for several voices.

Age

Older adult voices read a child's script. Asked for the speaker's age, most systems return a child's age, following the script rather than the voice. Gemini 3.1 Flash Live does so least often.

Figure 5. Estimated-age midpoints by case and model. GPT Realtime 2's and Qwen3.5 Omni Plus's estimates fall in early childhood, driven by the script. Gemini 3.1 Flash Live's do so on five of the eight recordings but track the mature voice on the other three. Qwen3.5 Omni Flash's do so on seven of eight.
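An estimated-age midpoint can be derived from a free-form answer in several ways. Below is one possible sketch, assuming the system answers with either a single age or a range such as "6 to 8". The parsing rule and the function name `age_midpoint` are assumptions made for illustration, not the procedure used in the study.

```python
import re


def age_midpoint(answer):
    """Return the midpoint of the age or age range in `answer`, or None.

    Assumes the first one or two integers in the text are the age bounds.
    """
    numbers = [int(n) for n in re.findall(r"\d+", answer)]
    if not numbers:
        return None
    bounds = numbers[:2]  # a single age yields identical low and high bounds
    return (min(bounds) + max(bounds)) / 2


print(age_midpoint("around 6 to 8 years old"))  # 7.0
print(age_midpoint("65"))                       # 65.0
print(age_midpoint("I cannot tell"))            # None
```

Answers that spell ages out in words ("about seven") would return None under this rule and need separate handling.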

Discussion

All four systems determine their actions primarily based on the words and not the delivery, and their decisions agree in 119 of the 120 runs across providers and capability tiers. For three of the four, the delivery is perceived but ignored at the point of action. Because they act on the wording, the conflict leaves no trace in the transcript, so a transcript-only evaluation would miss the failure. We recommend that real-time voice AI systems be deployed with caution until they close the emotional intelligence gap.

Audio

The clips here are a subset of our recordings, provided solely to document this research. Please do not reuse them to train, evaluate, benchmark, or otherwise build machine-learning or AI systems. The full set is on Hugging Face.

Citation

@misc{bartelds2026realtimevoiceaihears,
      title={Real-Time Voice AI Hears but Does Not Listen}, 
      author={Martijn Bartelds and Federico Bianchi and James Zou},
      year={2026},
      eprint={2606.26083},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.26083}, 
}