Burner Phone: Multimodal Android Agent with Gaze-Based Wake Detection

Current

An always-on Android agent that continuously streams two modalities, gaze via the front camera and speech via the microphone, and activates conversation only when the user both looks at and speaks to the phone. No wake word is required.

Signal

Burner Phone · SouthpawIN · 2026

Context

Burner Phone turns any Android device (physical phone, Termux install, or emulator) into a continuously active AI listening agent. Unlike traditional voice assistants that rely on fixed wake words ("Hey Siri", "OK Google"), it detects human attention from a multimodal signal: the front camera streams at 15 fps to a Qwen2.5-Omni 3.5B model while the microphone captures audio chunks, and the model evaluates both streams jointly to produce an attention signal such as {addressing: true, looking: true, speaking: true, confidence: 0.94}. Conversation activates only when gaze and speech align. TTS uses Soprano 80M running on CPU.
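The activation rule described above can be sketched as a small gate over the attention signal. This is a minimal illustration, not the project's code: the field names come from the example signal, but the JSON wire format, the `parse_signal` and `should_activate` helpers, and the 0.8 confidence threshold are assumptions made for the sketch.

```python
import json
from dataclasses import dataclass

# Hypothetical cutoff; the source does not state a confidence threshold.
CONFIDENCE_THRESHOLD = 0.8


@dataclass(frozen=True)
class AttentionSignal:
    """One joint gaze+speech evaluation, using the field names from the example."""
    addressing: bool
    looking: bool
    speaking: bool
    confidence: float


def parse_signal(raw: str) -> AttentionSignal:
    """Parse a signal, assuming the model output is serialized as JSON."""
    data = json.loads(raw)
    return AttentionSignal(
        addressing=bool(data["addressing"]),
        looking=bool(data["looking"]),
        speaking=bool(data["speaking"]),
        confidence=float(data["confidence"]),
    )


def should_activate(signal: AttentionSignal,
                    threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Activate only when gaze and speech align and confidence is high enough."""
    return bool(
        signal.addressing
        and signal.looking
        and signal.speaking
        and signal.confidence >= threshold
    )


if __name__ == "__main__":
    user = parse_signal(
        '{"addressing": true, "looking": true, "speaking": true, "confidence": 0.94}'
    )
    # Speech with no gaze, e.g. a TV playing in the room.
    tv = parse_signal(
        '{"addressing": false, "looking": false, "speaking": true, "confidence": 0.91}'
    )
    print(should_activate(user), should_activate(tv))
```

Requiring all three flags plus a confidence floor is one way to express "gaze and speech align"; a real deployment might also smooth the decision over several consecutive frames.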

Relevance

This project extends the "always-on local agent" pattern to physical companion devices. Gaze-based wake detection replaces brittle keyword spotting with an attention-aware trigger, which is critical for a 24/7 listening device that must filter out ambient noise, TV, and other household conversation. Its integration as a skill into the Hermes agent ecosystem (built by Nous Research) also illustrates a growing pattern: agent skills composing higher-order behaviors from hardware-level sensing.

Current State

The project is at v2.0 with documented metrics: ~500 ms attention-to-activation latency, 200 ms/second TTS generation, ~8%/hour battery drain in continuous streaming mode, and ~50 KB/s network usage for compressed media chunks. It supports both physical Android devices and emulators, as well as multi-device management across Tailscale networks. It was originally built as a Hermes Hackathon project. The source is available on GitHub under the Apache 2.0 license.
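To put the battery and network figures in context, the arithmetic below converts them into runtime and data-volume estimates. It is a rough sketch that assumes the documented rates hold constant, that drain is linear from a full charge, and that KB, MB, and GB are decimal units; the function names are illustrative.

```python
# Figures taken from the documented metrics above.
BATTERY_DRAIN_PCT_PER_HOUR = 8.0
NETWORK_KB_PER_SECOND = 50.0

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400


def hours_on_full_charge(drain_pct_per_hour: float) -> float:
    """Hours of continuous streaming from 100%, assuming linear drain."""
    return 100.0 / drain_pct_per_hour


def network_mb_per_hour(kb_per_second: float) -> float:
    """Data sent per hour in decimal megabytes (1 MB = 1000 KB)."""
    return kb_per_second * SECONDS_PER_HOUR / 1000


def network_gb_per_day(kb_per_second: float) -> float:
    """Data sent per day in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return kb_per_second * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    print(f"{hours_on_full_charge(BATTERY_DRAIN_PCT_PER_HOUR):.1f} h per charge")
    print(f"{network_mb_per_hour(NETWORK_KB_PER_SECOND):.0f} MB per hour")
    print(f"{network_gb_per_day(NETWORK_KB_PER_SECOND):.2f} GB per day")
```

Under these assumptions a device lasts about 12.5 hours unplugged and sends roughly 180 MB per hour, or about 4.32 GB per day, which suggests that true 24/7 operation needs wall power and an unmetered connection.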

Open Questions

  • Does the gaze+speech detection model run fully locally on the Android device, or is it streaming inference to a connected cloud/gateway?
  • How does it handle privacy — where is camera and microphone data processed, and what is the retention model?
  • Does it expose MCP-compatible tooling for other agent frameworks beyond Hermes?
  • What is the scope of its phone automation capabilities (contacts, messages, notifications, system settings)?

Connections

Connects to agent skill composition patterns, always-on local inference devices, and multimodal wake-detection infrastructure. Related to hermes-agent as its primary integration target, mimika-studio for TTS on local hardware, and dimensionalos for physical-device agent integration.

Connections

  • Hermes Agent - integrates with Hermes agent as a skill (Current · en)

Score

Score derives from linkage, recency, and abstract depth; at-risk merely suggests erosion and does not indicate retirement.

Mediation note

Tooling: Qwen2.5-Omni 3.5B (multimodal model), Soprano 80M (TTS on CPU)

Use: read the GitHub repo README for project scope and technical details; assessed linkage to multimodal agent and always-on device patterns

Human role: review, edit, and approve before publication

Limits: signal content may be incomplete; verify primary sources before publishing