X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang†*, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu*

Multi-X Team, OPPO AI Center

†Project leader. *Corresponding authors.
System architecture overview — the X-OmniClaw Local Engine: Omni Perception, Omni Memory, and Omni Action.

Abstract

Edge-native on Android: Omni Perception for multimodal input, Omni Memory for continuity and personalization, and Omni Action for on-device execution with cloud LLM reasoning, closing the loop from intent to action.

Inspired by the development of OpenClaw, demand is growing for mobile-based personal agents capable of handling complex, intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual context, and speech input, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory enhances personalized intelligence by coupling runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction; through behavior cloning and trajectory replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively improves interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

Demos

📷 Demo A1 — Camera-informed execution
User: “How much is this bottle of water on Taobao?”
Behavior:
• Camera + voice to infer intent
• Jump to target app search (e.g. Taobao)
• Scroll results, capture prices/volumes
Video: camera-based item recognition.

📺 Demo A2 — ScreenAvatar execution / screen companion
User: “Let’s start the exercises.”
Behavior:
• Follow the active screen as the primary context
• Push-to-talk + screen understanding
• Multi-step execution with live feedback
Video: screen companion, multi-step auto execution.

✂️ Demo B — Memory-based one-tap video
User: “Find parrot-themed photos and make a one-tap video.”
Behavior:
• Build a searchable memory index; filter by “parrot”
• Stage picks into a temp album (e.g. A_latest)
• Jump to CapCut one-tap flow, batch select, export/share
Video: theme search, one-tap video.

📦 Demo C — Instant portal to a Meituan flash-sale page (behavior cloning)
User: “Open Meituan flash deals.”
Behavior:
• Record once → reusable bookmark/skill
• Later: one sentence to target page
• Fallback if launch fails
Video: record once, one-shot navigation.

Core capabilities

X-OmniClaw is an omni-modal mobile agent that unifies smartphone interaction across three pillars: perception as multimodal ingress, memory for continuity and personalization, and action for robust execution and reusable skills.

Omni Perception

As in the report (§3), perception is the system’s multimodal ingress: unified entry, integrated sensing, and scene-grounded intent before execution.

Omni Perception — technical report figure

Multimodal entry and unified ingress. X-OmniClaw consolidates diverse inputs—direct UI triggers, floating widgets, microphone input, scheduled tasks, and external gateways—into one pipeline. For recurring on-device tasks, Android AlarmManager provides a system-level wake-up path so scheduled triggers merge back into the same entry semantics.
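A minimal sketch of this wake-up path follows. AlarmManager and PendingIntent are real Android APIs; `AgentIngress`, `Trigger`, and `ScheduledTriggerReceiver` are hypothetical stand-ins for the report's unpublished entry types.

```kotlin
import android.app.AlarmManager
import android.app.PendingIntent
import android.content.BroadcastReceiver
import android.content.Context
import android.content.Intent

// Stand-ins for the unified entry point (assumed, not the report's types).
sealed interface Trigger { data class Scheduled(val taskId: String) : Trigger }
object AgentIngress {
    fun submit(context: Context, trigger: Trigger) { /* hand off to the agent loop */ }
}

// Receiver that re-injects a scheduled wake-up into the same ingress
// pipeline used by UI taps, widgets, and voice.
class ScheduledTriggerReceiver : BroadcastReceiver() {
    override fun onReceive(context: Context, intent: Intent) {
        val taskId = intent.getStringExtra("task_id") ?: return
        AgentIngress.submit(context, Trigger.Scheduled(taskId))
    }
}

fun scheduleRecurringTask(context: Context, taskId: String, intervalMs: Long) {
    val alarmManager = context.getSystemService(Context.ALARM_SERVICE) as AlarmManager
    val intent = Intent(context, ScheduledTriggerReceiver::class.java)
        .putExtra("task_id", taskId)
    val pending = PendingIntent.getBroadcast(
        context, taskId.hashCode(), intent,
        PendingIntent.FLAG_UPDATE_CURRENT or PendingIntent.FLAG_IMMUTABLE
    )
    // Inexact repeating alarms are battery-friendly; exact wake-ups would
    // need the SCHEDULE_EXACT_ALARM permission on Android 12+.
    alarmManager.setInexactRepeating(
        AlarmManager.RTC_WAKEUP,
        System.currentTimeMillis() + intervalMs,
        intervalMs,
        pending
    )
}
```

The receiver would also need a manifest declaration; the point is only that a scheduled wake-up re-enters the same `submit` path as any other trigger.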

Integrated multimodal perception. The phone is modeled as a first-person multimodal system over on-screen UI, real-world camera context, and speech. Camera and screen projection supply visual evidence; ASR transcribes speech in real time; on-device AEC mitigates playback echo. A decoupled streaming pipeline buffers visual history, and a temporal alignment module aligns speech and video via timestamps.
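A sketch of the timestamp alignment, assuming a bounded ring buffer of recent frames and ASR segments carrying start/end times; all types here are illustrative, not the report's data structures.

```kotlin
import kotlin.math.abs

// Illustrative timestamped records for the decoupled streaming pipeline.
data class Frame(val tMs: Long, val jpeg: ByteArray)
data class AsrSegment(val startMs: Long, val endMs: Long, val text: String)

// Bounded buffer of recent visual history, decoupled from the speech stream.
class FrameBuffer(private val capacity: Int = 64) {
    private val frames = ArrayDeque<Frame>()

    fun push(frame: Frame) {
        if (frames.size == capacity) frames.removeFirst()
        frames.addLast(frame)
    }

    // Frames whose timestamps fall inside the speech segment; if none do
    // (e.g. at a low frame rate), fall back to the nearest frame.
    fun alignedWith(seg: AsrSegment): List<Frame> {
        val inside = frames.filter { it.tMs in seg.startMs..seg.endMs }
        if (inside.isNotEmpty()) return inside
        return frames.minByOrNull { abs(it.tMs - seg.endMs) }
            ?.let { listOf(it) } ?: emptyList()
    }
}
```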

Scene-grounded intent understanding. A VLM interprets the scene with the user query, expanding raw input into intent. Answerable questions return immediately; otherwise the structured intent is handed to the downstream agent loop.
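The routing step can be pictured as follows; the VLM interface and field names are assumptions rather than the report's API.

```kotlin
// Assumed VLM output shape: either the scene + query suffice for a
// direct reply, or the model emits a structured intent.
data class VlmOutput(
    val isAnswerable: Boolean,
    val answer: String,
    val goal: String,
    val slots: Map<String, String>
)

interface Vlm { fun interpret(query: String, frames: List<ByteArray>): VlmOutput }

sealed interface PerceptionResult {
    data class DirectAnswer(val text: String) : PerceptionResult
    data class StructuredIntent(val goal: String, val slots: Map<String, String>) : PerceptionResult
}

fun understand(vlm: Vlm, query: String, frames: List<ByteArray>): PerceptionResult {
    val out = vlm.interpret(query, frames)
    return if (out.isAnswerable) PerceptionResult.DirectAnswer(out.answer)
    else PerceptionResult.StructuredIntent(out.goal, out.slots) // to the agent loop
}
```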

Omni Memory

As in the report (§4), memory couples working context with long-term personal knowledge and Skill–Tool workflows.

Omni Memory — technical report figure

Working memory and long-term user memory. Working memory preserves multimodal runtime context across turns, foreground changes, and app switches—screenshots, distilled observations, and execution state—so tasks can resume without losing place. Long-term memory distills device-resident personal data into persistent artifacts and user-profile representations injected into reasoning.
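A minimal sketch of such a bounded working memory, with hypothetical record fields; the report names screenshots, distilled observations, and execution state but not concrete types.

```kotlin
// Hypothetical per-turn record of runtime context.
data class TurnRecord(
    val turn: Int,
    val screenshotPath: String?,  // raw evidence, kept on device
    val observation: String,      // distilled summary injected into prompts
    val executionState: String    // e.g. "step 3/5: waiting for search results"
)

class WorkingMemory(private val maxTurns: Int = 20) {
    private val records = ArrayDeque<TurnRecord>()

    fun append(record: TurnRecord) {
        if (records.size == maxTurns) records.removeFirst() // bounded context
        records.addLast(record)
    }

    // Survives foreground changes and app switches: the agent resumes
    // from the last recorded execution state instead of restarting.
    fun resumePoint(): TurnRecord? = records.lastOrNull()
}
```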

Gallery and semantic records. Gallery photos become compact semantic records (objects, scenes, events) to support grounded QA, retrieval, and automation.
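For illustration, a semantic record and a naive keyword retrieval might look like this; the report does not specify the index structure, and a production system could use embeddings instead.

```kotlin
// Hypothetical compact semantic record for one gallery photo.
data class PhotoRecord(
    val uri: String,
    val objects: List<String>,  // e.g. ["parrot", "cage"]
    val scene: String,          // e.g. "indoor"
    val event: String?,         // e.g. "birthday"
    val takenAtMs: Long
)

// Naive keyword retrieval over the on-device index.
fun retrieve(index: List<PhotoRecord>, keyword: String): List<PhotoRecord> =
    index.filter { rec ->
        rec.objects.any { it.contains(keyword, ignoreCase = true) } ||
        rec.scene.contains(keyword, ignoreCase = true) ||
        rec.event?.contains(keyword, ignoreCase = true) == true
    }
```

This is the kind of lookup that Demo B's “filter by parrot” step implies.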

How memory is built, used, and secured. Skills orchestrate maintenance vs. consumption; tools implement concrete steps. Image pipelines prefer multimodal summarization with metadata fallback. Production is separated from consumption; writes pass filtering/redaction; users control gallery memory and profile injection.
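A sketch of the guarded write path, with illustrative redaction patterns and an opt-in flag standing in for the user controls described above.

```kotlin
// Production (distillation) is separated from consumption (prompt
// injection); writes pass redaction first. Patterns are illustrative.
object MemoryGuard {
    private val piiPatterns = listOf(
        Regex("""\b\d{11}\b"""),    // phone-number-like digit runs
        Regex("""\b\d{15,19}\b""")  // card-number-like digit runs
    )

    fun redact(text: String): String =
        piiPatterns.fold(text) { acc, re -> re.replace(acc, "[REDACTED]") }
}

class LongTermMemoryStore(private val userOptedIn: Boolean) {
    private val entries = mutableListOf<String>()

    fun write(distilled: String) {
        if (!userOptedIn) return                 // user controls persistence
        entries += MemoryGuard.redact(distilled) // filter before storing
    }

    fun readForPrompt(): List<String> = entries.toList() // consumption side
}
```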

Omni Action

As in the report (§5), action closes the loop with hybrid UI evidence, an agent loop, and trajectory-based skills.

Omni Action — technical report figure

Omni Action in the app ecosystem. Each step follows observation, reasoning, and execution. The observation stack fuses multimodal interface evidence; the loop selects skills, retrieves memory, and returns the next action or a direct reply. Execution spans Android atomic actions and higher-level tools (filesystem, RAG, etc.).
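Schematically, the loop looks like this; the interfaces are placeholders for the observation stack, reasoner, and executor, and all names are assumptions.

```kotlin
sealed interface StepDecision {
    data class Act(val action: String, val target: String) : StepDecision
    data class Reply(val text: String) : StepDecision
}

interface Observer { fun observe(): String }  // fused multimodal UI evidence
interface Reasoner { fun decide(obs: String, memory: List<String>): StepDecision }
interface Executor { fun perform(action: String, target: String) }

fun runAgentLoop(
    observer: Observer, reasoner: Reasoner, executor: Executor,
    memory: List<String>, maxSteps: Int = 15
): String {
    repeat(maxSteps) {
        when (val decision = reasoner.decide(observer.observe(), memory)) {
            is StepDecision.Reply -> return decision.text  // direct answer
            is StepDecision.Act -> executor.perform(decision.action, decision.target)
        }
    }
    return "Stopped after $maxSteps steps without completion."
}
```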

Hybrid UI understanding. XML, on-device grounding, and OCR localize targets: structure when reliable, vision and text when cues are weak or cluttered—especially under ads and dense layouts.
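A sketch of that fallback order, with placeholder interfaces for the XML tree, visual grounder, and OCR engine; none of these types are the report's.

```kotlin
data class Bounds(val left: Int, val top: Int, val right: Int, val bottom: Int)

interface XmlTree {
    val isTrustworthy: Boolean  // e.g. hierarchy fresh and not a WebView shell
    fun findByText(text: String): Bounds?
}
interface VisualGrounder { fun locate(screenshot: ByteArray, query: String): Bounds? }
interface OcrEngine { fun find(screenshot: ByteArray, text: String): Bounds? }

fun ground(
    target: String, xml: XmlTree, screenshot: ByteArray,
    grounder: VisualGrounder, ocr: OcrEngine
): Bounds? {
    // Structure first, when reliable; vision and OCR text when cues are
    // weak or cluttered (ads, dense custom-drawn layouts).
    if (xml.isTrustworthy) xml.findByText(target)?.let { return it }
    grounder.locate(screenshot, target)?.let { return it }
    return ocr.find(screenshot, target)
}
```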

Trajectory-cloned execution. Behavior cloning records UI-layer navigation into named skills; dumpsys-based introspection extracts deeplink/intent shortcuts. Trajectory replay recovers target “addresses” for fast re-entry, with fallbacks when the UI drifts.
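A minimal capture-and-replay sketch: `dumpsys activity top` is a real shell command that reports the foreground activity (it requires shell or system privileges), while the skill types and fallback policy here are assumptions.

```kotlin
data class Step(val action: String, val target: String)
data class Skill(val name: String, val component: String?, val steps: List<Step>)

// Introspect the foreground activity so a recorded trajectory's endpoint
// can be re-entered directly (e.g. via `am start -n <component>`).
fun foregroundComponent(): String? {
    val output = Runtime.getRuntime()
        .exec(arrayOf("sh", "-c", "dumpsys activity top | grep ACTIVITY"))
        .inputStream.bufferedReader().readText()
    return Regex("""ACTIVITY (\S+)""").find(output)?.groupValues?.get(1)
}

// Fast re-entry via the recovered "address"; if the direct launch fails
// (UI or app version drifted), fall back to replaying the recorded steps.
fun replay(skill: Skill, launch: (String) -> Boolean, perform: (Step) -> Unit) {
    val jumped = skill.component?.let(launch) ?: false
    if (!jumped) skill.steps.forEach(perform)
}
```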


Conclusion and Future Work

This report presented X-OmniClaw, an edge-native omni-modal mobile agent for Android that treats the smartphone as a unified substrate for perception, memory, and action. Across the system design, we argued that mobile agency should not be reduced to isolated screenshot-based automation or cloud-hosted remote control. Instead, the phone itself can serve as a first-person computational interface that continuously integrates on-screen UI state, real-world context, speech input, and personalized history into a single executable loop. Building on this view, Omni Perception provides unified ingress and scene-grounded intent understanding, Omni Memory maintains runtime continuity while distilling multimodal device-resident data into persistent personal knowledge, and Omni Action closes the loop by mapping these signals to robust execution, leveraging hybrid UI understanding and behavior cloning to transform high-level goals into reusable, executable skills. The demo scenarios further show how these components come together in practice, enabling real-world copilot assistance, proactive personalized services, and trajectory-cloned execution.

Looking ahead, the evolution of X-OmniClaw focuses on three strategic pillars to further enhance system intelligence and efficiency. First, we aim to incorporate a self-evolving mechanism that iteratively refines execution trajectories, distilling complex reasoning chains into compact representations to minimize token consumption and response latency. Second, the architecture is transitioning toward dynamic memory evolution, implementing semantic consolidation and selective forgetting to ensure the user profile remains relevant and high-quality over time. Finally, we are advancing a device–cloud synergy that prioritizes the privacy-preserving and lightweight advantages of on-device processing for daily tasks, while selectively offloading intensive open-domain reasoning to cloud-based LLMs via secure, intent-aware gateways. Together, these advancements ensure a more resource-efficient, private, and continuously improving intelligent agent experience.

To support open research and user-steerable development, we will release all of our code, assets, and related materials as open source, and we will continue to update the project as the system evolves.

BibTeX

@misc{ren2026xomniclaw,
  title={{X-OmniClaw} Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction},
  author={Ren, Xiaoming and Zhen, Ru and Li, Chao and Song, Yang and Hou, Qiuxia and Zhang, Yanhao and Liu, Peng and Qi, Qi and Zheng, Quanlong and Wu, Qi and Liao, Zhenyi and Pan, Binqiang and Ji, Haobo and Lu, Haonan},
  year={2026},
  eprint={2605.05765},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2605.05765}
}