Phi-4-reasoning-vision
Open-weight 15B multimodal model for thinking and GUI agents

Our Take
Microsoft just dropped Phi-4-reasoning-vision, a 15-billion-parameter open-weight multimodal model built for thinking and GUI agents. Microsoft researchers Emad Ibrahim, Piroune Balachandran, and Zac Zuo built this beast to process both text and images while handling complex reasoning tasks that most models choke on. It's open-weight, meaning developers can actually download, inspect, and fine-tune it themselves instead of praying to the closed-source gods.
Most multimodal models are either good at understanding images or good at reasoning, not both. Phi-4-reasoning-vision tries to crack that by combining visual understanding with advanced chain-of-thought reasoning in a single 15B package. For developers building AI agents that need to see screens, read documents, and make decisions, this is exactly the kind of open foundation you'd want to build on. It's not a consumer product you "use"; it's infrastructure. The kind of thing that shows Microsoft still knows how to build real AI, not just wrap ChatGPT in a different skin.
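To make the "agents that see screens" idea concrete, here's a minimal sketch of the kind of multimodal prompt such an agent would send to a vision-language model. This assumes an OpenAI-style message schema, which many open-weight VLM runtimes accept; the exact format Phi-4-reasoning-vision expects may differ, so treat the field names and the screenshot URL as placeholders.

```python
# Minimal sketch of a multimodal chat payload for an open-weight VLM agent.
# The message schema below is an assumption (OpenAI-style); check the model's
# actual chat template before sending it to a real runtime.

def build_vision_prompt(image_url: str, question: str) -> list[dict]:
    """Build a single user turn that pairs a screenshot with a text question,
    the basic unit of a GUI-agent loop (observe screen, ask, act)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]

# Hypothetical usage: the agent captures a screenshot, then asks the model
# to reason about what to click next.
messages = build_vision_prompt(
    "https://example.com/screenshot.png",
    "Which button submits the form? Think step by step.",
)
```

In a real agent you'd hand `messages` to the model's processor or an inference server and parse the reasoning trace out of the reply; the payload-building step stays the same either way.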
Similar products worth knowing

Cardboard
Cursor for video editing.

Copperlane
Agents for Mortgage Origination

MochaCare
AI-Supercharged Humans for Home Care Agency Growth

Didit v3
The all-in-one Identity platform