
PixelLM: First Efficient Pixel-Level Reasoning Model Without SAM

Author: Adrian · September 26, 2025

Overview

Multimodal large models are expanding into fine-grained tasks such as image editing, autonomous driving, and robotics. However, most models remain focused on generating text descriptions of entire images or specific regions, and their pixel-level understanding capabilities, such as object segmentation, are relatively limited.

Limitations of Existing Approaches

Some recent work has explored using multimodal large models to handle user segmentation instructions (for example, "segment the fruits in the image that are rich in vitamin C"). Current methods suffer from two main drawbacks:

  • Inability to handle multiple target objects, which is essential in real-world scenarios.
  • Dependence on pretrained segmentation models such as SAM; a single forward pass of SAM costs roughly as much computation as generating over 500 tokens with Llama-7B (a back-of-envelope estimate is sketched below).
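
To give a sense of scale, the following back-of-envelope sketch compares the cost of one SAM image-encoder pass with the per-token decoding cost of Llama-7B. The parameter counts (ViT-H at roughly 632M parameters, Llama-7B at roughly 6.7B) and the simple 2-FLOPs-per-parameter-per-token rule are assumptions for illustration only; the exact ratio depends on implementation details.

```python
# Back-of-envelope comparison of SAM's image-encoder cost with Llama-7B decoding cost.
# Parameter counts and the "2 FLOPs per parameter per token" rule are rough
# assumptions used only to illustrate the order of magnitude of the claim above.

SAM_VIT_H_PARAMS = 632e6    # approx. parameters in SAM's ViT-H image encoder
SAM_IMAGE_TOKENS = 64 * 64  # a 1024x1024 input with 16x16 patches -> 4096 tokens
LLAMA_7B_PARAMS = 6.7e9     # approx. parameters in Llama-7B

sam_encoder_flops = 2 * SAM_VIT_H_PARAMS * SAM_IMAGE_TOKENS   # one dense forward pass
llama_flops_per_token = 2 * LLAMA_7B_PARAMS                   # one decoded token

equivalent_tokens = sam_encoder_flops / llama_flops_per_token
print(f"One SAM encoder pass ~= {equivalent_tokens:.0f} Llama-7B tokens")
# Prints a few hundred tokens under these simplified assumptions; attention FLOPs
# and SAM's prompt/mask decoder, which are ignored here, push the figure higher.
```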

PixelLM

To address these issues, researchers from ByteDance's Smart Creation team, Beijing Jiaotong University, and the University of Science and Technology Beijing proposed PixelLM, the first efficient pixel-level reasoning large model that does not rely on SAM.

Compared with prior work, PixelLM offers:

  • The ability to handle an arbitrary number of open-domain targets and diverse, complex reasoning segmentation tasks (a hypothetical interface sketch follows this list).
  • Avoidance of additional, costly segmentation models, improving efficiency and transferability across applications.
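
To make the first point concrete, the snippet below sketches what a multi-target call could look like at the interface level. It is a purely hypothetical stand-in, not PixelLM's actual API: the function name, arguments, and return format are all illustrative assumptions.

```python
# Purely hypothetical interface sketch, not PixelLM's real API.
# It illustrates returning a variable number of masks, one per referenced object.
import numpy as np

def segment_by_instruction(image: np.ndarray, instruction: str) -> dict:
    """Stand-in model call: returns one binary mask per object named in the answer."""
    h, w = image.shape[:2]
    # Dummy result: two empty masks, each tagged with the phrase it corresponds to.
    return {
        "phrases": ["the orange", "the kiwi"],
        "masks": [np.zeros((h, w), dtype=bool), np.zeros((h, w), dtype=bool)],
    }

image = np.zeros((512, 512, 3), dtype=np.uint8)
result = segment_by_instruction(
    image, "segment the fruits in the image that are rich in vitamin C"
)
print(f"{len(result['masks'])} masks returned, one per target phrase")
```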

Dataset for Multi-Object Reasoning Segmentation

To support model training and evaluation in this research area, the team built the MUSE dataset on top of the LVIS dataset using GPT-4V. MUSE contains over 200,000 question-answer pairs and more than 900,000 instance segmentation masks.
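
For context, a MUSE-style sample pairs one free-form question and answer with several instance masks drawn from LVIS categories. The schema below is a hypothetical illustration of that structure; the field names and the example content are assumptions, not the released MUSE format.

```python
# Hypothetical sketch of a multi-target MUSE-style sample.
# Field names are illustrative assumptions, not the released MUSE schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TargetInstance:
    phrase: str          # noun phrase in the answer that refers to this object
    lvis_category: str   # LVIS category the instance was drawn from
    mask_rle: str        # instance segmentation mask, e.g. COCO-style RLE encoding

@dataclass
class MuseSample:
    image_id: str
    question: str                                   # free-form, possibly reasoning-heavy question
    answer: str                                      # answer whose phrases map to the masks below
    targets: List[TargetInstance] = field(default_factory=list)

sample = MuseSample(
    image_id="000000123456",
    question="Which items on the table could I use to make a salad?",
    answer="You could use the lettuce, the tomatoes, and the cucumber.",
    targets=[
        TargetInstance("the lettuce", "lettuce", "<rle>"),
        TargetInstance("the tomatoes", "tomato", "<rle>"),
        TargetInstance("the cucumber", "cucumber", "<rle>"),
    ],
)
print(f"{len(sample.targets)} target masks for one question-answer pair")
```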