
SAM: zero-shot image segmentation from point, box, and mask prompts.

Skill metadata

| | |
|---|---|
| Source | Bundled (installed by default) |
| Path | skills/mlops/models/segment-anything |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | segment-anything, transformers>=4.30.0, torch>=1.7.0 |
| Tags | Multimodal, Image Segmentation, Computer Vision, SAM, Zero-Shot |

Reference: full SKILL.md

info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Segment Anything Model (SAM)

A complete guide to using Meta AI's Segment Anything Model for zero-shot image segmentation.

When to use SAM

Use SAM when:

* You need to segment arbitrary objects in images without task-specific training
* You are building interactive annotation tools with point/box prompts
* You are generating training data for other computer vision models
* You need zero-shot transfer to new image domains
* You are building object detection/segmentation pipelines
* You are processing medical, satellite, or other specialized imagery

Key features:

* Zero-shot segmentation: works on any image domain without fine-tuning
* Flexible prompts: points, bounding boxes, or previous masks
* Automatic segmentation: generates masks for every object in an image
* High quality: trained on 1.1 billion masks from 11 million images
* Multiple model sizes: ViT-B (fastest), ViT-L, ViT-H (most accurate)
* ONNX export: deploy in browsers and on edge devices

Alternatives:

* YOLO/Detectron2: real-time object detection with class labels
* Mask2Former: semantic/panoptic segmentation with categories
* GroundingDINO + SAM: text-prompted segmentation
* SAM 2: video segmentation

Quick start

Installation

[code] # From GitHub
pip install git+https://github.com/facebookresearch/segment-anything.git

# Optional dependencies  
pip install opencv-python pycocotools matplotlib

# Or use HuggingFace transformers  
pip install transformers

[/code]

Download checkpoints

[code] # ViT-H (largest, most accurate) - 2.4GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# ViT-L (medium) - 1.2GB  
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth

# ViT-B (smallest, fastest) - 375MB  
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth

[/code]

Basic usage with SamPredictor

[code] import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load model  
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  
sam.to(device="cuda")

# Create predictor  
predictor = SamPredictor(sam)

# Set image (computes embeddings once)  
image = cv2.imread("image.jpg")  
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  
predictor.set_image(image)

# Predict with point prompts  
input_point = np.array([[500, 375]])  # (x, y) coordinates  
input_label = np.array([1])  # 1 = foreground, 0 = background

masks, scores, logits = predictor.predict(  
    point_coords=input_point,  
    point_labels=input_label,  
    multimask_output=True  # Returns 3 mask options  
)

# Select best mask  
best_mask = masks[np.argmax(scores)]

[/code]
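
To sanity-check the prediction, the selected mask can be blended over the image. A short illustrative snippet (matplotlib is not part of the SAM API):

[code] # Overlay the best mask and the prompt point (illustrative)
import matplotlib.pyplot as plt

plt.imshow(image)
plt.imshow(best_mask, alpha=0.5, cmap="jet")
plt.scatter(*input_point[0], color="red", marker="*", s=200)
plt.axis("off")
plt.show()

[/code]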

HuggingFace Transformers

[code] import torch
from PIL import Image
from transformers import SamModel, SamProcessor

# Load model and processor  
model = SamModel.from_pretrained("facebook/sam-vit-huge")  
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")  
model.to("cuda")

# Process image with point prompt  
image = Image.open("image.jpg").convert("RGB")  
input_points = [[[450, 600]]]  # Batch of points

inputs = processor(image, input_points=input_points, return_tensors="pt")  
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate masks  
with torch.no_grad():  
    outputs = model(**inputs)

# Post-process masks to original size  
masks = processor.image_processor.post_process_masks(  
    outputs.pred_masks.cpu(),  
    inputs["original_sizes"].cpu(),  
    inputs["reshaped_input_sizes"].cpu()  
)

[/code]
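
The model also returns per-mask quality estimates in outputs.iou_scores; a minimal sketch (continuing the variables above) for keeping only the highest-scoring of the three candidate masks:

[code] # Pick the candidate mask with the highest predicted IoU
iou_scores = outputs.iou_scores.cpu()        # shape: (batch, num_prompts, 3)
best_idx = iou_scores[0, 0].argmax().item()
best_mask = masks[0][0, best_idx].numpy()    # boolean H×W array

[/code]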

Core concepts

Model architecture

[code] SAM Architecture:

┌────────────────┐  Image Embeddings   ┌────────────────┐
│ Image Encoder  │   (computed once)   │                │
│     (ViT)      │────────────────────▶│  Mask Decoder  │──▶ Masks + IoU
└────────────────┘                     │ (Transformer)  │    predictions
┌────────────────┐  Prompt Embeddings  │                │
│ Prompt Encoder │    (per prompt)     │                │
│ (Points/Boxes) │────────────────────▶└────────────────┘
└────────────────┘

[/code]

Model variants

| Model | Checkpoint | Size | Speed | Accuracy |
|---|---|---|---|---|
| ViT-H | vit_h | 2.4 GB | Slowest | Best |
| ViT-L | vit_l | 1.2 GB | Medium | Good |
| ViT-B | vit_b | 375 MB | Fastest | Good |

Prompt types

| Prompt | Description | Use case |
|---|---|---|
| Point (foreground) | Click on the object | Selecting a single object |
| Point (background) | Click outside the object | Excluding regions |
| Bounding box | Rectangle around the object | Large objects |
| Previous mask | Low-resolution mask | Iterative refinement |

Interactive segmentation

Point prompts

[code] # Single foreground point
input_point = np.array([[500, 375]])
input_label = np.array([1])
masks, scores, logits = predictor.predict(  
    point_coords=input_point,  
    point_labels=input_label,  
    multimask_output=True  
)

# Multiple points (foreground + background)  
input_points = np.array([[500, 375], [600, 400], [450, 300]])  
input_labels = np.array([1, 1, 0])  # 2 foreground, 1 background

masks, scores, logits = predictor.predict(  
    point_coords=input_points,  
    point_labels=input_labels,  
    multimask_output=False  # Single mask when prompts are clear  
)

[/code]

Box prompts

[code] # Bounding box [x1, y1, x2, y2]
input_box = np.array([425, 600, 700, 875])

masks, scores, logits = predictor.predict(  
    box=input_box,  
    multimask_output=False  
)

[/code]

Combined prompts

[code] # Box + points for precise control
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    box=np.array([400, 300, 700, 600]),
    multimask_output=False
)

[/code]

Iterative refinement

[code] # Initial prediction
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True
)

# Refine with additional point using previous mask  
masks, scores, logits = predictor.predict(  
    point_coords=np.array([[500, 375], [550, 400]]),  
    point_labels=np.array([1, 0]),  # Add background point  
    mask_input=logits[np.argmax(scores)][None, :, :],  # Use best mask  
    multimask_output=False  
)

[/code]

Automatic mask generation

Basic automatic segmentation

[code] from segment_anything import SamAutomaticMaskGenerator

# Create generator  
mask_generator = SamAutomaticMaskGenerator(sam)

# Generate all masks  
masks = mask_generator.generate(image)

# Each mask contains:  
# - segmentation: binary mask  
# - bbox: [x, y, w, h]  
# - area: pixel count  
# - predicted_iou: quality score  
# - stability_score: robustness score  
# - point_coords: generating point

[/code]
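
To inspect the result, the generated masks can be overlaid on the image in random colors. A minimal sketch (show_masks is an illustrative helper, not part of the library):

[code] import matplotlib.pyplot as plt

def show_masks(image, masks):
    """Overlay each mask in a random translucent color."""
    plt.imshow(image)
    overlay = np.zeros((*image.shape[:2], 4))
    for m in sorted(masks, key=lambda x: x["area"], reverse=True):
        overlay[m["segmentation"]] = np.concatenate([np.random.random(3), [0.5]])
    plt.imshow(overlay)
    plt.axis("off")
    plt.show()

show_masks(image, masks)

[/code]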

Customized generation

[code] mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=32,                # Grid density (more = more masks)
    pred_iou_thresh=0.88,              # Quality threshold
    stability_score_thresh=0.95,       # Stability threshold
    crop_n_layers=1,                   # Multi-scale crops
    crop_n_points_downscale_factor=2,
    min_mask_region_area=100,          # Remove tiny masks
)

masks = mask_generator.generate(image)

[/code]

Filtering masks

[code] # Sort by area (largest first)
masks = sorted(masks, key=lambda x: x['area'], reverse=True)

# Filter by predicted IoU  
high_quality = [m for m in masks if m['predicted_iou'] > 0.9]

# Filter by stability score  
stable_masks = [m for m in masks if m['stability_score'] > 0.95]

[/code]

Batched inference

Multiple images

[code] # Process multiple images efficiently
images = [cv2.imread(f"image_{i}.jpg") for i in range(10)]

all_masks = []
for image in images:
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # SAM expects RGB
    predictor.set_image(image)
    masks, _, _ = predictor.predict(  
        point_coords=np.array([[500, 375]]),  
        point_labels=np.array([1]),  
        multimask_output=True  
    )  
    all_masks.append(masks)

[/code]

Multiple prompts per image

[code] # Process multiple prompts efficiently (one image encoding)
predictor.set_image(image)

# Batch of point prompts  
points = [  
    np.array([[100, 100]]),  
    np.array([[200, 200]]),  
    np.array([[300, 300]])  
]

all_masks = []  
for point in points:  
    masks, scores, _ = predictor.predict(  
        point_coords=point,  
        point_labels=np.array([1]),  
        multimask_output=True  
    )  
    all_masks.append(masks[np.argmax(scores)])

[/code]
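
For many prompts on one image, the predictor also exposes a torch-level API that runs all prompts through the mask decoder in one batched call. A sketch using predict_torch (coordinates must first be mapped into the model's input frame):

[code] import torch

# One foreground point per object, batched: (B, N, 2) / (B, N)
point_coords = torch.tensor([[[100, 100]], [[200, 200]], [[300, 300]]],
                            dtype=torch.float, device=predictor.device)
point_labels = torch.ones(point_coords.shape[:2], dtype=torch.int,
                          device=predictor.device)

coords = predictor.transform.apply_coords_torch(point_coords, image.shape[:2])

masks, scores, _ = predictor.predict_torch(
    point_coords=coords,
    point_labels=point_labels,
    multimask_output=True,
)  # masks: (B, 3, H, W) boolean tensor

[/code]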

ONNX deployment

Export model

[code] python scripts/export_onnx_model.py \
--checkpoint sam_vit_h_4b8939.pth \
--model-type vit_h \
--output sam_onnx.onnx \
--return-single-mask

[/code]

Use ONNX model

[code] import onnxruntime

# Load ONNX model  
ort_session = onnxruntime.InferenceSession("sam_onnx.onnx")

# Run inference (image embeddings computed separately)  
masks = ort_session.run(  
    None,  
    {  
        "image_embeddings": image_embeddings,  
        "point_coords": point_coords,  
        "point_labels": point_labels,  
        "mask_input": np.zeros((1, 1, 256, 256), dtype=np.float32),  
        "has_mask_input": np.array([0], dtype=np.float32),  
        "orig_im_size": np.array([h, w], dtype=np.float32)  
    }  
)

[/code]
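
The ONNX graph covers only the prompt encoder and mask decoder; the image_embeddings, point coordinates, and original size fed above still come from the PyTorch side. A minimal sketch of producing them:

[code] # Produce ONNX inputs with the PyTorch image encoder
predictor = SamPredictor(sam)
predictor.set_image(image)  # image: RGB numpy array

image_embeddings = predictor.get_image_embedding().cpu().numpy()  # (1, 256, 64, 64)
h, w = image.shape[:2]

# Prompt coordinates must be mapped into the model's 1024px input frame
point_coords = np.array([[[500, 375]]], dtype=np.float32)
point_coords = predictor.transform.apply_coords(point_coords, (h, w)).astype(np.float32)
point_labels = np.array([[1]], dtype=np.float32)

[/code]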

Common workflows

Workflow 1: Annotation tool

[code] import cv2

# Load model  
predictor = SamPredictor(sam)  
predictor.set_image(image)

def on_click(event, x, y, flags, param):  
    if event == cv2.EVENT_LBUTTONDOWN:  
        # Foreground point  
        masks, scores, _ = predictor.predict(  
            point_coords=np.array([[x, y]]),  
            point_labels=np.array([1]),  
            multimask_output=True  
        )  
        # Display best mask  
        display_mask(masks[np.argmax(scores)])

[/code]
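
display_mask above is left to the application; one possible OpenCV implementation (a hypothetical helper, plus the window wiring the callback needs):

[code] def display_mask(mask, alpha=0.5):
    """Blend the boolean mask over the image in green."""
    overlay = image.copy()
    overlay[mask] = (0, 255, 0)
    blended = cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0)
    cv2.imshow("SAM annotation", blended)

cv2.namedWindow("SAM annotation")
cv2.setMouseCallback("SAM annotation", on_click)
cv2.imshow("SAM annotation", image)
cv2.waitKey(0)

[/code]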

Workflow 2: Object extraction

[code] def extract_object(image, point):
    """Extract the object at point with a transparent background."""
    predictor.set_image(image)

    masks, scores, _ = predictor.predict(  
        point_coords=np.array([point]),  
        point_labels=np.array([1]),  
        multimask_output=True  
    )

    best_mask = masks[np.argmax(scores)]

    # Create RGBA output  
    rgba = np.zeros((image.shape[0], image.shape[1], 4), dtype=np.uint8)  
    rgba[:, :, :3] = image  
    rgba[:, :, 3] = best_mask * 255

    return rgba

[/code]
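
Example usage, assuming image is the same RGB array passed to the predictor (PIL keeps the RGBA channel order on save):

[code] from PIL import Image

cutout = extract_object(image, (500, 375))
Image.fromarray(cutout, mode="RGBA").save("object.png")

[/code]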

Workflow 3: Medical image segmentation

[code] # Process medical images (grayscale to RGB)
medical_image = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
rgb_image = cv2.cvtColor(medical_image, cv2.COLOR_GRAY2RGB)

predictor.set_image(rgb_image)

# Segment region of interest  
masks, scores, _ = predictor.predict(  
    box=np.array([x1, y1, x2, y2]),  # ROI bounding box  
    multimask_output=True  
)

[/code]

Output format

Mask data structure

[code] # SamAutomaticMaskGenerator output
{
    "segmentation": np.ndarray,   # H×W binary mask
    "bbox": [x, y, w, h],         # Bounding box
    "area": int,                  # Pixel count
    "predicted_iou": float,       # 0-1 quality score
    "stability_score": float,     # 0-1 robustness score
    "crop_box": [x, y, w, h],     # Generation crop region
    "point_coords": [[x, y]],     # Input point
}

[/code]

COCO RLE format

[code] from pycocotools import mask as mask_utils

# Encode mask to RLE  
rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))  
rle["counts"] = rle["counts"].decode("utf-8")

# Decode RLE to mask  
decoded_mask = mask_utils.decode(rle)

[/code]
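
To persist an automatic-generation run, the RLE-encoded masks can be dumped to JSON. A sketch (the record schema here is illustrative, not an official COCO export):

[code] import json

records = []
for i, m in enumerate(masks):
    rle = mask_utils.encode(np.asfortranarray(m["segmentation"].astype(np.uint8)))
    rle["counts"] = rle["counts"].decode("utf-8")
    records.append({"id": i, "bbox": m["bbox"], "area": m["area"],
                    "segmentation": rle})

with open("masks.json", "w") as f:
    json.dump(records, f)

[/code]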

Performance optimization

GPU memory

[code] # Use smaller model for limited VRAM
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Process images in batches  
# Clear CUDA cache between large batches  
torch.cuda.empty_cache()

[/code]

Speed optimization

[code] # Use half precision
sam = sam.half()

# Reduce points for automatic generation  
mask_generator = SamAutomaticMaskGenerator(  
    model=sam,  
    points_per_side=16,  # Default is 32  
)

# Use ONNX for deployment  
# Export with --return-single-mask for faster inference

[/code]
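
If plain .half() causes dtype errors in pre/post-processing, mixed precision via torch.autocast is a gentler variant (a sketch, assuming a CUDA device):

[code] import torch

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

[/code]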

Common issues

| Issue | Solution |
|---|---|
| Out of memory | Use the ViT-B model, reduce the image size |
| Slow inference | Use ViT-B, reduce points_per_side |
| Poor mask quality | Try different prompts; combine a box with points |
| Edge artifacts | Filter by stability_score |
| Small objects missed | Increase points_per_side |
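
For the out-of-memory row, downscaling before set_image is a quick fix; note that SAM resizes the longest side to 1024 internally, so this mainly trims pre/post-processing memory on very large inputs (a sketch):

[code] # Downscale oversized images before encoding
h, w = image.shape[:2]
scale = 1024 / max(h, w)
if scale < 1:
    image = cv2.resize(image, (int(w * scale), int(h * scale)))
predictor.set_image(image)

[/code]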
References

* Advanced Usage - batch processing, fine-tuning, integration
* Troubleshooting - common problems and solutions
