Feature · Visual Search

Visual search API for ecommerce, powered by CLIP.

Shoppers find products by uploading a photo, cropping a region, pinning multiple references, or combining an image with text. Trooply's visual search API runs on CLIP ViT-L/14, OpenAI's vision-language model, and returns results in under 200 milliseconds at the 95th percentile.

Model: CLIP ViT-L/14 (768-dim)
Latency: < 200 ms p95
Vector DB: Qdrant (multi-tenant)
Endpoints: text · image upload · URL · crop · multi-image · fusion · voice

How it works

From photo to top-K in three steps.

No models to host, no embeddings to manage, no re-ranking to write. The vector database and the scoring stack come with the API.

Step 1

Index your catalog

Send each product's image URL and metadata to POST /v1/products. Trooply downloads the image, encodes it to a 768-dimensional CLIP vector, and stores it in your dedicated Qdrant collection.
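
As a rough illustration, the indexing call could be built like this in Python. The body field names (id, image_url, metadata) and the sample values are assumptions made for this sketch, not the documented schema, and the host is taken from the curl example further down this page. The request is constructed but not sent.

```python
import json
import urllib.request

API_BASE = "https://search.trooply.ai"  # host as shown in the curl example on this page


def build_index_request(token, product_id, image_url, metadata):
    """Build (but do not send) a POST /v1/products request.

    The body field names below are hypothetical; check the API reference
    for the real schema.
    """
    body = {"id": product_id, "image_url": image_url, "metadata": metadata}
    return urllib.request.Request(
        url=f"{API_BASE}/v1/products",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_index_request(
    token="YOUR_TOKEN",
    product_id="sku-123",  # illustrative values only
    image_url="https://cdn.example.com/products/sku-123.jpg",
    metadata={"category": "handbags", "price": 240, "in_stock": True},
)
# To actually send it: urllib.request.urlopen(req)
```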

Step 2

Search by whatever

Call /v1/search/text, /v1/search/url, or /v1/search/fusion. We embed the query into the same space, run approximate-nearest-neighbour (ANN) retrieval, and re-rank with colour, aspect ratio, category, and popularity signals.

Step 3

Render + track

Every hit comes back with a similarity_score and full metadata — no second round-trip. Log clicks via /v1/search/feedback and the ranking engine learns from conversions.

What you get

A full visual retrieval stack, managed.

Search by photo

Upload a phone snap, a style reference, or a product shot from a competitor's site. CLIP finds the nearest matches in your catalog regardless of background, lighting, or crop.

Search by text

The same CLIP model encodes queries into the same 768-dim space, so "red leather tote" retrieves against the image index — no separate text-only backend to maintain.

Crop search

Pass a bounding box with the image. We crop to the region of interest before embedding. Perfect for street-style or lifestyle shots with multiple products in frame.
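
If you compute the bounding box client-side, it helps to clamp it to the image before sending. The sketch below assumes a pixel box given as (x, y, width, height); the box format the API actually expects is not stated on this page, so treat the layout as hypothetical.

```python
def clamp_box(box, image_width, image_height):
    """Clamp an (x, y, width, height) pixel box to the image bounds.

    Returns (left, top, right, bottom). The box layout is an assumption
    for illustration, not the API's documented format.
    """
    x, y, w, h = box
    left = max(0, min(x, image_width))
    top = max(0, min(y, image_height))
    right = max(left, min(x + w, image_width))
    bottom = max(top, min(y + h, image_height))
    if right == left or bottom == top:
        raise ValueError("bounding box does not overlap the image")
    return left, top, right, bottom


# A box that spills over the right edge of a 1000x800 image is trimmed.
region = clamp_box((900, 50, 400, 300), image_width=1000, image_height=800)
```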

Multi-image reference

Shoppers can pin 2–5 inspiration images. We embed each and search against their averaged centroid, which captures the underlying theme and ignores one-off noise.
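
A minimal sketch of the centroid idea, using plain Python lists in place of real 768-dimensional CLIP vectors. Re-normalising the mean back to unit length is an assumption made here so the result stays comparable under cosine similarity; the page does not say whether the service does this.

```python
import math


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def centroid_query(embeddings):
    """Average several embeddings component-wise, then re-normalise."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return normalise(mean)


# Two toy 2-dimensional "embeddings" pointing in different directions.
query = centroid_query([[1.0, 0.0], [0.0, 1.0]])
```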

Fusion queries

"Like this, but cheaper and in blue." Pass an image + text + weight; we interpolate between the two embeddings and run one retrieval.

Voice search

Upload a WebM / WAV / MP3. Gemma 4 E4B transcribes server-side and feeds the transcript into the text-search pipeline. Useful for mobile-first storefronts.

Use cases

Where visual search wins.

Fashion & accessories

"I saw this bag on Instagram."

Keyword search fails when the shopper doesn't know the brand, the collection, or the season. An upload flow converts "I saw a thing" into "here's the thing" in one step.

Home & interiors

"Match this sofa."

Shoppers upload a room photo and crop to the piece of furniture they're after. Trooply returns catalog products that fit the style, scale, and colour.

Electronics

"This cable ended up in my drawer."

Nobody googles "USB-C to Lightning 2m braided black" unless they're a hobbyist. A photo resolves the correct SKU without making the shopper learn the taxonomy.

Beauty

"What shade is this?"

Upload a swatch photo from a magazine. Trooply matches against the product imagery in the catalog. The shopper skips the colour-name maze.

The technical details.

Every indexed product is encoded by CLIP ViT-L/14, OpenAI's 428M-parameter vision-language model. The image goes through a 24-layer Vision Transformer and comes out as a unit-normalised 768-dimensional vector. That vector lives in a per-tenant Qdrant collection with HNSW indexing, which answers approximate-nearest-neighbour queries in a few milliseconds even on catalogs with hundreds of thousands of products.
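
Unit-normalising the vectors is what keeps retrieval cheap: for unit-length vectors, cosine similarity reduces to a plain dot product. The toy sketch below checks that identity on 2-dimensional vectors; the function names are illustrative and not part of any Trooply or Qdrant API.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def unit(vec):
    """Return the vector scaled to length 1."""
    n = norm(vec)
    if n == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / n for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity for vectors of arbitrary length."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]
full = cosine(a, b)           # divides by both norms at query time
fast = dot(unit(a), unit(b))  # norms already folded into the stored vectors
```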

A shopper query — text, image upload, image URL, crop, multi-image, or audio — goes through the same pipeline:

1. Pre-filter in Qdrant using any filters the caller passed (category, price range, in-stock, custom fields).
2. ANN retrieval returns the top ~30 candidates by cosine similarity.
3. Re-rank blends the CLIP score with colour-histogram overlap, aspect-ratio distance, category-match score, per-product popularity, and (when configured) merchandising rules. Ten per-category scoring profiles tune the blend weights for apparel, electronics, footwear, home, beauty, and the rest.
4. Hard-gate strong product-type mismatches: if the query looks like a handbag at high confidence and a result looks like a sweater at high confidence, we filter it even if cosine similarity was decent.
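
The re-rank and hard-gate steps can be pictured roughly as below. This is a sketch under stated assumptions, not Trooply's implementation: the weighted-sum blend, the weights in APPAREL_PROFILE, the 0.8 confidence gate, and all field names are invented for illustration. The gate is applied before sorting here, which yields the same final order as filtering afterwards.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    product_id: str
    clip_score: float              # cosine similarity from ANN retrieval
    colour_overlap: float          # 0..1, colour-histogram overlap
    aspect_ratio_closeness: float  # 0..1, where 1.0 means identical aspect ratio
    category_match: float          # 0..1
    popularity: float              # 0..1
    predicted_type: str
    type_confidence: float


# Hypothetical blend weights; real profiles are tuned per category.
APPAREL_PROFILE = {
    "clip_score": 0.55,
    "colour_overlap": 0.25,
    "aspect_ratio_closeness": 0.05,
    "category_match": 0.10,
    "popularity": 0.05,
}


def blended_score(candidate, profile):
    """Weighted sum of the candidate's signals."""
    return sum(weight * getattr(candidate, name) for name, weight in profile.items())


def rerank(candidates, profile, query_type, query_confidence, gate=0.8):
    """Drop confident product-type mismatches, then sort by blended score."""
    kept = []
    for c in candidates:
        confident_mismatch = (
            query_confidence >= gate
            and c.type_confidence >= gate
            and c.predicted_type != query_type
        )
        if not confident_mismatch:
            kept.append(c)
    return sorted(kept, key=lambda c: blended_score(c, profile), reverse=True)


candidates = [
    Candidate("bag-a", 0.80, 0.90, 0.90, 1.0, 0.50, "handbag", 0.92),
    Candidate("sweater-b", 0.85, 0.95, 0.70, 0.0, 0.90, "sweater", 0.95),
    Candidate("bag-c", 0.70, 0.40, 0.80, 1.0, 0.90, "handbag", 0.88),
]
ranked = rerank(candidates, APPAREL_PROFILE, query_type="handbag", query_confidence=0.9)
```

In this toy run the sweater has the highest raw CLIP score but is removed by the gate, which is the behaviour the last step describes.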

Example: search by image URL

curl -X POST https://search.trooply.ai/v1/search/url \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "image_url": "https://cdn.example.com/street-photo.jpg",
    "limit": 10,
    "threshold": 0.2,
    "filters": {"in_stock": true, "price": {"lte": 500}}
  }'

What comes back

Every result carries a similarity_score in [0, 1], the full metadata you indexed with, and a _match_breakdown explaining the signals that contributed to its rank, plus a one-line human-readable explanation that the portal's Match Analysis modal can render verbatim.
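
A small sketch of handling such a response on the client. Only similarity_score and _match_breakdown are named on this page; the top-level "results" key, the "id" field, the breakdown keys, and every value in the sample payload are made up for illustration and should be checked against the API reference.

```python
import json

# Hypothetical response body; the shape and the numbers are illustrative only.
sample_response = json.loads("""
{
  "results": [
    {"id": "sku-1", "similarity_score": 0.83,
     "_match_breakdown": {"clip": 0.80, "colour": 0.90}},
    {"id": "sku-2", "similarity_score": 0.15,
     "_match_breakdown": {"clip": 0.14, "colour": 0.20}},
    {"id": "sku-3", "similarity_score": 0.41,
     "_match_breakdown": {"clip": 0.45, "colour": 0.30}}
  ]
}
""")


def top_hits(response, min_score=0.2):
    """Keep hits at or above min_score, best first."""
    hits = [
        hit for hit in response.get("results", [])
        if hit.get("similarity_score", 0.0) >= min_score
    ]
    return sorted(hits, key=lambda hit: hit["similarity_score"], reverse=True)


hits = top_hits(sample_response)
hit_ids = [hit["id"] for hit in hits]
```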

FAQ

Common questions.

Do I need to train a model on my catalog?

No. CLIP was trained on 400 million image-caption pairs and generalises to most ecommerce categories without fine-tuning. You index your products once and visual search starts working.

How is this different from reverse image search?

Reverse image search matches pixels — it works for near-duplicates and breaks on almost everything else. CLIP matches concepts: a photo of a handbag and a catalog shot of the same handbag end up near each other in embedding space even though their pixels are completely different.

What's the latency?

Under 200 milliseconds at the 95th percentile for catalogs of up to 100,000 products. CLIP encoding takes ~20 ms on GPU, Qdrant ANN retrieval takes a few milliseconds, and re-ranking is fast CPU work. The biggest variable is the network round-trip to your region.

Does this work for non-fashion catalogs?

Yes. We ship 10 per-category scoring profiles — apparel, footwear, accessories, electronics, displays, audio, optics, home, media, beauty — each with its own blend of CLIP similarity, colour overlap, aspect ratio, and category match. Apparel weights colour heavily; electronics weights aspect ratio and category. The profile is auto-selected per query.

Can I search by text and image together?

Yes — that's what POST /v1/search/fusion does. Pass an image URL, a text query, and an image_weight (0–1). We embed both and interpolate between them before retrieval. 0.7 means image dominates; 0.3 means text dominates.
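
One way to picture the interpolation is a weighted sum of the two embeddings followed by re-normalisation. This is a sketch that assumes linear interpolation and a final unit-normalisation step; the page only says the embeddings are interpolated, so the exact formula may differ.

```python
import math


def fuse(image_vec, text_vec, image_weight):
    """Blend two embeddings: image_weight * image + (1 - image_weight) * text."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    if len(image_vec) != len(text_vec):
        raise ValueError("embeddings must have the same dimension")
    mixed = [
        image_weight * i + (1.0 - image_weight) * t
        for i, t in zip(image_vec, text_vec)
    ]
    norm = math.sqrt(sum(x * x for x in mixed))
    if norm == 0.0:
        raise ValueError("embeddings cancel out at this weight")
    return [x / norm for x in mixed]


# Toy 2-dimensional vectors: the image points along axis 0, the text along axis 1.
image_heavy = fuse([1.0, 0.0], [0.0, 1.0], image_weight=0.7)
text_heavy = fuse([1.0, 0.0], [0.0, 1.0], image_weight=0.3)
```

With image_weight at 0.7 the fused query leans towards the image axis, and at 0.3 it leans towards the text axis, matching the description above.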

What about the privacy of shopper-uploaded photos?

Uploaded images are embedded and discarded — we don't retain the raw pixels. Search logs store only the product IDs that were returned, not the query image. Per-tenant collections mean your catalog is never shared across clients.

Try visual search on your catalog.

The free-forever tier handles up to 1,000 products. Upgrade only when conversions justify it.
