Foundations · 2026-04-22 · 8 min read

Visual search 101: what happens when a shopper uploads a photo.

A shopper sees someone's handbag on Instagram. They don't know the brand. They can't describe the pattern. They have a picture. This is the exact query keyword search can't answer — and it's the one Trooply exists to answer in under 200 ms.

The mental model

Forget "image search" for a second. Trooply is a similarity engine. When you index a product, we convert its image into a 768-dimensional vector (a point in a high-dimensional space) using a vision model called CLIP ViT-L/14. Two products that look alike end up close together in that space. Two products that don't look alike end up far apart. That's the whole trick.
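To make "close together" concrete, here is a minimal sketch that ranks a toy index by cosine similarity. Cosine is an assumption (the distance metric isn't stated here), and the three-dimensional vectors and product names are invented stand-ins for real 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k):
    """Return the k (product_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional stand-ins for real embeddings (hypothetical values).
index = {
    "red-tote": [0.9, 0.1, 0.0],
    "red-clutch": [0.8, 0.3, 0.1],
    "blue-sneaker": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]  # pretend this is the embedded shopper photo

nearest = top_k(query, index, k=2)
print(nearest)
```

A brute-force scan like this is fine for three products; a production index uses approximate nearest-neighbour search to avoid comparing against every vector.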

The clever part of CLIP — and the reason Trooply handles text queries and image queries on the same index — is that CLIP maps images and captions into the same space. A photo of a red leather tote and the phrase "red leather tote" land near each other, because CLIP was trained on 400 million image/caption pairs to enforce exactly that. So when a shopper types "red leather tote" we embed the text, look up the nearest products, and get back the same kind of results we'd get if they'd uploaded a photo.

What the call actually looks like

The minimum viable visual search is a single request. Pass an image URL, get the top-K products back:

curl -X POST https://search.trooply.ai/v1/search/url \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"image_url":"https://cdn.shop.com/photos/bag.jpg","limit":10}'

You get back a results array sorted by similarity, plus a query_time_ms number. Each hit carries its full metadata payload — whatever you indexed the product with — so the storefront renders the card without a second round-trip.

{
  "query_time_ms": 142,
  "count": 10,
  "results": [
    {
      "product_id": "SKU-4821",
      "similarity_score": 0.87,
      "metadata": {"name": "Marla Tote", "price": 189, "category": "Handbags"}
    },
    ...
  ]
}
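The same call can be sketched from Python with only the standard library. The endpoint, headers, and field names are taken from the examples above; the helper names `build_search_request` and `summarise_hits` and the placeholder token are illustrative, not part of any Trooply SDK.

```python
import json
import urllib.request

SEARCH_URL = "https://search.trooply.ai/v1/search/url"  # endpoint from the curl example

def build_search_request(token, image_url, limit=10):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"image_url": image_url, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def summarise_hits(response_text):
    """Extract (product_id, similarity_score, name) for each hit in a response body."""
    payload = json.loads(response_text)
    return [
        (hit["product_id"], hit["similarity_score"], hit["metadata"].get("name"))
        for hit in payload["results"]
    ]

request = build_search_request("example-token", "https://cdn.shop.com/photos/bag.jpg")

# Sample body mirroring the response above, trimmed to a single hit.
sample = json.dumps({
    "query_time_ms": 142,
    "count": 1,
    "results": [{
        "product_id": "SKU-4821",
        "similarity_score": 0.87,
        "metadata": {"name": "Marla Tote", "price": 189, "category": "Handbags"},
    }],
})
hits = summarise_hits(sample)

# To send for real: urllib.request.urlopen(request, timeout=10)
```

Because each hit carries its metadata, `summarise_hits` can pull the display name straight from the response without a second lookup.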

What Trooply does on the server side of that call

Between curl and the response, five things happen in sequence:

1. Download & normalise the image. We fetch the URL, compress it, strip the background when useful, and resize to the model's 224×224 input.
2. Encode with CLIP. The image becomes a 768-dimensional float vector on GPU. About 20 ms on an RTX 5070 Ti.
3. Pre-filter. If you passed filters (category, price range, in-stock, any custom field), we apply those as a structural filter to narrow the candidate set before the vector search runs.
4. Vector retrieval from Qdrant. Each tenant has its own Qdrant collection. ANN search returns the top ~30 candidates.
5. Re-rank. CLIP similarity alone isn't the last word. We blend it with colour-histogram overlap, aspect-ratio distance, category-match score, per-product popularity, and (if configured) merchandising rules. Per-category scoring profiles pick different weight sets for, say, apparel vs electronics.
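The pre-filter and re-rank steps can be sketched as follows. This is an illustration under stated assumptions, not Trooply's scoring code: the weights, the signal names, the weighted-sum blend, and the filter syntax are all hypothetical, and every signal is assumed to be scaled to 0–1 with higher meaning better (so aspect-ratio distance is represented as a closeness score).

```python
# Hypothetical weight profiles; the real per-category weights are not documented here.
PROFILES = {
    "apparel": {"clip": 0.5, "colour": 0.3, "aspect": 0.05, "category": 0.1, "popularity": 0.05},
    "default": {"clip": 0.7, "colour": 0.1, "aspect": 0.05, "category": 0.1, "popularity": 0.05},
}

def passes_filters(metadata, filters):
    """Structural filter: exact match per key, or an inclusive (low, high) range for tuples."""
    for key, wanted in filters.items():
        value = metadata.get(key)
        if isinstance(wanted, tuple):
            low, high = wanted
            if value is None or not (low <= value <= high):
                return False
        elif value != wanted:
            return False
    return True

def blended_score(signals, profile):
    """Weighted sum of component scores; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in profile.items())

def rerank(candidates, category):
    """Sort candidates by blended score using the category's profile, falling back to default."""
    profile = PROFILES.get(category, PROFILES["default"])
    return sorted(candidates, key=lambda c: blended_score(c["signals"], profile), reverse=True)

candidates = [
    {"product_id": "A", "metadata": {"category": "Handbags", "price": 189},
     "signals": {"clip": 0.90, "colour": 0.2, "aspect": 1.0, "category": 1.0, "popularity": 0.5}},
    {"product_id": "B", "metadata": {"category": "Handbags", "price": 140},
     "signals": {"clip": 0.75, "colour": 0.9, "aspect": 1.0, "category": 1.0, "popularity": 0.5}},
    {"product_id": "C", "metadata": {"category": "Shoes", "price": 95},
     "signals": {"clip": 0.95, "colour": 0.9, "aspect": 1.0, "category": 0.0, "popularity": 0.9}},
]

# Filter first (the ANN search that would run in between is omitted), then re-rank.
filters = {"category": "Handbags", "price": (100, 200)}
shortlist = [c for c in candidates if passes_filters(c["metadata"], filters)]
apparel_order = [c["product_id"] for c in rerank(shortlist, "apparel")]
default_order = [c["product_id"] for c in rerank(shortlist, "electronics")]
```

With these made-up numbers the colour-heavy apparel profile ranks B above A, while the CLIP-heavy default profile ranks A above B, which is the kind of difference a per-category profile is meant to produce.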

Detail

On every result we also return a _match_breakdown inside metadata with the raw component scores and a one-line human explanation like "Strong match (87%): your query looks like a handbag; dominant colours overlap (red, black)." The portal's Match Analysis modal reads this to explain why the match ranked where it did.

Beyond a single photo

Beyond /v1/search/url, three related endpoints cover slightly different shopper behaviours.

Crop: isolate an object before searching

A shopper uploads a street-style photo with a handbag, jacket, and sneakers all visible. If you care about matching just the bag, pass a bounding box alongside the image:

POST /v1/search/crop
{
  "image_url": "https://cdn.shop.com/street.jpg",
  "bbox": [120, 340, 480, 640]
}

We crop to the region, embed that crop, and run the full pipeline. This is the right endpoint for "search like this patch of the photo" UX.
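If your UI lets shoppers drag a selection, it helps to clamp the box to the image before sending it. The sketch below assumes the box is `[x1, y1, x2, y2]` in pixels with an exclusive bottom-right corner; the example above does not state the coordinate convention, so check it before relying on this. Both helper names are illustrative.

```python
def clamp_bbox(bbox, width, height):
    """Clamp an [x1, y1, x2, y2] pixel box (assumed layout) to the image bounds."""
    x1, y1, x2, y2 = bbox
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bounding box has no area inside the image")
    return [x1, y1, x2, y2]

def crop_pixels(pixels, bbox):
    """Crop a row-major pixel grid (list of rows); x2 and y2 are treated as exclusive."""
    x1, y1, x2, y2 = bbox
    return [row[x1:x2] for row in pixels[y1:y2]]

# A tiny 4x3 "image" whose pixels are labelled with their (row, column) position.
width, height = 4, 3
pixels = [[(r, c) for c in range(width)] for r in range(height)]

box = clamp_bbox([1, 0, 10, 2], width, height)  # x2 overshoots the image edge
patch = crop_pixels(pixels, box)
```

Here the overshooting box is pulled back to `[1, 0, 4, 2]`, and the patch keeps two rows of three pixels.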

Multi-image: average several references

Visual-inspiration shoppers pin ideas, not single images. Give us 2–5 references and we'll embed each, average the vectors, and search against the centroid. That suppresses one-off noise (a weird pose, odd lighting) and surfaces products that match the underlying theme.
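The averaging step can be sketched in a few lines. Normalising each reference before averaging, and re-normalising the mean afterwards, are assumptions made here so that no single reference dominates by magnitude; the two-dimensional vectors are toy values.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def centroid(vectors):
    """Average unit-normalised vectors component-wise, then re-normalise the mean."""
    if not vectors:
        raise ValueError("need at least one reference vector")
    unit = [l2_normalise(v) for v in vectors]
    dims = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dims)]
    return l2_normalise(mean)

# Two toy references pointing in different directions, with different magnitudes.
references = [[2.0, 0.0], [0.0, 5.0]]
query_vector = centroid(references)
```

The result sits halfway between the two directions regardless of the references' original lengths, and it is this single vector that goes to the nearest-neighbour search.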

Fusion: image + text in one query

"Like this, but cheaper and in blue." Pass both an image URL and a query string plus an image_weight (0–1). We interpolate between the image embedding and the text embedding, search once, and the weight controls which side wins when they disagree.

POST /v1/search/fusion
{
  "image_url": "https://cdn.shop.com/jacket.jpg",
  "query": "blue",
  "image_weight": 0.7
}
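One way to read "interpolate" is a linear blend of the two unit vectors, sketched below. Linear interpolation, the normalisation steps, and the reading that a higher `image_weight` favours the image are assumptions; the vectors are toy values and `fuse` is an illustrative name.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def fuse(image_vec, text_vec, image_weight):
    """Blend image and text embeddings: weight 1.0 is image only, 0.0 is text only."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    img = l2_normalise(image_vec)
    txt = l2_normalise(text_vec)
    mixed = [image_weight * i + (1.0 - image_weight) * t for i, t in zip(img, txt)]
    return l2_normalise(mixed)

image_vec = [1.0, 0.0]  # toy stand-in for the jacket photo embedding
text_vec = [0.0, 1.0]   # toy stand-in for the "blue" text embedding

fused = fuse(image_vec, text_vec, image_weight=0.7)
```

At 0.7 the fused vector leans towards the image direction while still carrying some of the text signal; sliding the weight towards 0 hands control to the text.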

Why this beats old-school image search

Classical image search matched on pixels — colour histograms, edge detectors, hand-crafted features. It worked for exact-match lookups (near-duplicates, reverse image search) and broke down on almost anything else: a photo of a handbag in a cluttered frame matched random cluttered frames.

CLIP understands concepts, not pixels. That means it generalises across backgrounds, lighting, and even stylised illustrations. A shopper's phone photo of a retail shelf resolves the right product; a product's catalogue shot on a white background resolves the same product; a hand-drawn sketch of the same item gets close. You do not need a duplicate of the exact image in your catalog; you need a conceptually similar one. That's a far looser matching requirement than pixel-level similarity.

Where to go next

Once your catalog is indexed, the interesting questions become about control: how do I pin a SKU during a campaign, how do I boost certain brands, how do I suppress out-of-stock items, how do I explain a low-confidence result. That's the job of the merchandising engine; read "Merchandising without the ticket queue" next.