Guide · Indexing · 2026-04-22 · 9 min read

Indexing your catalog.

Your ecommerce platform owns the source of truth for products. Trooply owns a searchable index derived from it. This guide is about keeping the two in sync, starting with a single SKU and scaling up to a bulk sync of thousands of products.

The product record

A Trooply product needs three things: a stable ID, an image URL, and whatever metadata you want to filter or display on.

POST /v1/products
Authorization: Bearer $TOKEN
Content-Type: application/json

{
  "product_id": "SKU-48120",
  "image_url":  "https://cdn.shop.com/bags/marla.jpg",
  "metadata": {
    "name":     "Marla Tote",
    "price":    189,
    "category": "Handbags",
    "brand":    "Marla & Co",
    "tags":     ["leather", "new-arrival"],
    "in_stock": true
  }
}

Three rules that apply to every indexed product:

  • product_id must be stable. It's the only identifier shoppers and feedback events will be joined on. If your platform mutates SKUs when variants change, use a non-mutating parent ID — not the variant ID.
  • image_url must be a publicly fetchable URL. We download it once to generate the embedding, so it needs to respond within ~10 seconds without auth. https:// only; no file://, no data URIs on indexing (search queries can use data URIs; indexing can't).
  • metadata is a free-form dict — anything JSON-serialisable. We store it verbatim and hand it back on every search hit so you don't need a second round-trip to your DB to render a card.
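The three rules above can be checked on your side before a product ever leaves your platform. The following is a minimal sketch, not part of any Trooply SDK: check_product is a hypothetical helper, and treating product_id as a non-empty string is an assumption based on the example record.

```python
from urllib.parse import urlparse


def check_product(product: dict) -> list[str]:
    """Return a list of problems found in a product record (empty if none)."""
    problems = []

    # Assumption: product_id is a non-empty string, as in the example record.
    product_id = product.get("product_id")
    if not isinstance(product_id, str) or not product_id.strip():
        problems.append("product_id must be a non-empty string")

    # Indexing accepts https:// only: no file://, no data URIs.
    image_url = product.get("image_url")
    parsed = urlparse(image_url if isinstance(image_url, str) else "")
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("image_url must be an https:// URL")

    # metadata is free-form, but it has to be a JSON object.
    if not isinstance(product.get("metadata", {}), dict):
        problems.append("metadata must be a JSON object")

    return problems
```

This only checks the shape of the record. It can't tell you whether the image is publicly fetchable within the time limit; that still surfaces at indexing time.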

The metadata convention

You can put whatever you want in metadata, but certain keys are understood by Trooply's built-in logic and shouldn't be repurposed:

Key             Used for
name            Display on result cards; default field in Match Analysis headers.
price           Filterable as a numeric range. Any currency; we don't do conversion, so pick one.
image_url       Mirrored into metadata automatically so storefronts can render from a single field.
category        Filterable exact match. Feeds per-category scoring profiles and the category_match merchandising / banner scope.
vendor          Filterable exact match. Useful for multi-brand catalogs.
tags            List of strings. Filterable as list-any-of.
badges          List of strings (or {text, bg, fg} objects) rendered on result cards, e.g. "New", "Sale", "Hot".
in_stock        Boolean. Filter with {"in_stock": true}.
stock_quantity  Numeric. Filter with {"stock_quantity": {"gte": 1}}.

Keys starting with _ are reserved for Trooply's internal breakdown fields (_match_breakdown, _score_breakdown, etc.). Don't prefix your own fields with underscore.
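If your platform's field names are loosely governed, a pre-flight check against this convention catches mistakes early. This is a sketch under assumptions: check_metadata is a hypothetical helper, and the type expectations are read off the key descriptions above rather than taken from a published schema.

```python
def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_string_list(value) -> bool:
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


# Expected types for a few of the well-known keys (assumed from the descriptions).
EXPECTED = {
    "price": (_is_number, "a number"),
    "stock_quantity": (_is_number, "a number"),
    "in_stock": (lambda v: isinstance(v, bool), "a boolean"),
    "tags": (_is_string_list, "a list of strings"),
}


def check_metadata(metadata: dict) -> list[str]:
    """Return convention violations found in a metadata dict (empty if none)."""
    problems = [
        f"{key}: keys starting with _ are reserved"
        for key in metadata
        if key.startswith("_")
    ]
    for key, (is_valid, expected) in EXPECTED.items():
        if key in metadata and not is_valid(metadata[key]):
            problems.append(f"{key}: expected {expected}")
    return problems
```

badges is left out on purpose, since it accepts either strings or objects.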

Anything else — material, fit, screen size, SPF, whatever — goes straight into metadata. If you want it to drive a filter sidebar or a facet API, declare it via custom fields first so Trooply knows its type. See Custom fields & filters for the full story.

Updates: POST vs PUT

Both accept the same body shape. The only difference is upsert semantics:

  • POST /v1/products — idempotent-ish. If a product with that product_id already exists, the server merges the new fields into the existing record and re-embeds if image_url changed. Safe to call on every product-change event without checking first.
  • PUT /v1/products/{product_id} — canonical update semantics. 404 if the product doesn't exist. Use when you want to be strict about "this update must be to an existing product".

In practice most integrations use POST for both create and update — it simplifies the integration path from "is this a new product?" logic down to "send what we have".
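To make the difference concrete, here is a toy in-memory model of the two calls. It is an illustration only, not Trooply's implementation: ToyIndex is hypothetical, the 201 and 200 return codes are assumptions (only the 404 is stated above), and merging metadata key by key is one plausible reading of "merges the new fields".

```python
class ToyIndex:
    """In-memory stand-in for the product index, for illustration only."""

    def __init__(self):
        self.products = {}

    def post(self, body: dict) -> int:
        """Upsert: create the record, or merge fields into the existing one."""
        product_id = body["product_id"]
        existing = self.products.get(product_id)
        if existing is None:
            self.products[product_id] = {
                "image_url": body["image_url"],
                "metadata": dict(body.get("metadata", {})),
            }
            return 201  # assumed status code for a create
        # Fields absent from the body keep their current values.
        existing["image_url"] = body.get("image_url", existing["image_url"])
        existing["metadata"].update(body.get("metadata", {}))
        return 200  # assumed status code for a merge

    def put(self, product_id: str, body: dict) -> int:
        """Strict update: refuse to create a missing product."""
        if product_id not in self.products:
            return 404
        return self.post({**body, "product_id": product_id})
```

With this model, a price-only POST to an existing SKU leaves its name and image untouched, while the same body sent via PUT to an unknown SKU is rejected.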

When does re-embedding happen?

Every time image_url changes. Re-embedding is ~20ms of GPU work, amortised into the response — you don't have to do anything special. If you're only changing metadata (price, name, tags), the embedding stays untouched. If you're replacing the image, send the new URL and Trooply re-computes.
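The rule is simple enough to state as a predicate. The sketch below mirrors the behaviour described above so you can predict it on your side, for example when estimating how much of a sync will touch the GPU path; needs_reembed is a hypothetical name, and the server makes the real decision.

```python
from typing import Optional


def needs_reembed(existing: Optional[dict], incoming: dict) -> bool:
    """True when an update would cause a new embedding to be computed."""
    if existing is None:
        return True  # new product: there is no embedding yet
    new_url = incoming.get("image_url")
    # A metadata-only update carries no new image_url, so nothing changes.
    return new_url is not None and new_url != existing.get("image_url")
```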

Deletes

Per-product delete removes the SKU from the index and invalidates any cached result that contained it:

DELETE /v1/products/SKU-48120

The collection-level DELETE /v1/products (wipe everything) is intentionally absent from the public API reference — it's a catastrophic call for any platform integration. If you ever need to nuke a catalog, do it from the portal.
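If your SKUs can contain characters such as / or spaces, encode them before putting them in the path. A minimal sketch using only the standard library: BASE_URL is a placeholder host, build_delete_request is a hypothetical helper, and percent-encoding the ID is standard URL practice rather than something this guide specifies.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://api.trooply.example"  # placeholder host, replace with yours


def build_delete_request(product_id: str, token: str) -> Request:
    """Build (but do not send) a per-product DELETE request."""
    # safe="" also encodes "/", so an ID like "A/B" stays one path segment.
    path = f"/v1/products/{quote(product_id, safe='')}"
    return Request(
        BASE_URL + path,
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
```

Pass the result to urllib.request.urlopen, or port the same path construction to whatever HTTP client you already use.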

Bulk indexing

The per-product endpoint is fine up to a few dozen products at a time. For the initial catalog sync — usually hundreds to tens of thousands of products — use the bulk path.

POST /v1/products/bulk
{
  "products": [
    {"product_id": "SKU-1", "image_url": "https://...", "metadata": {...}},
    {"product_id": "SKU-2", "image_url": "https://...", "metadata": {...}},
    ...
  ]
}

Response is a job handle, not the actual indexing result:

{"job_id": "job_b14f...", "status": "queued", "total": 8124}

Poll until done:

GET /v1/jobs/job_b14f...

{
  "job_id":    "job_b14f...",
  "status":    "running",
  "total":     8124,
  "processed": 2410,
  "failed":    3,
  "failures":  [
    {"product_id": "SKU-98", "error": "image_download_timeout"},
    ...
  ]
}

Statuses progress queued → running → completed (or failed in the unhappy path). Poll every 2–5 seconds; don't hammer it every 100 ms.
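A polling loop along these lines keeps to that cadence. It is a sketch, not an official client: fetch_job is a hypothetical callable you supply that performs GET /v1/jobs/{job_id} and returns the parsed JSON, and the one-hour timeout is an arbitrary default.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval: float = 3.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a bulk job until it reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        # completed and failed are the two terminal statuses.
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)  # stay within the 2-5 second polling window
```

The sleep parameter exists so tests can run the loop without actually waiting.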

Batch size

The bulk endpoint accepts up to 1,000 products per call. Split larger catalogs into chunks — we chunk internally for processing either way, but a 10 MB request body is a bad idea regardless.
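Splitting a catalog into calls of at most 1,000 products takes a few lines. chunked is a hypothetical helper name; each chunk it yields becomes the "products" array of one bulk request.

```python
from typing import Iterator


def chunked(products: list, size: int = 1000) -> Iterator[list]:
    """Yield consecutive slices of at most `size` products."""
    for start in range(0, len(products), size):
        yield products[start:start + size]
```

For the 8,124-product job in the example above, this yields eight full chunks and a final chunk of 124.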

Partial failures

A bulk job that can't fetch a product's image doesn't abort the whole job — it continues and reports the failing product in failures. Your job-consumer should surface those so the catalog team can fix image URLs and re-index. The common causes are broken CDN URLs, products that lost their primary image, and auth-gated image servers.
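One way to surface failures is to group them by error code, so the catalog team gets one list per cause instead of a flat dump. A minimal sketch, assuming the failures array has the shape shown in the job response above; failures_by_error is a hypothetical helper.

```python
from collections import defaultdict


def failures_by_error(job: dict) -> dict[str, list[str]]:
    """Group the product IDs in a job's failures array by error code."""
    grouped = defaultdict(list)
    for failure in job.get("failures", []):
        grouped[failure["error"]].append(failure["product_id"])
    return dict(grouped)
```

A job with no failures key, or an empty one, produces an empty dict, so the caller can treat "nothing to report" uniformly.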

Detail

Bulk jobs don't hold a transaction — if the process fails mid-way, the products that already completed stay indexed. You can safely re-run the same bulk payload; products that already exist will be upserted.

Keeping the index in sync with your platform

Three patterns, in order of increasing effort and durability:

Pattern 1: inline on write

In your platform's "product created / updated / deleted" handler, call Trooply directly. Simple, works for small stores, but it couples your platform's write latency to Trooply's availability: a 500 ms Trooply timeout becomes a 500 ms admin-panel wait.

Pattern 2: webhook + queue

Your platform emits a webhook on product change. A worker process consumes the webhook and calls Trooply. The platform commit is no longer blocked on Trooply; the worker retries on transient failure. This is the right pattern for anything bigger than a hobby store.
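The worker side of this pattern can be sketched with the standard library's queue. Everything here is illustrative: drain, TransientError, and the send callable (which would wrap your call to Trooply) are hypothetical names, and exponential backoff with five attempts is one reasonable retry policy, not a documented requirement.

```python
import queue
import time


class TransientError(Exception):
    """Raised by send() for failures worth retrying, such as timeouts."""


def drain(events, send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Process queued events; return the ones that exhausted their retries."""
    dead_letters = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return dead_letters
        for attempt in range(max_attempts):
            try:
                send(event)
                break  # delivered, move on to the next event
            except TransientError:
                if attempt + 1 < max_attempts:
                    # Back off 1s, 2s, 4s, ... between attempts.
                    sleep(base_delay * 2 ** attempt)
        else:
            # The loop ended without a break: every attempt failed.
            dead_letters.append(event)
```

Events that end up in dead_letters are exactly the ones a periodic diff-sync is meant to catch later.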

Pattern 3: periodic diff-sync

Independent of webhooks, a scheduled job (hourly or nightly) queries your platform for all products updated-since-last-sync and bulk-indexes them. Catches anything the event stream missed — a webhook that failed, a bulk edit that didn't emit per-product events, a data migration. Run both Patterns 2 and 3 together for production workloads.
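A diff-sync run reduces to "fetch what changed, bulk-index it, advance the watermark". The sketch below assumes two callables you would write yourself: fetch_updated_since, which queries your platform, and bulk_index, which sends one POST /v1/products/bulk call. Both names, and diff_sync itself, are hypothetical.

```python
from datetime import datetime, timezone


def diff_sync(fetch_updated_since, bulk_index, last_sync, now=None, batch=1000):
    """Index everything changed since last_sync; return the new watermark."""
    # Take the watermark before querying, so edits made while the sync is
    # running are picked up by the next run instead of being skipped.
    started = now or datetime.now(timezone.utc)
    changed = fetch_updated_since(last_sync)
    for start in range(0, len(changed), batch):
        bulk_index(changed[start:start + batch])
    return started
```

Persist the returned timestamp only after the run succeeds. Because bulk indexing upserts, re-sending a product that a webhook already delivered is harmless.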

When re-indexing is worth it

Most of the time, you don't need to re-index existing products. But two situations warrant it:

  • Your image pipeline changed. If you moved to a new CDN, changed image dimensions, or started generating new crops, re-index so every embedding comes from the new pipeline.
  • Your schema changed. You added a new field you want to filter on, or a custom field declaration changed. Re-indexing brings existing products in line with the new metadata convention.

Trooply doesn't charge per re-index on the current plan — the cost is just the time the bulk job takes (roughly 100 products/second).
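For planning purposes, the two figures in this guide (about 100 products/second, at most 1,000 products per bulk call) give a back-of-the-envelope estimate. plan_reindex is a hypothetical helper, and the throughput is the rough number quoted above, not a guarantee.

```python
import math


def plan_reindex(product_count: int, rate_per_second: float = 100.0, batch: int = 1000):
    """Return (bulk calls needed, estimated processing time in seconds)."""
    calls = math.ceil(product_count / batch)
    seconds = product_count / rate_per_second
    return calls, seconds
```

For the 8,124-product catalog used earlier, that works out to 9 bulk calls and roughly 81 seconds of processing.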

Next

With the catalog indexed, the next concern is production reliability: handling rate limits, retries, and the 4xx / 5xx surface. See Errors & rate limits.