Technical Report · GoWave Automation Lab · 2026

Scan-to-Link: inferring the product behind an unknown bottle barcode with prefix k-NN

A lightweight, training-free recommender that turns the digits of an unscanned EAN-13/UPC-A liquor barcode into a ranked list of catalog products — so a shopkeeper can capture real barcodes by tapping a suggestion instead of typing. It exploits the GS1 company-prefix structure and a labeled corpus that grows every time a link is confirmed. Built into LiKAR's point-of-sale.

Manufacturer top-1: 91.4%
Brand top-1 (recoverable): 82.7%
Brand top-10 (recoverable): 97.8%
Cost: ₹0 (no API, no GS1 fee)

1. Problem

Indian liquor lists (e.g. the West Bengal Excise registry) publish brand name, pack size and MRP, but no barcodes. Public barcode databases (Open Food Facts, etc.) barely cover Indian IMFL, and when they do they return the foreign-market code (a French Heineken 871…) rather than the Indian-retail 890… bottle. The authoritative source (GS1 India) sells a company prefix to issue your own codes (₹48k+/yr), which is useless for looking up someone else's products. So a liquor POS starts with thousands of products and almost no scannable barcodes.

The only 100%-correct source is the bottle itself. The question this work answers: when a clerk scans a bottle we have never seen, can we guess which catalog product it is — from the digits alone — well enough to make capture a single tap?

2. Signal in the digits

An EAN-13 is not random. Under GS1, the leading digits are the company prefix (the brand owner), followed by an item reference the company assigns to each SKU — typically in contiguous blocks, so size variants of one product sit on adjacent numbers.

8 9 0  2 9 6 7  1 0 0 4 2  1
890 = GS1 India
2967 = company (United Spirits)
10042 = item ref (Black Dog block)
1 = check

Two bottles from the same maker share the company segment (2967 above); size variants of one product share most of the item-reference segment (10042 above).
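The segmentation in the example can be sketched in a few lines. This is a minimal illustration, assuming the fixed 3 + 4 + 5 + 1 split shown for this one code; real GS1 company prefixes vary in length, and the names `Ean13Segments` and `splitEan13` are hypothetical, not taken from the LiKAR source.

```typescript
// Hypothetical helper: splits a 13-digit code using the fixed 3 + 4 + 5 + 1
// layout of the example above. GS1 company prefixes vary in length in
// practice, so this fixed split is an assumption for illustration only.
interface Ean13Segments {
  gs1Prefix: string;  // digits 1-3, e.g. "890" (GS1 India)
  company: string;    // digits 4-7
  itemRef: string;    // digits 8-12
  checkDigit: string; // digit 13
}

function splitEan13(code: string): Ean13Segments {
  if (!/^\d{13}$/.test(code)) {
    throw new Error("expected exactly 13 digits");
  }
  return {
    gs1Prefix: code.slice(0, 3),
    company: code.slice(3, 7),
    itemRef: code.slice(7, 12),
    checkDigit: code.slice(12),
  };
}

const seg = splitEan13("8902967100421");
console.log(seg.gs1Prefix, seg.company, seg.itemRef, seg.checkDigit);
```

The 7-digit manufacturer prefix used in the dataset statistics corresponds to `gs1Prefix + company` under this split.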

Therefore the longest common prefix (LCP) between an unknown code and a known one is a direct proxy for “same maker / same brand / same SKU family.” This is the kernel of our model.
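A minimal sketch of such an LCP similarity follows. The function name `lcpLength` is hypothetical, and every comparison code except 8902967100421 is made up for illustration.

```typescript
// Length of the longest common prefix of two digit strings.
function lcpLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) {
    i++;
  }
  return i;
}

const query = "8902967100421";
// Made-up comparison codes, chosen to show three levels of agreement.
const sameFamily = lcpLength(query, "8902967100999"); // agrees through the item-ref block
const sameMaker = lcpLength(query, "8902967555555");  // agrees on the 7-digit prefix only
const sameRegion = lcpLength(query, "8901234567890"); // agrees on "890" only
console.log(sameFamily, sameMaker, sameRegion);
```

Under this reading, a longer shared prefix means a finer-grained match: 3 digits indicate only the GS1 region, 7 digits the maker, and 10 or more digits the same block of item references.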

3. Model

We use an instance-based (lazy) k-Nearest-Neighbors classifier with a custom prefix-similarity kernel. No parameters are trained by gradient descent; the “model” is the labeled corpus, and it updates online. Concretely:

  1. Scan unknown code
  2. k-NN by LCP over corpus
  3. Vote → maker / brand / category
  4. Rank catalog candidates
  5. Clerk taps → link saved (corpus grows)
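The neighbor search and voting steps can be sketched as below. This is an illustration under stated assumptions, not the code in lib/barcode-suggest.ts: the names `Labeled`, `lcpLength` and `vote` are hypothetical, weighting each vote by its LCP length is an assumed choice, and the corpus codes are made up.

```typescript
interface Labeled {
  code: string;
  maker: string;
  brand: string;
  category: string;
}

// Length of the longest common prefix of two digit strings.
function lcpLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Take the k corpus entries with the longest shared prefix, then let each
// neighbor vote for its label. Vote weight = LCP length (an assumption).
function vote(
  query: string,
  corpus: Labeled[],
  k: number,
  label: (e: Labeled) => string
): Array<[string, number]> {
  const neighbors = corpus
    .map((item) => ({ item, sim: lcpLength(query, item.code) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k);
  const votes = new Map<string, number>();
  neighbors.forEach(({ item, sim }) => {
    const key = label(item);
    votes.set(key, (votes.get(key) ?? 0) + sim);
  });
  return Array.from(votes.entries()).sort((x, y) => y[1] - x[1]);
}

// Made-up corpus; only the 8902967 → United Spirits / Black Dog pairing
// comes from the report.
const corpus: Labeled[] = [
  { code: "8902967100414", maker: "United Spirits", brand: "Black Dog", category: "whisky" },
  { code: "8902967100438", maker: "United Spirits", brand: "Black Dog", category: "whisky" },
  { code: "8902967555012", maker: "United Spirits", brand: "Brand B (made up)", category: "whisky" },
  { code: "8901234000017", maker: "Maker Z (made up)", brand: "Brand C (made up)", category: "rum" },
];

const brandVotes = vote("8902967100421", corpus, 3, (e) => e.brand);
const makerVotes = vote("8902967100421", corpus, 3, (e) => e.maker);
console.log(brandVotes, makerVotes);
```

Passing the label as a function lets the same routine produce the maker, brand and category votes from one neighbor search.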

4. Dataset

The corpus is 775 real EAN-13/UPC-A codes with known product, brand and category, drawn from the official Assam price master and West Bengal cross-matches shipped in LiKAR. Evaluation uses the 782 labeled Assam codes.

Property                                  Value
Labeled barcodes (corpus)                 775
Distinct brands                           374
Distinct manufacturer prefixes (7-digit)  165
Singleton brands (appear once)            198
Singleton manufacturers                   67

5. Evaluation & results

Protocol: leave-one-out cross-validation. Each code is removed, then ranked against the remaining corpus; we record the rank of its true manufacturer and true brand. We report top-k accuracy on all codes and on the recoverable subset (where ≥2 codes share the label, i.e. a non-cold-start case). The gap between them is pure data sparsity, which the online corpus closes over time.
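The protocol can be sketched as follows. This is a simplified stand-in rather than the evaluation script itself: labels are ranked by their single best LCP match (an assumed ranking rule), the names `rankLabels` and `leaveOneOutTopK` are hypothetical, and the five-code corpus is made up.

```typescript
interface Labeled {
  code: string;
  brand: string;
}

function lcpLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Rank distinct labels by their best LCP against the query (assumed rule).
function rankLabels(query: string, corpus: Labeled[]): string[] {
  const best = new Map<string, number>();
  corpus.forEach((item) => {
    const sim = lcpLength(query, item.code);
    if (sim > (best.get(item.brand) ?? -1)) best.set(item.brand, sim);
  });
  return Array.from(best.entries())
    .sort((x, y) => y[1] - x[1])
    .map((entry) => entry[0]);
}

// Leave-one-out top-k accuracy on all codes and on the recoverable subset
// (labels that occur at least twice in the full corpus).
function leaveOneOutTopK(corpus: Labeled[], k: number): { all: number; recoverable: number } {
  const counts = new Map<string, number>();
  corpus.forEach((c) => counts.set(c.brand, (counts.get(c.brand) ?? 0) + 1));
  let hitsAll = 0;
  let hitsRecoverable = 0;
  let nRecoverable = 0;
  corpus.forEach((held, i) => {
    const rest = corpus.filter((_, j) => j !== i);
    const hit = rankLabels(held.code, rest).slice(0, k).indexOf(held.brand) !== -1;
    const isRecoverable = (counts.get(held.brand) ?? 0) >= 2;
    if (hit) hitsAll++;
    if (isRecoverable) {
      nRecoverable++;
      if (hit) hitsRecoverable++;
    }
  });
  return {
    all: hitsAll / corpus.length,
    recoverable: nRecoverable > 0 ? hitsRecoverable / nRecoverable : 0,
  };
}

// Made-up toy corpus: two brands with two codes each, plus one singleton.
const toy: Labeled[] = [
  { code: "8902967100414", brand: "A" },
  { code: "8902967100438", brand: "A" },
  { code: "8902967555012", brand: "B" },
  { code: "8902967555029", brand: "B" },
  { code: "8901234000017", brand: "C" },
];

const result = leaveOneOutTopK(toy, 1);
console.log(result);
```

In the toy corpus the singleton brand C can never be recovered once its only code is held out, which is what separates the "all" figure from the "recoverable" one.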

Target        Subset       n    Top-1  Top-5  Top-10
Manufacturer  all          782  91.4%  91.4%  91.4%
Manufacturer  recoverable  715  100%   100%   100%
Brand         all          782  61.8%  69.4%  73.0%
Brand         recoverable  584  82.7%  93.0%  97.8%

Reading the numbers. The manufacturer is identified top-1 in 91.4% of cases (100% whenever the maker is not a singleton). Because the manufacturer is known, the clerk sees a short, correct shortlist. The true brand is the single best guess 82.7% of the time on recoverable codes, rising to 97.8% within the top-10; i.e. on a normal scan the right product is almost always one tap away. The lower “all” brand numbers reflect cold-start singletons (a brand seen for the first time has no sibling to match), which are exactly the cases the bootstrapping loop fixes.
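The link between the "all" and "recoverable" rows can be checked with a little arithmetic. The sketch assumes that a held-out singleton is always a miss, because its label is absent from the remaining corpus; the helper names are hypothetical.

```typescript
// If every singleton is a guaranteed miss, accuracy on all codes equals the
// recoverable accuracy scaled by the recoverable share of the evaluation set.
function allFromRecoverable(recoverableAcc: number, nRecoverable: number, nAll: number): number {
  return (recoverableAcc * nRecoverable) / nAll;
}

function pct(x: number): string {
  return (100 * x).toFixed(1) + "%";
}

const nAll = 782;
const nMakerRecoverable = nAll - 67;  // minus singleton manufacturers = 715
const nBrandRecoverable = nAll - 198; // minus singleton brands = 584

const makerTop1 = pct(allFromRecoverable(1.0, nMakerRecoverable, nAll));
const brandTop1 = pct(allFromRecoverable(0.827, nBrandRecoverable, nAll));
const brandTop10 = pct(allFromRecoverable(0.978, nBrandRecoverable, nAll));
console.log(makerTop1, brandTop1, brandTop10);
```

The three values reproduce the 91.4%, 61.8% and 73.0% reported for all codes; the brand top-5 figure agrees to within rounding of the published percentages.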

6. Worked example

Scan 8902967100421 (unknown):

  1. Nearest corpus codes share 8902967100… (LCP = 10) → all Black Dog SKUs.
  2. Manufacturer prefix 8902967 → United Spirits (✓ 98–100% reliable).
  3. Brand vote → Black Dog; category → whisky.
  4. Catalog candidates ranked: Black Dog Triple Gold 750ml / 375ml / 180ml (high) → other United Spirits whiskies (medium).
  5. Clerk taps the right size → 8902967100421 saved; corpus +1.
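Step 4 of the example can be sketched as a simple tiering of catalog products against the inferred maker, brand and category. The tier rules, type names and `rankCandidates` are assumptions made for illustration; the report does not specify the actual scoring used in lib/barcode-suggest.ts, and the non-Black-Dog products are placeholders.

```typescript
interface CatalogProduct {
  name: string;
  maker: string;
  brand: string;
  category: string;
}

interface Inference {
  maker: string;
  brand: string;
  category: string;
}

type Tier = "high" | "medium" | "low";

// Assumed rule: same brand → high; same maker and category → medium.
function tierOf(p: CatalogProduct, inf: Inference): Tier {
  if (p.brand === inf.brand) return "high";
  if (p.maker === inf.maker && p.category === inf.category) return "medium";
  return "low";
}

// Keep high and medium candidates, high first; ties keep catalog order.
function rankCandidates(
  catalog: CatalogProduct[],
  inf: Inference
): Array<{ product: CatalogProduct; tier: Tier }> {
  const order: Record<Tier, number> = { high: 0, medium: 1, low: 2 };
  return catalog
    .map((product) => ({ product, tier: tierOf(product, inf) }))
    .filter((c) => c.tier !== "low")
    .sort((a, b) => order[a.tier] - order[b.tier]);
}

const inferred: Inference = { maker: "United Spirits", brand: "Black Dog", category: "whisky" };

const catalog: CatalogProduct[] = [
  { name: "Placeholder United Spirits whisky 750ml", maker: "United Spirits", brand: "Placeholder brand", category: "whisky" },
  { name: "Black Dog Triple Gold 750ml", maker: "United Spirits", brand: "Black Dog", category: "whisky" },
  { name: "Black Dog Triple Gold 375ml", maker: "United Spirits", brand: "Black Dog", category: "whisky" },
  { name: "Placeholder vodka 750ml", maker: "Placeholder maker", brand: "Placeholder vodka", category: "vodka" },
];

const ranked = rankCandidates(catalog, inferred);
ranked.forEach((c) => console.log(c.tier, c.product.name));
```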

7. How it is built into LiKAR

Component        Path                            Role
Recommender      lib/barcode-suggest.ts          Pure prefix k-NN: LCP kernel, voting, candidate scoring.
Labeled corpus   data/barcode-corpus.json        775 seed instances (brand + category labels).
Suggest API      /api/barcode/suggest            Merges seed corpus + live real-barcoded products; returns ranked suggestions.
Link API         /api/barcode/link               Writes the scanned code onto the chosen product (clash-checked).
Scan-to-link UI  /dashboard/inventory/scan-link  HID-scanner input → suggestions → one-tap confirm.

The corpus is the union of the baked seed file and every product in the shop's database that already carries a real numeric barcode, so each confirmed link (and every state's data, e.g. Assam helping West Bengal) compounds into better suggestions. Scanning of already-known barcodes is unchanged; this only assists the unknown case.
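One way to form that union is sketched below. Several details are assumptions rather than documented behavior: "real numeric barcode" is read as 12 or 13 digits, live shop data overrides the seed on a conflict, and the type and function names are hypothetical.

```typescript
interface CorpusEntry {
  code: string;
  brand: string;
  category: string;
}

interface Product {
  name: string;
  brand: string;
  category: string;
  barcode: string | null;
}

// Assumption: a "real numeric barcode" is 12 digits (UPC-A) or 13 (EAN-13).
function isRealBarcode(b: string | null): b is string {
  return b !== null && /^\d{12,13}$/.test(b);
}

// Union of the seed corpus and live products, de-duplicated by code.
function buildCorpus(seed: CorpusEntry[], products: Product[]): CorpusEntry[] {
  const byCode = new Map<string, CorpusEntry>();
  seed.forEach((e) => byCode.set(e.code, e));
  products.forEach((p) => {
    const code = p.barcode;
    if (isRealBarcode(code)) {
      // Assumed policy: the shop's own confirmed link wins over the seed.
      byCode.set(code, { code, brand: p.brand, category: p.category });
    }
  });
  return Array.from(byCode.values());
}

// Made-up data for illustration.
const seed: CorpusEntry[] = [
  { code: "8902967100414", brand: "Black Dog", category: "whisky" },
  { code: "8901234000017", brand: "Brand C (made up)", category: "rum" },
];

const products: Product[] = [
  { name: "Newly linked bottle", brand: "Black Dog", category: "whisky", barcode: "8902967100421" },
  { name: "Internal SKU only", brand: "Brand D (made up)", category: "gin", barcode: "LOCAL-001" },
  { name: "No barcode yet", brand: "Brand E (made up)", category: "beer", barcode: null },
];

const merged = buildCorpus(seed, products);
console.log(merged.map((e) => e.code));
```

Only the newly linked bottle joins the two seed entries here; the placeholder SKU and the product without a barcode are skipped.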

8. Related work & prior research

We surveyed the landscape before claiming novelty. Three lines of prior work touch this problem:

System                           Direction                      Needs               Indian liquor coverage  Self-learning
Verified by GS1 / GEPIR          code → owner (lookup)          brand registration  sparse / foreign code   no
Go-UPC, EAN-Search, OFF          code → product (lookup)        aggregated data     minimal                 crowd
UPC-prefix validation (patents)  validate a submission          existing catalog    n/a                     no
LiKAR scan-to-link (this)        code → product (infer & rank)  small seed corpus   native (890…)           yes (online)

9. Novelty & positioning

Being precise about the delta: the GS1 prefix↔brand-owner association and prefix / k-NN string matching are known (see §8); we invent neither. Our contribution is the composition: turning prefix k-NN into a forward barcode→product recommender (unknown code → ranked catalog candidates, not a yes/no validation or a registry lookup), wired into a self-bootstrapping scan-to-link loop that runs with no external database, no API and no GS1 fee, for a domain (Indian liquor retail) that the lookup services above do not cover. We did not find this specific forward-recommender-plus-bootstrapping workflow in the prior art or in any shipping Indian liquor POS we examined; to our knowledge it is novel in this domain, but we state that as a surveyed claim, not an unverifiable “world-first.” The measured top-k results in §5 are the substantive contribution.

10. Limitations

11. Reproducibility

All artifacts ship in the LiKAR repository: the algorithm (lib/barcode-suggest.ts), the corpus (data/barcode-corpus.json), and the APIs/UI above. The evaluation is leave-one-out over the labeled corpus; the metrics in §5 are produced directly from it. The live tool is at /dashboard/inventory/scan-link.