Technical Report · GoWave Automation Lab · 2026
Scan-to-Link: inferring the product behind an unknown bottle barcode with prefix k-NN
A lightweight, training-free recommender that turns the digits of an unscanned EAN-13/UPC-A liquor barcode into a ranked list of catalog products — so a shopkeeper can capture real barcodes by tapping a suggestion instead of typing. It exploits the GS1 company-prefix structure and a labeled corpus that grows every time a link is confirmed. Built into LiKAR's point-of-sale.
1. Problem
Indian liquor lists (e.g. the West Bengal Excise registry) publish brand name, pack size and MRP, but no barcodes. Public barcode databases (Open Food Facts, etc.) barely cover Indian IMFL, and when they do they return the foreign-market code (a French Heineken 871…) rather than the Indian-retail 890… bottle. The authoritative source (GS1 India) sells a company prefix for issuing your own codes (₹48k+/yr), which is useless for looking up someone else's products. So a liquor POS starts with thousands of products and almost no scannable barcodes.
The only 100%-correct source is the bottle itself. The question this work answers: when a clerk scans a bottle we have never seen, can we guess which catalog product it is — from the digits alone — well enough to make capture a single tap?
2. Signal in the digits
An EAN-13 is not random. Under GS1, the leading digits are the company prefix (the brand owner), followed by an item reference the company assigns to each SKU — typically in contiguous blocks, so size variants of one product sit on adjacent numbers.
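Two bottles from the same maker share the green segment; size variants of one product share most of the amber segment. (The colours refer to the barcode diagram in the original report: green marks the company prefix, amber the item reference.)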
Therefore the longest common prefix (LCP) between an unknown code and a known one is a direct proxy for “same maker / same brand / same SKU family.” This is the kernel of our model.
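As a minimal sketch of that proxy, the function below counts the leading digits two codes share. The name `lcpLength` and the three comparison codes are made up for illustration; only the scanned code comes from the worked example in §6.

```typescript
// Length of the longest common prefix (LCP) of two digit strings.
function lcpLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a.charAt(i) === b.charAt(i)) i++;
  return i;
}

const scanned = "8902967100421"; // the code used in the worked example (§6)

// Hypothetical comparison codes, chosen only to show the three regimes.
const sameFamily = lcpLength(scanned, "8902967100438"); // 11: adjacent item numbers
const sameMaker = lcpLength(scanned, "8902967205110"); // 7: same leading 7 digits only
const unrelated = lcpLength(scanned, "8901522000123"); // 3: only the 890 country range

console.log({ sameFamily, sameMaker, unrelated });
```

A longer shared prefix narrows the guess from "same country range" to "same maker" to "same SKU family", which is why a single integer is enough to rank neighbours.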
3. Model
We use an instance-based (lazy) k-Nearest-Neighbors classifier with a custom prefix-similarity kernel. No parameters are trained by gradient descent; the “model” is the labeled corpus, and it updates online. Concretely:
- Similarity kernel: sim(x, c) = LCP(x, c) / 13 over the digit strings, with item-reference numeric distance as a tie-break.
- Neighbourhood: the k=25 corpus codes with the highest LCP (requiring LCP ≥ 6).
- Voting: neighbours vote for a brand and a category, weighted by similarity → the predicted manufacturer prefix (first 7 digits, used only when a neighbour shares ≥7), brand and category.
- Candidate scoring: each catalog product without a real barcode is scored 1.0 × LCP/13 if it shares the brand of a neighbour (strong), or 0.5 × LCP/13 if only the manufacturer + category match (weak), then ranked.
- Online bootstrapping (self-supervision): every confirmed link adds a new labeled instance, so accuracy and coverage rise with use. This is a form of human-in-the-loop active learning.
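The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the contents of lib/barcode-suggest.ts: all type and function names are hypothetical, the tie-break is approximated by the numeric gap between the first 12 digits, a weak candidate reuses the best neighbour's LCP, and the catalog is assumed to expose a `makerPrefix` field for products whose maker is already known.

```typescript
interface Labeled { code: string; brand: string; category: string }
interface Neighbour extends Labeled { lcp: number }
// `makerPrefix` is a hypothetical field; the report does not describe the catalog schema.
interface Product { name: string; brand: string; category: string; makerPrefix?: string }
interface Suggestion { product: Product; score: number; strength: "strong" | "weak" }

const K = 25, MIN_LCP = 6, CODE_LEN = 13, MAKER_LEN = 7;

function sharedPrefix(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a.charAt(i) === b.charAt(i)) i++;
  return i;
}

// Assumed tie-break: numeric gap between the first 12 digits (check digit dropped).
function gap(a: string, b: string): number {
  return Math.abs(Number(a.slice(0, 12)) - Number(b.slice(0, 12)));
}

// Neighbourhood: up to K corpus codes with the highest LCP, requiring LCP >= MIN_LCP.
function nearest(x: string, corpus: Labeled[]): Neighbour[] {
  return corpus
    .map((c) => ({ ...c, lcp: sharedPrefix(x, c.code) }))
    .filter((n) => n.lcp >= MIN_LCP)
    .sort((p, q) => q.lcp - p.lcp || gap(x, p.code) - gap(x, q.code))
    .slice(0, K);
}

// Similarity-weighted vote over one label field.
function vote(ns: Neighbour[], key: "brand" | "category"): string | undefined {
  const tally: Record<string, number> = {};
  for (const n of ns) tally[n[key]] = (tally[n[key]] ?? 0) + n.lcp / CODE_LEN;
  return Object.keys(tally).sort((a, b) => (tally[b] ?? 0) - (tally[a] ?? 0))[0];
}

function suggest(x: string, corpus: Labeled[], catalog: Product[]): Suggestion[] {
  const ns = nearest(x, corpus);
  const top = ns[0];
  if (!top) return []; // cold start: no corpus code shares >= 6 digits
  // The maker prefix is trusted only when some neighbour shares >= 7 digits.
  const maker = top.lcp >= MAKER_LEN ? x.slice(0, MAKER_LEN) : undefined;
  const category = vote(ns, "category");
  const out: Suggestion[] = [];
  for (const product of catalog) {
    // ns is sorted by LCP, so the first same-brand neighbour is the closest one.
    const sibling = ns.filter((n) => n.brand === product.brand)[0];
    if (sibling) {
      out.push({ product, score: (1.0 * sibling.lcp) / CODE_LEN, strength: "strong" });
    } else if (maker && product.makerPrefix === maker && product.category === category) {
      out.push({ product, score: (0.5 * top.lcp) / CODE_LEN, strength: "weak" });
    }
  }
  return out.sort((a, b) => b.score - a.score);
}

// Toy data: "Black Dog" appears in §6; the other brands and all codes are made up.
const demoCorpus: Labeled[] = [
  { code: "8902967100438", brand: "Black Dog", category: "whisky" },
  { code: "8902967100117", brand: "Black Dog", category: "whisky" },
  { code: "8902967205110", brand: "Brand B", category: "whisky" },
  { code: "8901522000123", brand: "Brand C", category: "rum" },
];
const demoCatalog: Product[] = [
  { name: "Brand D Whisky 750ml", brand: "Brand D", category: "whisky", makerPrefix: "8902967" },
  { name: "Black Dog 750ml", brand: "Black Dog", category: "whisky" },
  { name: "Brand C Rum 750ml", brand: "Brand C", category: "rum" },
];
for (const s of suggest("8902967100421", demoCorpus, demoCatalog)) {
  console.log(s.strength, s.score.toFixed(3), s.product.name);
}
```

On this toy data the Black Dog product scores 11/13 as a strong match, the same-maker whisky scores half of that as a weak match, and the rum is dropped because its only corpus code shares just three digits with the scan.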
4. Dataset
The corpus is 775 real EAN-13/UPC-A codes with known product, brand and category, drawn from the official Assam price master and West-Bengal cross-matches shipped in LiKAR. Evaluation uses the 782 labeled Assam codes.
| Property | Value |
|---|---|
| Labeled barcodes (corpus) | 775 |
| Distinct brands | 374 |
| Distinct manufacturer prefixes (7-digit) | 165 |
| Singleton brands (appear once) | 198 |
| Singleton manufacturers | 67 |
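The corpus mixes 13-digit EAN-13 and 12-digit UPC-A codes, while the kernel in §3 divides by 13. The report does not say how the two are aligned. One common convention, assumed here, is to left-pad a UPC-A with a zero, since a UPC-A is the same number as an EAN-13 with a leading 0. The helper name `toGtin13` is hypothetical.

```typescript
// Normalise a scanned string to 13 digits, or return null if it is not
// a 12-digit (UPC-A) or 13-digit (EAN-13) numeric code.
// Assumption: LiKAR's actual normalisation may differ.
function toGtin13(raw: string): string | null {
  const digits = raw.trim();
  if (/^\d{13}$/.test(digits)) return digits;
  if (/^\d{12}$/.test(digits)) return "0" + digits; // UPC-A -> EAN-13 form
  return null;
}

console.log(toGtin13("8902967100421")); // already 13 digits, returned unchanged
console.log(toGtin13("012345678905"));  // 12 digits, gains a leading zero
console.log(toGtin13("LOCAL-17"));      // not a numeric barcode
```

With every code in the same 13-digit form, prefix comparisons between UPC-A and EAN-13 entries line up digit for digit.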
5. Evaluation & results
Protocol: leave-one-out cross-validation. Each code is removed, then ranked against the remaining corpus; we record the rank of its true manufacturer and true brand. We report top-k accuracy on all codes and on the recoverable subset (where ≥2 codes share the label, i.e. a non-cold-start case). The gap between them is pure data sparsity, which the online corpus closes over time.
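The protocol can be sketched as below for the brand target. The ranker here is a deliberately simplified stand-in (similarity-weighted votes with no k cap), not the shipped recommender, and the six-code corpus is invented purely to show how a singleton separates the "all" and "recoverable" figures. All names are hypothetical.

```typescript
interface LabeledCode { code: string; brand: string }

function lcpLen(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a.charAt(i) === b.charAt(i)) i++;
  return i;
}

// Simplified ranker: brands ordered by summed LCP/13 over codes with LCP >= minLcp.
function rankBrands(x: string, corpus: LabeledCode[], minLcp = 6): string[] {
  const tally: Record<string, number> = {};
  for (const c of corpus) {
    const l = lcpLen(x, c.code);
    if (l >= minLcp) tally[c.brand] = (tally[c.brand] ?? 0) + l / 13;
  }
  return Object.keys(tally).sort((a, b) => (tally[b] ?? 0) - (tally[a] ?? 0));
}

// Leave-one-out top-k accuracy. A held-out code is "recoverable" when at
// least one other code in the corpus carries the same brand.
function looTopK(corpus: LabeledCode[], k: number, recoverableOnly: boolean): number {
  let hits = 0;
  let n = 0;
  corpus.forEach((held, i) => {
    const rest = corpus.filter((_, j) => j !== i);
    const recoverable = rest.some((c) => c.brand === held.brand);
    if (recoverableOnly && !recoverable) return;
    n++;
    if (rankBrands(held.code, rest).slice(0, k).indexOf(held.brand) !== -1) hits++;
  });
  return n === 0 ? 0 : hits / n;
}

// Invented toy corpus: two brands with siblings, one singleton brand.
const toy: LabeledCode[] = [
  { code: "8902967100438", brand: "Black Dog" },
  { code: "8902967100117", brand: "Black Dog" },
  { code: "8902967100421", brand: "Black Dog" },
  { code: "8901234500017", brand: "Brand B" },
  { code: "8901234500024", brand: "Brand B" },
  { code: "8901522000123", brand: "Brand C" }, // singleton: always a miss
];
console.log("top-1, all:", looTopK(toy, 1, false).toFixed(3));
console.log("top-1, recoverable:", looTopK(toy, 1, true).toFixed(3));
```

On the toy corpus the only miss is the singleton, so the recoverable score is 1 while the all-codes score is 5/6, the same kind of gap the tables below attribute to cold-start labels.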
| Target | Subset | n | Top-1 | Top-5 | Top-10 |
|---|---|---|---|---|---|
| Manufacturer | all | 782 | 91.4% | 91.4% | 91.4% |
| Manufacturer | recoverable | 715 | 100% | 100% | 100% |
| Brand | all | 782 | 61.8% | 69.4% | 73.0% |
| Brand | recoverable | 584 | 82.7% | 93.0% | 97.8% |
Reading the numbers. The manufacturer is identified top-1 in 91.4% of cases (100% whenever the maker is not a singleton). Because the manufacturer is known, the clerk sees a short, correct shortlist. The true brand is the single best guess 82.7% of the time on recoverable codes, rising to 97.8% within the top-10; that is, on a normal scan the right product is almost always one tap away. The lower “all” brand numbers reflect cold-start singletons (a brand seen for the first time has no sibling to match), exactly the cases the bootstrapping loop fixes.
6. Worked example
Scan 8902967100421 (unknown):
- Nearest corpus codes share 8902967100… (LCP = 10) → all Black Dog SKUs.
- Manufacturer prefix 8902967 → United Spirits (✓ 98–100% reliable).
- Brand vote → Black Dog; category → whisky.
- Catalog candidates ranked: Black Dog Triple Gold 750ml / 375ml / 180ml (high) → other United Spirits whiskies (medium).
- Clerk taps the right size → 8902967100421 saved; corpus +1.
7. How it is built into LiKAR
| Component | Path | Role |
|---|---|---|
| Recommender | lib/barcode-suggest.ts | Pure prefix k-NN: LCP kernel, voting, candidate scoring. |
| Labeled corpus | data/barcode-corpus.json | 775 seed instances (brand + category labels). |
| Suggest API | /api/barcode/suggest | Merges seed corpus + live real-barcoded products; returns ranked suggestions. |
| Link API | /api/barcode/link | Writes the scanned code onto the chosen product (clash-checked). |
| Scan-to-link UI | /dashboard/inventory/scan-link | HID-scanner input → suggestions → one-tap confirm. |
The corpus is the union of the baked seed file and every product in the shop's database that already carries a real numeric barcode, so each confirmed link (and every state's data, e.g. Assam helping West Bengal) compounds into better suggestions. Scanning of already-known barcodes is unchanged; this only assists the unknown case.
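A minimal sketch of that union and of the clash check on linking follows. The row shape `ShopProduct`, the helper names, and the rule that a "real numeric barcode" means 12 or 13 digits are all assumptions made for illustration; the actual /api/barcode/suggest and /api/barcode/link handlers are not shown in this report.

```typescript
interface CorpusEntry { code: string; brand: string; category: string }
// Hypothetical shape of a product row in the shop database.
interface ShopProduct { id: number; brand: string; category: string; barcode: string | null }

// Assumption: a "real numeric barcode" is a 12- or 13-digit string.
function isRealBarcode(s: string | null): s is string {
  return s !== null && /^\d{12,13}$/.test(s);
}

// Union of seed entries and live barcoded products, de-duplicated by code.
// Assumption: a live product overrides a seed entry with the same code.
function buildCorpus(seed: CorpusEntry[], products: ShopProduct[]): CorpusEntry[] {
  const byCode: Record<string, CorpusEntry> = {};
  for (const s of seed) byCode[s.code] = s;
  for (const p of products) {
    if (isRealBarcode(p.barcode)) {
      byCode[p.barcode] = { code: p.barcode, brand: p.brand, category: p.category };
    }
  }
  return Object.keys(byCode).map((k) => byCode[k] as CorpusEntry);
}

// Clash-checked link: refuse if the code is malformed, the product is
// unknown, or another product already carries the code.
function linkBarcode(products: ShopProduct[], productId: number, code: string): boolean {
  if (!isRealBarcode(code)) return false;
  if (products.some((p) => p.barcode === code && p.id !== productId)) return false;
  const target = products.filter((p) => p.id === productId)[0];
  if (!target) return false;
  target.barcode = code;
  return true;
}

// Toy data, invented for illustration.
const seedFile: CorpusEntry[] = [{ code: "8902967100438", brand: "Black Dog", category: "whisky" }];
const shop: ShopProduct[] = [
  { id: 1, brand: "Black Dog", category: "whisky", barcode: null },
  { id: 2, brand: "Brand B", category: "rum", barcode: "LOCAL-17" }, // placeholder, not a real barcode
];
console.log(buildCorpus(seedFile, shop).length); // seed only
console.log(linkBarcode(shop, 1, "8902967100421")); // clerk confirms a link
console.log(buildCorpus(seedFile, shop).length); // corpus has grown by one
```

Because the corpus is rebuilt from the database, a confirmed link needs no separate training step: the next call to the suggest path simply sees one more labeled code.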
8. Related work & prior research
We surveyed the landscape before claiming novelty. Three lines of prior work touch this problem:
- Authoritative lookup (GS1). GEPIR — now Verified by GS1 (2024) — resolves a GTIN to its registered brand-owner and attributes. It is reverse lookup and depends on the brand having uploaded data; GS1 itself notes new, discontinued or small-manufacturer codes are often absent. For Indian liquor it is sparse and, when present, returns the origin-market code, not the Indian retail bottle.
- Commercial / crowd databases. Go-UPC, EAN-Search, Barcode Spider, Open Food Facts etc. aggregate GTIN→product records. Same reverse direction, same coverage gap for Indian IMFL (empirically confirmed in §1).
- Prefix-based brand validation (patents). Prior patents (e.g. US 9,679,321 / 10,872,366 on product-identification validation) use the observation that a brand's products share a common UPC prefix — but to validate whether a newly submitted product legitimately belongs to a brand, given an existing catalog. The prefix↔brand association is therefore established prior art; we do not claim it.
| System | Direction | Needs | IN liquor | Self-learning |
|---|---|---|---|---|
| Verified by GS1 / GEPIR | code → owner (lookup) | brand registration | sparse / foreign code | no |
| Go-UPC, EAN-Search, OFF | code → product (lookup) | aggregated data | minimal | crowd |
| UPC-prefix validation (patents) | validate a submission | existing catalog | n/a | no |
| LiKAR scan-to-link (this) | code → product (infer & rank) | small seed corpus | native (890…) | yes (online) |
9. Novelty & positioning
Being precise about the delta: the GS1 prefix↔brand-owner association and prefix / k-NN string matching are known (see §8); we invent neither. Our contribution is the composition: turning prefix k-NN into a forward barcode→product recommender (unknown code → ranked catalog candidates, not a yes/no validation or a registry lookup), wired into a self-bootstrapping scan-to-link loop that runs with no external database, no API and no GS1 fee, for a domain (Indian liquor retail) that the lookup services above do not cover. We did not find this specific forward-recommender-plus-bootstrapping workflow in the prior art or in any shipping Indian liquor POS we examined; to our knowledge it is novel in this domain, but we state that as a surveyed claim, not an unverifiable “world-first.” The measured top-k results in §5 are the substantive contribution.
10. Limitations
- Cold start. A brand/maker absent from the corpus yields no suggestion until one bottle is linked manually (then it self-heals).
- Prefix length varies. GS1 issues shorter prefixes to large firms and pooled GTINs to small ones; we use longest-common-prefix + nearest-neighbour rather than a fixed cut to absorb this.
- Brand ≠ SKU. The model is confident on maker and brand; the exact size variant relies on the human tap (a 2–4 item shortlist).
- Regional makers. WB-local country-liquor producers are under-represented in the Assam-seeded corpus and start cold.
- Evaluated on a 775/782-code corpus; absolute numbers will shift as the corpus scales, but the structure (high manufacturer, top-k brand) is stable.
11. Reproducibility
All artifacts ship in the LiKAR repository: the algorithm (lib/barcode-suggest.ts), the corpus (data/barcode-corpus.json), and the APIs/UI above. The evaluation is leave-one-out over the labeled corpus; the metrics in §5 are produced directly from it. The live tool is at /dashboard/inventory/scan-link.