Technical Report · GoWave Automation Lab · 2026

Scan-to-Link: inferring the product behind an unknown bottle barcode with prefix k-NN

A lightweight, training-free recommender that turns the digits of an unscanned EAN-13/UPC-A liquor barcode into a ranked list of catalog products — so a shopkeeper can capture real barcodes by tapping a suggestion instead of typing. It exploits the GS1 company-prefix structure and a labeled corpus that grows every time a link is confirmed. Built into LiKAR's point-of-sale.

Manufacturer top-1: 91.4%
Brand top-1 (recoverable): 82.7%
Brand top-10 (recoverable): 97.8%
Cost: ₹0 · no API, no GS1 fee

1. Problem

Indian liquor lists (e.g. the West Bengal Excise registry) publish brand name, pack size and MRP, but no barcodes. Public barcode databases (Open Food Facts, etc.) barely cover Indian IMFL, and when they do they return the foreign-market code (a French Heineken 871…) rather than the Indian-retail 890… bottle. The authoritative source (GS1 India) sells a company prefix for issuing your own codes (₹48k+/yr), which is useless for looking up someone else's products. So a liquor POS starts with thousands of products and almost no scannable barcodes.

The only 100%-correct source is the bottle itself. The question this work answers: when a clerk scans a bottle we have never seen, can we guess which catalog product it is — from the digits alone — well enough to make capture a single tap?

2. Signal in the digits

An EAN-13 is not random. Under GS1, the leading digits are the company prefix (the brand owner), followed by an item reference the company assigns to each SKU — typically in contiguous blocks, so size variants of one product sit on adjacent numbers.

8 9 0 2 9 6 7 1 0 0 4 2 1

890 = GS1 India · 2967 = company (United Spirits) · 10042 = item ref (Black Dog block) · 1 = check

Two bottles from the same maker share the company-prefix segment; size variants of one product share most of the item-reference segment.

Therefore the longest common prefix (LCP) between an unknown code and a known one is a direct proxy for “same maker / same brand / same SKU family.” This is the kernel of our model.
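To make the LCP idea concrete, here is a minimal TypeScript sketch. The function name `lcp` and the three comparison codes are assumptions invented for illustration; only the scanned code comes from the example above.

```typescript
// Length of the longest common prefix of two digit strings.
function lcp(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

const scanned = "8902967100421"; // the example code from this section

// Hypothetical comparison codes, invented for illustration only.
const sameBlock = "8902967100438"; // same company prefix, adjacent item reference
const sameMaker = "8902967205516"; // same company prefix, different item block
const otherMaker = "8901522001234"; // shares only the 890 country prefix

const lcpSameBlock = lcp(scanned, sameBlock); // 11
const lcpSameMaker = lcp(scanned, sameMaker); // 7
const lcpOtherMaker = lcp(scanned, otherMaker); // 3
```

The three values fall in the order the section predicts: a sibling SKU scores highest, another product from the same maker stops at the company prefix, and an unrelated Indian product shares only the country digits.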

3. Model

We use an instance-based (lazy) k-Nearest-Neighbors classifier with a custom prefix-similarity kernel. No parameters are trained by gradient descent; the “model” is the labeled corpus, and it updates online. Concretely:

Similarity kernel: sim(x, c) = LCP(x, c) / 13 over the digit strings, with item-reference numeric distance as a tie-break.
Neighbourhood: the k=25 corpus codes with the highest LCP (requiring LCP ≥ 6).
Voting: neighbours vote for a brand and a category, weighted by similarity. The output is the predicted manufacturer prefix (the first 7 digits, used only when a neighbour shares ≥7 digits), brand and category.
Candidate scoring: each catalog product without a real barcode is scored — 1.0 × LCP/13 if it shares the brand of a neighbour (strong), or 0.5 × LCP/13 if only the manufacturer + category match (weak) — then ranked.
Online bootstrapping (self-supervision): every confirmed link adds a new labeled instance, so accuracy and coverage rise with use. This is a form of human-in-the-loop active learning.
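The kernel, neighbourhood and voting steps above can be sketched as follows. This is an illustrative sketch, not the contents of `lib/barcode-suggest.ts`: every type, function and constant name is an assumption, and the tie-break is approximated as the numeric gap between the first 12 digits because the report does not define "item-reference numeric distance" precisely.

```typescript
// Sketch only: names and the tie-break definition are assumptions.
interface LabeledCode {
  code: string; // 13-digit string
  brand: string;
  category: string;
}

interface Neighbour extends LabeledCode {
  lcp: number;
  sim: number; // lcp / 13
}

interface Prediction {
  makerPrefix: string | undefined;
  brand: string | undefined;
  category: string | undefined;
  neighbours: Neighbour[];
}

const K = 25;
const MIN_LCP = 6;
const MAKER_PREFIX_LEN = 7;
const CODE_LEN = 13;

function lcp(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Assumed tie-break: numeric gap between the first 12 digits (check digit dropped).
function numericGap(a: string, b: string): number {
  return Math.abs(Number(a.slice(0, 12)) - Number(b.slice(0, 12)));
}

// The K corpus codes with the highest LCP, keeping only those with LCP >= MIN_LCP.
function nearest(x: string, corpus: LabeledCode[]): Neighbour[] {
  return corpus
    .map((c): Neighbour => {
      const l = lcp(x, c.code);
      return { ...c, lcp: l, sim: l / CODE_LEN };
    })
    .filter((n) => n.lcp >= MIN_LCP)
    .sort((a, b) => b.lcp - a.lcp || numericGap(x, a.code) - numericGap(x, b.code))
    .slice(0, K);
}

// Similarity-weighted vote over one label field.
function vote(neighbours: Neighbour[], key: "brand" | "category"): string | undefined {
  const tally: Record<string, number> = {};
  for (const n of neighbours) {
    tally[n[key]] = (tally[n[key]] ?? 0) + n.sim;
  }
  let best: string | undefined;
  let bestWeight = 0;
  for (const label of Object.keys(tally)) {
    const w = tally[label] ?? 0;
    if (w > bestWeight) {
      best = label;
      bestWeight = w;
    }
  }
  return best;
}

function predict(x: string, corpus: LabeledCode[]): Prediction {
  const neighbours = nearest(x, corpus);
  const closest = neighbours[0];
  const sharesMaker = closest !== undefined && closest.lcp >= MAKER_PREFIX_LEN;
  return {
    makerPrefix: sharesMaker ? x.slice(0, MAKER_PREFIX_LEN) : undefined,
    brand: vote(neighbours, "brand"),
    category: vote(neighbours, "category"),
    neighbours,
  };
}
```

Because nothing is fitted, adding a confirmed link is just appending one `LabeledCode` to the array passed to `predict`; the next scan sees it immediately.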

Pipeline: scan unknown code → k-NN by LCP over corpus → vote for maker / brand / category → rank catalog candidates → clerk taps, link saved (corpus grows)
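The candidate-ranking step can be sketched in the same spirit. The report does not say how a barcode-less catalog product is matched to a manufacturer, so this sketch assumes both neighbours and catalog products carry a hypothetical `maker` label; all type and function names are illustrative too. A product takes the best score any neighbour gives it.

```typescript
// Sketch only: the `maker` field and all names here are assumptions.
interface ScoredNeighbour {
  brand: string;
  category: string;
  maker: string;
  lcp: number; // LCP between the scanned code and this neighbour's code
}

interface CatalogProduct {
  id: string;
  label: string;
  brand: string;
  category: string;
  maker: string;
  barcode: string | null; // null = no real barcode captured yet
}

interface Suggestion {
  product: CatalogProduct;
  score: number;
  strength: "strong" | "weak";
}

const STRONG_WEIGHT = 1.0; // product shares the brand of a neighbour
const WEAK_WEIGHT = 0.5; // only manufacturer + category match

function rankCandidates(neighbours: ScoredNeighbour[], catalog: CatalogProduct[]): Suggestion[] {
  const suggestions: Suggestion[] = [];
  for (const product of catalog) {
    if (product.barcode !== null) continue; // only products still missing a barcode
    let best: Suggestion | undefined;
    for (const n of neighbours) {
      let candidate: Suggestion | undefined;
      if (n.brand === product.brand) {
        candidate = { product, score: (STRONG_WEIGHT * n.lcp) / 13, strength: "strong" };
      } else if (n.maker === product.maker && n.category === product.category) {
        candidate = { product, score: (WEAK_WEIGHT * n.lcp) / 13, strength: "weak" };
      }
      if (candidate !== undefined && (best === undefined || candidate.score > best.score)) {
        best = candidate;
      }
    }
    if (best !== undefined) suggestions.push(best);
  }
  return suggestions.sort((a, b) => b.score - a.score);
}
```

With a single Black Dog neighbour at LCP = 10, a Black Dog product scores 10/13 ≈ 0.77 (strong) and another whisky from the same maker scores 0.5 × 10/13 ≈ 0.38 (weak), so brand matches always sort above maker-only matches at equal LCP.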

4. Dataset

The corpus is 775 real EAN-13/UPC-A codes with known product, brand and category, drawn from the official Assam price master and West-Bengal cross-matches shipped in LiKAR. Evaluation uses the 782 labeled Assam codes.

Property	Value
Labeled barcodes (corpus)	775
Distinct brands	374
Distinct manufacturer prefixes (7-digit)	165
Singleton brands (appear once)	198
Singleton manufacturers	67

5. Evaluation & results

Protocol: leave-one-out cross-validation. Each code is removed, then ranked against the remaining corpus; we record the rank of its true manufacturer and true brand. We report top-k accuracy on all codes and on the recoverable subset (where ≥2 codes share the label, i.e. a non-cold-start case). The gap between them is pure data sparsity, which the online corpus closes over time.
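A minimal sketch of this protocol for the brand target, assuming a pluggable ranker, is shown below. The names (`leaveOneOut`, `Ranker`, `rankByLcp`) are hypothetical, and `rankByLcp` is a simplified stand-in that orders brands by their best LCP rather than by the weighted vote of §3.

```typescript
// Sketch only: names are assumptions; rankByLcp is a simplified stand-in ranker.
interface LabeledCode {
  code: string;
  brand: string;
}

type Ranker = (code: string, rest: LabeledCode[]) => string[];

interface Accuracy {
  n: number;
  topK: Record<number, number>; // k -> fraction of codes whose true brand is in the top k
}

function lcp(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

// Brands ordered by the best LCP any of their codes achieves against `code`.
const rankByLcp: Ranker = (code, rest) => {
  const best: Record<string, number> = {};
  for (const c of rest) {
    best[c.brand] = Math.max(best[c.brand] ?? 0, lcp(code, c.code));
  }
  return Object.keys(best).sort((a, b) => (best[b] ?? 0) - (best[a] ?? 0));
};

function leaveOneOut(
  corpus: LabeledCode[],
  rank: Ranker,
  ks: number[],
): { all: Accuracy; recoverable: Accuracy } {
  const perBrand: Record<string, number> = {};
  for (const c of corpus) perBrand[c.brand] = (perBrand[c.brand] ?? 0) + 1;

  const hitsAll: Record<number, number> = {};
  const hitsRecoverable: Record<number, number> = {};
  let nRecoverable = 0;

  corpus.forEach((held, i) => {
    const rest = corpus.filter((_, j) => j !== i); // remove the held-out code
    const position = rank(held.code, rest).indexOf(held.brand); // -1 if never suggested
    const recoverable = (perBrand[held.brand] ?? 0) >= 2; // label has at least one sibling
    if (recoverable) nRecoverable += 1;
    for (const k of ks) {
      const hit = position >= 0 && position < k ? 1 : 0;
      hitsAll[k] = (hitsAll[k] ?? 0) + hit;
      if (recoverable) hitsRecoverable[k] = (hitsRecoverable[k] ?? 0) + hit;
    }
  });

  const toAccuracy = (hits: Record<number, number>, n: number): Accuracy => {
    const topK: Record<number, number> = {};
    for (const k of ks) topK[k] = n > 0 ? (hits[k] ?? 0) / n : 0;
    return { n, topK };
  };

  return {
    all: toAccuracy(hitsAll, corpus.length),
    recoverable: toAccuracy(hitsRecoverable, nRecoverable),
  };
}
```

A singleton brand can never be ranked once its only code is held out, which is why it counts as a miss in the "all" figure and is excluded from the "recoverable" one.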

Target	Subset	n	Top-1	Top-5	Top-10
Manufacturer	all	782	91.4%	91.4%	91.4%
Manufacturer	recoverable	715	100%	100%	100%
Brand	all	782	61.8%	69.4%	73.0%
Brand	recoverable	584	82.7%	93.0%	97.8%

Reading the numbers. The manufacturer is identified top-1 in 91.4% of cases (100% whenever the maker is not a singleton). Because the manufacturer is known, the clerk sees a short, correct list. The true brand is the single best guess 82.7% of the time on recoverable codes, rising to 97.8% within the top-10, i.e. on a normal scan the right product is almost always one tap away. The lower “all” brand numbers reflect cold-start singletons (a brand seen for the first time has no sibling to match), exactly the cases the bootstrapping loop fixes.

6. Worked example

Scan 8902967100421 (unknown):

Nearest corpus codes share 8902967100… (LCP = 10) → all Black Dog SKUs.
Manufacturer prefix 8902967 → United Spirits (✓ 98–100% reliable).
Brand vote → Black Dog; category → whisky.
Catalog candidates ranked: Black Dog Triple Gold 750ml / 375ml / 180ml (high) → other United Spirits whiskies (medium).
Clerk taps the right size → 8902967100421 saved; corpus +1.

7. How it is built into LiKAR

Component	Path	Role
Recommender	`lib/barcode-suggest.ts`	Pure prefix k-NN: LCP kernel, voting, candidate scoring.
Labeled corpus	`data/barcode-corpus.json`	775 seed instances (brand + category labels).
Suggest API	`/api/barcode/suggest`	Merges seed corpus + live real-barcoded products; returns ranked suggestions.
Link API	`/api/barcode/link`	Writes the scanned code onto the chosen product (clash-checked).
Scan-to-link UI	`/dashboard/inventory/scan-link`	HID-scanner input → suggestions → one-tap confirm.

The corpus is the union of the baked seed file and every product in the shop's database that already carries a real numeric barcode, so each confirmed link (and every state's data, e.g. Assam helping West Bengal) compounds into better suggestions. Scanning of already-known barcodes is unchanged; this only assists the unknown case.
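One way to picture that union is the sketch below. It is not the Suggest API's actual code: the `ShopProduct` shape, the function names, the rule that a "real numeric barcode" is 12 or 13 digits, and the choice to let a live link win over a seed entry with the same code are all assumptions.

```typescript
// Sketch only: types, names and the precedence rule are assumptions.
interface LabeledCode {
  code: string;
  brand: string;
  category: string;
}

interface ShopProduct {
  id: string;
  brand: string;
  category: string;
  barcode: string | null; // may hold an internal placeholder code or nothing at all
}

// Assumption: a "real numeric barcode" is 12 digits (UPC-A) or 13 digits (EAN-13).
function isRealBarcode(value: string): boolean {
  return /^\d{12,13}$/.test(value);
}

function buildCorpus(seed: LabeledCode[], products: ShopProduct[]): LabeledCode[] {
  const seen: Record<string, boolean> = {};
  const merged: LabeledCode[] = [];

  // Live, shop-confirmed links go first so they win over a seed entry with the same code.
  for (const p of products) {
    const code = p.barcode;
    if (code !== null && isRealBarcode(code) && !seen[code]) {
      seen[code] = true;
      merged.push({ code, brand: p.brand, category: p.category });
    }
  }

  // Seed instances fill in every code the shop has not linked itself.
  for (const s of seed) {
    if (!seen[s.code]) {
      seen[s.code] = true;
      merged.push(s);
    }
  }
  return merged;
}
```

Rebuilding the merged list on each request keeps the loop simple: a link saved by the Link API shows up as a labeled instance on the very next suggestion call, with no retraining step.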

8. Related work & prior research

We surveyed the landscape before claiming novelty. Three lines of prior work touch this problem:

Authoritative lookup (GS1). GEPIR, now Verified by GS1 (2024), resolves a GTIN to its registered brand owner and attributes. It is a reverse lookup and depends on the brand having uploaded data; GS1 itself notes new, discontinued or small-manufacturer codes are often absent. For Indian liquor it is sparse and, when present, returns the origin-market code, not the Indian retail bottle.
Commercial / crowd databases. Go-UPC, EAN-Search, Barcode Spider, Open Food Facts etc. aggregate GTIN→product records. Same reverse direction, same coverage gap for Indian IMFL (empirically confirmed in §1).
Prefix-based brand validation (patents). Prior patents (e.g. US 9,679,321 / 10,872,366 on product-identification validation) use the observation that a brand's products share a common UPC prefix — but to validate whether a newly submitted product legitimately belongs to a brand, given an existing catalog. The prefix↔brand association is therefore established prior art; we do not claim it.

System	Direction	Needs	IN liquor	Self-learning
Verified by GS1 / GEPIR	code → owner (lookup)	brand registration	sparse / foreign code	no
Go-UPC, EAN-Search, OFF	code → product (lookup)	aggregated data	minimal	crowd
UPC-prefix validation (patents)	validate a submission	existing catalog	n/a	no
LiKAR scan-to-link (this)	code → product (infer & rank)	small seed corpus	native (890…)	yes (online)

9. Novelty & positioning

Being precise about the delta: the GS1 prefix↔brand-owner association and prefix / k-NN string matching are known (see §8); we invent neither. Our contribution is the composition: turning prefix k-NN into a forward barcode→product recommender (unknown code → ranked catalog candidates, not a yes/no validation or a registry lookup), wired into a self-bootstrapping scan-to-link loop that runs with no external database, no API and no GS1 fee, for a domain (Indian liquor retail) that the lookup services above do not cover. We did not find this specific forward-recommender-plus-bootstrapping workflow in the prior art or in any shipping Indian liquor POS we examined; to our knowledge it is novel in this domain, but we state that as a surveyed claim, not an unverifiable “world-first.” The measured top-k results in §5 are the substantive contribution.

10. Limitations

Cold start. A brand/maker absent from the corpus yields no suggestion until one bottle is linked manually (then it self-heals).
Prefix length varies. GS1 issues shorter prefixes to large firms and pooled GTINs to small ones; we use longest-common-prefix + nearest-neighbour rather than a fixed cut to absorb this.
Brand ≠ SKU. The model is confident on maker and brand; the exact size variant relies on the human tap (a 2–4 item shortlist).
Regional makers. WB-local country-liquor producers are under-represented in the Assam-seeded corpus and start cold.
Evaluated on a 775/782-code corpus; absolute numbers will shift as the corpus scales, but the structure (high manufacturer, top-k brand) is stable.

11. Reproducibility

All artifacts ship in the LiKAR repository: the algorithm (lib/barcode-suggest.ts), the corpus (data/barcode-corpus.json), and the APIs/UI above. The evaluation is leave-one-out over the labeled corpus; the metrics in §5 are produced directly from it. The live tool is at /dashboard/inventory/scan-link.
