Blog
Engineering

Fuzzy Matching at Scale: How Inlet Resolves 30,000 Ingredient Names

INCI names, CAS numbers, common names, and synonyms — how we use rapidfuzz with multi-threshold matching to handle misspellings, abbreviations, and regional naming conventions.

10 min read | November 2025

Introduction

When a cosmetics company uploads a product formulation to Inlet, the first step isn't compliance checking — it's ingredient resolution. The uploaded file contains ingredient names as the manufacturer wrote them: sometimes INCI standard, sometimes common names, sometimes abbreviated, sometimes misspelled. Before we can check any ingredient against any regulation, we need to know exactly which ingredient it is.

This is harder than it sounds. Our master database contains 30,991 ingredients, each with up to 6 name variants (INCI name, CAS number, EC number, E number, common names, and synonyms). The uploaded name "Sodium Lauryl Sulfate" could match INCI "Sodium Lauryl Sulfate", or the user might have typed "SLS", "Sodium dodecyl sulfate", or even "Sodim Laryl Sulphate" (misspelled, and using the British "sulphate" spelling). All of these should resolve to the same canonical ingredient.

The Challenge

Ingredient name matching sits at the critical intersection of precision and recall. Getting it wrong in either direction is costly:

False negative

A restricted ingredient isn't matched to its canonical name. The compliance check misses it. A non-compliant product reaches market.

False positive

An ingredient is matched to the wrong canonical name — one that happens to be restricted. A compliant product gets flagged. User trust erodes.

Simple string matching (exact or case-insensitive) catches maybe 60% of uploaded names. The remaining 40% have some form of variation: word order differences ("Sulfate, Sodium Lauryl" vs "Sodium Lauryl Sulfate"), spelling variants ("Sulphate" vs "Sulfate"), abbreviations ("PEG-40" vs "Polyethylene Glycol-40"), or OCR errors from scanned documents.

The Master Database

Our ingredient database is the foundation of the matching pipeline. Each ingredient record carries multiple identifiers:

Ingredient Record Schema

  • inci_name ("Sodium Lauryl Sulfate"): Primary key for matching. International Nomenclature of Cosmetic Ingredients.
  • cas_number ("151-21-3"): Chemical Abstracts Service registry number. Globally unique chemical identifier.
  • ec_number ("205-788-1"): European Commission number. Used in EU REACH registrations.
  • e_number ("E487"): Food additive code. Used in food & beverage regulations.
  • common_names (["SLS", "Sodium dodecyl sulfate"]): JSON array of alternative names, abbreviations, and regional variants.
  • synonyms (["Sulfuric acid monododecyl ester"]): Chemical synonyms from CosIng and PubChem.
  • cosing_functions (["Surfactant", "Emulsifying"]): CosIng functional classification. Helps disambiguate similar names.

When matching an uploaded name, we search across all these fields — not just inci_name. A user who uploads a formulation with "151-21-3" instead of "Sodium Lauryl Sulfate" should get the same match. CAS numbers are checked with exact match (they're standardized identifiers), while text fields use fuzzy matching.

AI Document Parsing

Ingredient names don't arrive in a clean list. They're embedded in PDF product labels, Excel spreadsheets, scanned images, and CSV exports from formulation software. Each format requires different extraction logic.

We use Claude to parse uploaded documents. The model receives the document content (text extracted via OCR for images, direct text for PDFs) and a structured prompt asking it to identify and extract ingredient names, concentrations, and functions. The output is a structured JSON array, not free text — this constrains the model's output to match our expected schema.
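The parsed output might look like the following. The field names and values here are illustrative, not the exact production schema:

```python
import json

# Hypothetical example of the structured JSON array the parser returns.
sample = json.loads("""
[
  {"name": "Aqua", "concentration": "72.0%", "function": "Solvent"},
  {"name": "Sodium Lauryl Sulfate", "concentration": "8.5%", "function": "Surfactant"}
]
""")

# Downstream code can rely on every item carrying the same keys.
for item in sample:
    assert {"name", "concentration", "function"} <= item.keys()
```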

Parsing accuracy varies by format: structured CSV/XLSX files parse at near 100% accuracy. Well-formatted PDF ingredient lists (the standard INCI list on the back of a product box) parse at 95%+. Handwritten formulation sheets or low-resolution images are the hardest, parsing at 80-90%. Every parsed result is stored as a training sample for future model improvement.

Multi-Threshold Matching

Once we have extracted ingredient names, each name is matched against the master database using rapidfuzz.fuzz.token_sort_ratio. We chose token_sort_ratio over simpler algorithms because it normalizes word order before comparing — handling the common case where ingredient names are listed in different word orders across manufacturers.

Why token_sort_ratio

Consider matching "Polyethylene Glycol Monostearate" against "Monostearate, Polyethylene Glycol". A Levenshtein edit distance would score this poorly because the characters are in completely different positions. But token_sort_ratio first sorts the tokens alphabetically ("Glycol Monostearate Polyethylene" for both), then compares — yielding a perfect score.

This single choice eliminates an entire class of false negatives that would otherwise require manual resolution.

Auto-accepted
score >= 0.90

High-confidence fuzzy match. Mapped directly to canonical INCI name. The user sees the match in the UI but doesn't need to confirm — reducing friction for the common case where the uploaded name is close to correct.

Example: "Sodium Laryl Sulfate" → "Sodium Lauryl Sulfate" (score: 0.96)

Candidate suggestions
0.60 <= score < 0.90

Possible match but not confident enough to auto-accept. The UI presents "did you mean?" candidates ranked by score. The user taps to confirm, swap, or reject. Even auto-accepted matches carry candidates as a safety net.

Example: "PEG-40 Hydrogenated Castor" → candidates: "PEG-40 Hydrogenated Castor Oil" (0.87), "PEG-40 Castor Oil" (0.72)

Auto-created draft
score < 0.60

No match found. Rather than blocking the SKU creation, we auto-create a new ingredient record with review_status='draft' and source='ai_parse'. The SKU proceeds, but the ingredient needs admin review before it's trusted in compliance checks.

Example: "Proprietary Peptide Complex XR-7" → new draft ingredient created

The Candidate System

A subtle but important design decision: every match response includes a candidates list, even for auto-accepted matches. We found that showing alternatives catches the 2-3% of cases where token_sort_ratio picks the wrong canonical name from a set of similar INCI names.

Consider "Cetyl Alcohol" vs "Cetearyl Alcohol". These are different ingredients with different regulatory profiles, but they score 0.92 against each other. Without candidates, the system might auto-accept the wrong one. With candidates, the user sees both options and can swap if needed.

The candidate system turns ingredient matching from a black-box operation into a transparent, auditable process. Every match decision is visible, reviewable, and reversible — critical for a compliance system where incorrect matches have real regulatory consequences.

Auto-Creating Draft Ingredients

When the fuzzy matching pipeline can't find a match (score below 0.60), we don't block the user. Instead, we auto-create a new ingredient record in the master database with:

  • review_status: "draft" — invisible to other customers until approved
  • source: "ai_parse" — flagged as AI-created for admin review
  • ai_confidence: 0.xx — the highest match score achieved (even if below threshold)
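The draft record can be sketched as a plain dict. The field names mirror the bullets above; `created_at` and the function signature are illustrative, and the persistence layer is omitted:

```python
from datetime import datetime, timezone

def make_draft(raw_name: str, best_score: float) -> dict:
    """Build the draft ingredient record created when no match clears 0.60."""
    return {
        "inci_name": raw_name,
        "review_status": "draft",   # invisible to other customers until approved
        "source": "ai_parse",       # flagged as AI-created for admin review
        "ai_confidence": round(best_score, 2),  # best score, even if sub-threshold
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```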

This approach has a crucial benefit: it turns every unmatched ingredient into a signal. When admins review draft ingredients, they either confirm it's genuinely new (and add it to the master database with proper identifiers) or realize it's a variant of an existing ingredient (and merge it, improving future matching).

Over time, the master database grows organically from real-world product formulations — not just from government reference lists. This is the "collective intelligence" effect: every product checked makes the system smarter for all users.

Results

Our current matching pipeline processes ingredient lists with the following characteristics:

  • 30,991 master ingredients
  • ~92% auto-accepted rate
  • ~6% candidate review rate
  • ~2% auto-created drafts

The 92% auto-accept rate means most uploads require no manual ingredient resolution at all. The 6% candidate review rate is quick — users tap the correct match from a ranked list. And the 2% auto-create rate feeds the review pipeline that grows the master database over time.

The matching pipeline runs in under 500ms for a typical ingredient list of 20-40 ingredients. Combined with AI document parsing (1-3 seconds for a PDF), the total time from upload to matched ingredient list is under 5 seconds.

Upload a product file and try it

PDF, spreadsheet, or image. Parsed and matched in seconds.

Get Started