SEMANTIC FUZZY MATCHING WITH EMBEDDINGS
OVERVIEW
Traditional string similarity measures work well for typographical errors and simple variants. They are less reliable when meaning is the principal signal. Embeddings encode text as vectors so that semantically similar values appear close together. This improves recall for deduplication and matching tasks when records use different phrasing, abbreviations or paraphrases.
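To make "close together" concrete, here is a toy sketch of cosine similarity over hand-made three-dimensional vectors. The vectors and names are illustrative only; real embedding models emit hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" standing in for model output.
nyc = [0.9, 0.1, 0.2]
new_york_city = [0.85, 0.15, 0.25]
apple = [0.1, 0.9, 0.3]

# "NYC" and "New York City" point in nearly the same direction,
# while "Apple" points elsewhere.
```

A lexical measure would score "NYC" and "New York City" as almost entirely different strings; in embedding space their vectors are nearly parallel.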
WHEN TO USE SEMANTIC MATCHING
- Different phrasing: "NYC" versus "New York City".
- Synonyms or industry jargon that lexical measures miss.
- Short fragments or addresses where context disambiguates meaning.
RECOMMENDED PIPELINE
- Normalise text with NORMALIZE.
- Generate candidates using Flookup functions such as FLOOKUP, ULIST or DEDUPE.
- Compute embeddings and index with an ANN library (for example FAISS or Annoy).
- Verify candidates with semantic scores plus a secondary lexical check; surface uncertain cases for review.
PYTHON EXAMPLE (SENTENCE-TRANSFORMERS + FAISS)
This minimal example demonstrates building an index and querying it. For production add batching, persistence and monitoring.
pip install sentence-transformers faiss-cpu
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
records = [
'Acme Corporation',
'Acme Corp.',
'Acme, Inc',
'Apple Inc',
'New York City'
]
# Encode all records in one batch
embs = model.encode(records, convert_to_numpy=True)
# L2-normalise so inner product equals cosine similarity
faiss.normalize_L2(embs)
d = embs.shape[1]
index = faiss.IndexFlatIP(d)  # exact inner-product index
index.add(embs)
# Encode and normalise the query the same way
query = 'Acme Corporation'
q_emb = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(q_emb)
distances, indices = index.search(q_emb, k=5)
for score, idx in zip(distances[0], indices[0]):
    print(score, records[idx])
VERIFICATION AND THRESHOLDS
Embeddings increase recall but require careful verification to avoid false positives. Use a conservative similarity threshold (for example 0.75–0.9), add a secondary lexical test such as FUZZYSIM, and route borderline matches into a review queue.
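A minimal sketch of this verification step, using difflib's SequenceMatcher as a stand-in for FUZZYSIM. The function name and the threshold values (0.9 accept, 0.75 review) are illustrative assumptions; tune them on labelled data.

```python
import difflib

def verify_match(candidate, query, semantic_score,
                 accept=0.9, review=0.75):
    """Combine a semantic score with a secondary lexical check.

    Returns 'accept', 'review' (route to a human queue), or 'reject'.
    The lexical ratio stands in for Flookup's FUZZYSIM.
    """
    lexical = difflib.SequenceMatcher(
        None, query.lower(), candidate.lower()).ratio()
    if semantic_score >= accept and lexical >= 0.5:
        return "accept"
    if semantic_score >= review:
        return "review"
    return "reject"
```

Requiring both signals to agree before auto-accepting keeps high-similarity false positives (semantically related but distinct entities) out of the merged data.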
GOOGLE SHEETS INTEGRATION
Keep workflows in the spreadsheet by combining Flookup and a lightweight external service. Pattern:
- Normalise and produce candidates in the sheet.
- Call an external matching endpoint from Apps Script for candidate verification only.
- Store scores and decisions back in the sheet and schedule rechecks with Schedule Functions.
// Call an external semantic matching endpoint for candidate verification.
function getSemanticMatches(text) {
  var url = 'https://your-embedding-service.example/api/match';
  var payload = JSON.stringify({query: text, top_k: 5});
  var opts = {method: 'post', contentType: 'application/json', payload: payload};
  var resp = UrlFetchApp.fetch(url, opts);
  return JSON.parse(resp.getContentText());
}
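The endpoint itself is whatever you deploy; as an assumption, a server-side handler for that request shape might look like the following sketch. The scoring here is stubbed so the contract is clear; in practice the handler would query the FAISS index built in the Python example above.

```python
import json

# Hypothetical record store; in production this backs onto the FAISS index.
RECORDS = ['Acme Corporation', 'Acme Corp.', 'Apple Inc']

def match_handler(body):
    """Handle a JSON request {"query": ..., "top_k": ...} and return
    {"matches": [{"text": ..., "score": ...}, ...]} as a JSON string."""
    req = json.loads(body)
    query = req['query']
    top_k = int(req.get('top_k', 5))
    # Stub scoring: replace with index.search(...) over real embeddings.
    scored = [{'text': r, 'score': 1.0 if r == query else 0.5}
              for r in RECORDS]
    scored.sort(key=lambda m: m['score'], reverse=True)
    return json.dumps({'matches': scored[:top_k]})
```

Keeping the response to plain text-and-score pairs means the Apps Script side stays a thin client and all model choices remain server-side.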
PRACTICAL HYBRID PATTERN WITH FLOOKUP
Use Flookup to reduce the candidate set and compute embeddings only for candidates. This balances cost and accuracy while keeping most workflow steps inside Google Sheets using functions such as NORMALIZE, FLOOKUP and ULIST. For uncertain matches use a manual review sheet or staged merge.
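A sketch of the candidate-reduction idea, with difflib.get_close_matches standing in for the sheet-side FLOOKUP/ULIST pass; the n and cutoff values are illustrative assumptions.

```python
import difflib

def candidate_prefilter(query, records, n=10, cutoff=0.4):
    """Cheap lexical pass (stand-in for FLOOKUP/ULIST in the sheet):
    keep only the lexically closest records, so embeddings are computed
    for the survivors rather than the whole table."""
    return difflib.get_close_matches(query, records, n=n, cutoff=cutoff)

records = ['Acme Corporation', 'Acme Corp.', 'Acme, Inc',
           'Apple Inc', 'New York City']
candidates = candidate_prefilter('Acme Corporation', records)
# Embed only `candidates` with the SentenceTransformer model,
# instead of all of `records`.
```

With large tables this prefilter turns an all-pairs embedding job into a few dozen encodes per query, which is where the cost saving comes from.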
FURTHER READING
The notes below provide additional reading and resources for teams that wish to explore embeddings, indexing strategies and production deployment patterns.
- An Introduction to Fuzzy Matching Algorithms — background on classical approaches and where embeddings add value.
- Flookup AI documentation — details of the add-on's semantic matching and intelligent standardisation features.
- Flookup custom functions — reference for NORMALIZE, FLOOKUP, FUZZYSIM, ULIST and DEDUPE.
- Sentence-BERT (SBERT) — the research that underpins many sentence embedding models; useful for understanding semantic encoders.
- sentence-transformers (GitHub) — library and model hub; includes all-MiniLM-L6-v2, a compact, fast encoder.
- FAISS — a widely used library for efficient similarity search and clustering of dense vectors.
- Annoy — an alternative approximate nearest neighbour library optimised for memory efficiency.
- Milvus — vector database suited to production deployments with persistent storage and scale-out options.
- Hugging Face Transformers — model repository and deployment patterns for various transformer encoders.