SEMANTIC FUZZY MATCHING WITH EMBEDDINGS

Tags: fuzzy-matching flookup embeddings

OVERVIEW

Traditional string similarity measures work well for typographical errors and simple variants. They are less reliable when meaning is the principal signal. Embeddings encode text as vectors so that semantically similar values appear close together. This improves recall for deduplication and matching tasks when records use different phrasing, abbreviations or paraphrases.
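
The "closeness" of two embedding vectors is usually measured with cosine similarity. A minimal pure-Python illustration, using toy three-dimensional vectors as stand-ins for real embeddings (real models produce hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means identical direction, 0.0 orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the first two point in similar directions (semantically
# similar records), the third does not.
acme_corp = [0.9, 0.1, 0.2]
acme_inc = [0.85, 0.15, 0.25]
new_york = [0.1, 0.9, 0.3]

print(cosine_similarity(acme_corp, acme_inc))  # ~0.996 (very similar)
print(cosine_similarity(acme_corp, new_york))  # ~0.271 (dissimilar)
```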

WHEN TO USE SEMANTIC MATCHING

Reach for embeddings when meaning, not spelling, separates matches from non-matches: synonyms, abbreviations, paraphrased descriptions or translated variants. For plain typographical errors and character-level variants, lexical fuzzy matching is usually faster and cheaper.

RECOMMENDED PIPELINE

  1. Normalise text with NORMALIZE.
  2. Generate candidates using Flookup functions such as FLOOKUP, ULIST or DEDUPE.
  3. Compute embeddings and index with an ANN library (for example FAISS or Annoy).
  4. Verify candidates with semantic scores plus a secondary lexical check; surface uncertain cases for review.
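
In-sheet, step 1 is handled by Flookup's NORMALIZE. When pre-processing outside the sheet, a rough Python analogue might look like the sketch below (an assumption about typical normalisation steps, not Flookup's exact behaviour):

```python
import re
import unicodedata

def normalize(text):
    # Unicode-normalise and strip accents, fold case,
    # drop punctuation, collapse whitespace.
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    return ' '.join(text.split())

print(normalize('  Acme, Inc.  '))  # 'acme inc'
print(normalize('Café-Crème'))      # 'cafe creme'
```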

PYTHON EXAMPLE (SENTENCE-TRANSFORMERS + FAISS)

This minimal example demonstrates building an index and querying it. For production add batching, persistence and monitoring.

pip install sentence-transformers faiss-cpu

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

records = [
  'Acme Corporation',
  'Acme Corp.',
  'Acme, Inc',
  'Apple Inc',
  'New York City'
]

embs = model.encode(records, convert_to_numpy=True)

# normalise and index for cosine similarity
faiss.normalize_L2(embs)
d = embs.shape[1]
index = faiss.IndexFlatIP(d)
index.add(embs)

query = 'Acme Corporation'
q_emb = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(q_emb)
# with unit vectors, inner products are cosine similarities
scores, indices = index.search(q_emb, 5)

for score, idx in zip(scores[0], indices[0]):
    print(score, records[idx])

VERIFICATION AND THRESHOLDS

Embeddings increase recall but require careful verification to avoid false positives. Use a conservative similarity threshold (for example 0.75–0.9), add a secondary lexical test such as FUZZYSIM, and route borderline matches into a review queue.
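
A sketch of this two-stage check, using Python's standard-library difflib as a stand-in for an in-sheet FUZZYSIM call (the thresholds and review band here are illustrative, not recommendations):

```python
from difflib import SequenceMatcher

SEMANTIC_THRESHOLD = 0.8   # illustrative; tune per dataset
LEXICAL_THRESHOLD = 0.6
REVIEW_BAND = 0.05         # borderline margin routed to manual review

def verify(semantic_score, a, b):
    """Combine an embedding similarity with a secondary lexical check."""
    lexical_score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if semantic_score >= SEMANTIC_THRESHOLD and lexical_score >= LEXICAL_THRESHOLD:
        return 'accept'
    if semantic_score >= SEMANTIC_THRESHOLD - REVIEW_BAND:
        return 'review'  # uncertain: queue for a human
    return 'reject'

print(verify(0.92, 'Acme Corporation', 'Acme Corp.'))     # accept
print(verify(0.78, 'Acme Corporation', 'Apple Inc'))      # review
print(verify(0.40, 'Acme Corporation', 'New York City'))  # reject
```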

GOOGLE SHEETS INTEGRATION

Keep workflows in the spreadsheet by combining Flookup and a lightweight external service. Pattern:

  1. Normalise and produce candidates in the sheet.
  2. Call an external matching endpoint from Apps Script for candidate verification only.
  3. Store scores and decisions back in the sheet and schedule rechecks with Schedule Functions.

A minimal Apps Script helper for step 2, posting a candidate to an external verification endpoint:

function getSemanticMatches(text) {
  // Placeholder endpoint: replace with your own matching service URL.
  var url = 'https://your-embedding-service.example/api/match';
  var payload = JSON.stringify({query: text, top_k: 5});
  var opts = {method: 'post', contentType: 'application/json', payload: payload, muteHttpExceptions: true};
  var resp = UrlFetchApp.fetch(url, opts);
  if (resp.getResponseCode() !== 200) {
    throw new Error('Matching service returned HTTP ' + resp.getResponseCode());
  }
  return JSON.parse(resp.getContentText());
}

PRACTICAL HYBRID PATTERN WITH FLOOKUP

Use Flookup to reduce the candidate set and compute embeddings only for candidates. This balances cost and accuracy while keeping most workflow steps inside Google Sheets using functions such as NORMALIZE, FLOOKUP and ULIST. For uncertain matches use a manual review sheet or staged merge.
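
The candidate-only embedding step can be sketched as follows. A toy character-bigram vectoriser stands in for a real embedding model here so the example is self-contained; the candidate pairs are assumed to come from an in-sheet FLOOKUP pass:

```python
from collections import Counter
import math

def toy_embed(text):
    # Character-bigram counts as a crude stand-in for a sentence embedding.
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

# Candidate pairs already produced in-sheet; only these get embedded,
# rather than the full record set.
candidates = [('Acme Corporation', 'Acme Corp.'),
              ('Acme Corporation', 'New York City')]

scores = {pair: cosine(toy_embed(pair[0]), toy_embed(pair[1]))
          for pair in candidates}
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Embedding only candidates keeps API and compute costs proportional to the candidate count rather than the square of the record count.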

FURTHER READING

The notes below provide additional reading and resources for teams that wish to explore embeddings, indexing strategies and production deployment patterns.