COMBINE FUZZY MATCHING AND EMBEDDINGS IN GOOGLE SHEETS AND EXCEL
WHY HYBRID MATCHING
Classical fuzzy matching remains efficient for typographical errors and close variants; embeddings capture semantic similarity across phrasing, abbreviations and synonyms. Combining both reduces calls to embedding services while preserving precision. The hybrid approach is particularly suitable for spreadsheet audiences who require low‑friction, low‑cost integration.
HIGH-LEVEL PATTERN
- Normalise text in the sheet (for example with NORMALIZE).
- Use Flookup functions such as FLOOKUP or ULIST to produce a short candidate list per row.
- Compute embeddings for candidates only and score with a semantic service.
- Combine semantic score with a lexical check (for example FUZZYSIM) and apply thresholds to auto‑accept, reject or flag for review.
GOOGLE SHEETS (APPS SCRIPT)
The snippet below is a minimal, production‑aware Apps Script that batches candidate requests, calls a hypothetical embedding match endpoint and writes scores back to the sheet. Adapt the endpoint and authentication to the chosen provider.
function batchSemanticMatch() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheetByName('Matches');
var data = sheet.getRange(2,1,sheet.getLastRow()-1,3).getValues(); // [id, query, candidates_json]
var batchSize = 50;
for (var i=0; i<data.length; i+=batchSize) {
var batch = data.slice(i, i+batchSize);
var payload = batch.map(function(r){
return {id: r[0], query: r[1], candidates: JSON.parse(r[2])};
});
var resp = UrlFetchApp.fetch('https://your-embedding-service.example/api/match', {
method: 'post',
contentType: 'application/json',
payload: JSON.stringify({items: payload, top_k:5}),
muteHttpExceptions: true
});
if (resp.getResponseCode() !== 200) continue;
var results = JSON.parse(resp.getContentText());
results.forEach(function(r, idx){
var row = i + idx + 2;
sheet.getRange(row,4).setValue(JSON.stringify(r.matches)); // write matches JSON
sheet.getRange(row,5).setValue(r.top_score);
});
}
}
Notes:
- Batching reduces network overhead and avoids hitting per‑request rate limits.
- Store and cache computed embeddings where records are static to avoid recomputation.
- Use Schedule Functions or Apps Script triggers for periodic rechecks.
EXCEL (OFFICE SCRIPT)
Office Script can perform an equivalent flow in Excel Online. The following is a concise example using the Fetch API available in Office Scripts runtimes.
async function main(workbook: ExcelScript.Workbook) {
const sheet = workbook.getWorksheet('Matches');
const range = sheet.getRange('A2:C101'); // adjust range
const values = range.getValues();
const items = values.map(r => ({id: r[0], query: r[1], candidates: JSON.parse(r[2])}));
const resp = await fetch('https://your-embedding-service.example/api/match', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({items: items, top_k: 5})
});
if (!resp.ok) return;
const results = await resp.json();
results.forEach((r, i) => {
sheet.getRange(`D${i+2}`).setValue(JSON.stringify(r.matches));
sheet.getRange(`E${i+2}`).setValue(r.top_score);
});
}
Performance and Cost Examples
Worked example: 10,000 rows, naive embedding per row vs hybrid candidate approach.
| Approach | Embedding calls | Estimate cost (@ $0.0005/call) |
|---|---|---|
| Naive (one embedding per row) | 10,000 | $5.00 |
| Hybrid (FLOOKUP -> avg 20 candidates) | 10,000 * 20 = 200,000 | $100.00 |
| Optimised hybrid (candidate blocking + caching -> avg 5 candidates) | 50,000 | $25.00 |
Interpretation: hybrid approaches can increase embedding calls per row if candidate lists are large; candidate reduction and caching are essential to keep costs manageable. The cost figures above are illustrative; adapt pricing to the chosen provider and model size.
ANN and Production Notes
- For larger datasets persist an ANN index (FAISS, Annoy, Milvus) and precompute embeddings offline.
- Use the spreadsheet flow for verification and human review, and push large‑scale deduplication tasks to batch services that update the sheet with results.
- Monitor embedding stability — reindex or retrain if nearest‑neighbour behaviour changes over time.
UX and Review Workflows
Recommended columns in the review sheet:
- candidate_score: semantic score from embedding service
- lexical_score: FUZZYSIM or similar
- decision: accept / reject / review
- review_flag: boolean for manual triage
FURTHER READING
- Semantic fuzzy matching with embeddings
- Flookup custom functions
- Flookup AI documentation
- An Introduction to Fuzzy Matching Algorithms