COMBINE FUZZY MATCHING AND EMBEDDINGS IN GOOGLE SHEETS AND EXCEL

On this page

Why hybrid matching
High-level pattern
Google Sheets (Apps Script)
Excel (Office Script)
Performance and Cost Examples
ANN and Production Notes
UX and Review Workflows
Further reading

WHY HYBRID MATCHING

Classical fuzzy matching remains efficient for typographical errors and close variants; embeddings capture semantic similarity across phrasing, abbreviations and synonyms. Combining both reduces calls to embedding services while preserving precision. The hybrid approach is particularly suitable for spreadsheet audiences who require low‑friction, low‑cost integration.

HIGH-LEVEL PATTERN

Normalise text in the sheet (for example with NORMALIZE).
Use Flookup functions such as FLOOKUP or ULIST to produce a short candidate list per row.
Compute embeddings for candidates only and score with a semantic service.
Combine semantic score with a lexical check (for example FUZZYSIM) and apply thresholds to auto‑accept, reject or flag for review.

GOOGLE SHEETS (APPS SCRIPT)

The snippet below is a minimal, production‑aware Apps Script that batches candidate requests, calls a hypothetical embedding match endpoint and writes scores back to the sheet. Adapt the endpoint and authentication to the chosen provider.

function batchSemanticMatch() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName('Matches');
  var data = sheet.getRange(2,1,sheet.getLastRow()-1,3).getValues(); // [id, query, candidates_json]

  var batchSize = 50;
  for (var i=0; i<data.length; i+=batchSize) {
    var batch = data.slice(i, i+batchSize);
    var payload = batch.map(function(r){
      return {id: r[0], query: r[1], candidates: JSON.parse(r[2])};
    });

    var resp = UrlFetchApp.fetch('https://your-embedding-service.example/api/match', {
      method: 'post',
      contentType: 'application/json',
      payload: JSON.stringify({items: payload, top_k:5}),
      muteHttpExceptions: true
    });

    if (resp.getResponseCode() !== 200) continue;
    var results = JSON.parse(resp.getContentText());
    results.forEach(function(r, idx){
      var row = i + idx + 2;
      sheet.getRange(row,4).setValue(JSON.stringify(r.matches)); // write matches JSON
      sheet.getRange(row,5).setValue(r.top_score);
    });
  }
}

Notes:

Batching reduces network overhead and avoids hitting per‑request rate limits.
Store and cache computed embeddings where records are static to avoid recomputation.
Use Schedule Functions or Apps Script triggers for periodic rechecks.

EXCEL (OFFICE SCRIPT)

Office Script can perform an equivalent flow in Excel Online. The following is a concise example using the Fetch API available in Office Scripts runtimes.

async function main(workbook: ExcelScript.Workbook) {
  const sheet = workbook.getWorksheet('Matches');
  const range = sheet.getRange('A2:C101'); // adjust range
  const values = range.getValues();

  const items = values.map(r => ({id: r[0], query: r[1], candidates: JSON.parse(r[2])}));

  const resp = await fetch('https://your-embedding-service.example/api/match', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({items: items, top_k: 5})
  });

  if (!resp.ok) return;
  const results = await resp.json();
  results.forEach((r, i) => {
    sheet.getRange(`D${i+2}`).setValue(JSON.stringify(r.matches));
    sheet.getRange(`E${i+2}`).setValue(r.top_score);
  });
}

Performance and Cost Examples

Worked example: 10,000 rows, naive embedding per row vs hybrid candidate approach.

Approach	Embedding calls	Estimate cost (@ $0.0005/call)
Naive (one embedding per row)	10,000	$5.00
Hybrid (FLOOKUP -> avg 20 candidates)	10,000 * 20 = 200,000	$100.00
Optimised hybrid (candidate blocking + caching -> avg 5 candidates)	50,000	$25.00

Interpretation: hybrid approaches can increase embedding calls per row if candidate lists are large; candidate reduction and caching are essential to keep costs manageable. The cost figures above are illustrative; adapt pricing to the chosen provider and model size.

ANN and Production Notes

For larger datasets persist an ANN index (FAISS, Annoy, Milvus) and precompute embeddings offline.
Use the spreadsheet flow for verification and human review, and push large‑scale deduplication tasks to batch services that update the sheet with results.
Monitor embedding stability — reindex or retrain if nearest‑neighbour behaviour changes over time.

UX and Review Workflows

Recommended columns in the review sheet:

candidate_score: semantic score from embedding service
lexical_score: FUZZYSIM or similar
decision: accept / reject / review
review_flag: boolean for manual triage