Rough Matching

When dealing with matching in sheets, you sometimes need to get close matches. This post shares the “Rough” namespace of the cUseful library, available here.

Some examples

Let’s start with this sheet – a Sean Connery filmography. The objective is to find rows that roughly match some text, to be provided later.

I’m using fiddler to handle sheet interactions because it makes manipullation so much easier. You can of course access your sheet in other ways, but I won’t cover that here, so let’s start by getting the sheet contents and objectifying.

Now we need to set up rough matching. I’m going to set up a reference list using the trackName column – but this could be any list or combination of column data. Matching of any sort can return more than 1 result. The rough matcher assigns a value of 0-1, and returns the results in order, with the most likely match first. I’m setting the minimum score level to 0.3, so anything below that score will be discarded.

Testing

We’re now ready to go – here’s a series of rough matches, with the most likely result logged out

the results

Umlats and accents

Many languages have accents in the spelling. These are dealt with too.

result

Getting matching rows

Often (as with vlookup), you’ll want to get the matching value from another part of the matched row. Rough match returns an index that can be used to index into the original data that was used to set up the reference list.

result

Multiple results

Here’s how you would get the top 3 results, with their scores

result

Options

There are plenty of options to tweak the algorithm. It’s fairly easy to figure out what they do so I’ll leave you to play. You pass tweaked options when you .init() the Rough object. (In the example earlier, I tweaked the min value). Here are all the available options at time of writing.

Stemming

This refers to changing words to their stem – for example director, direction, directors, directing etc all have the same stem. Stemming is a common process to normalize text for input to AI.

Stubbing

This is a technique to simplify the spelling of the stems – so director becomes drctr and so on – which removes most common spelling mistakes.

UnAccent

Removes accents and umlats.

Stopwords

This removes non valuable words like “with”, “and” etc from consideration.

Language

This selects the language to use. This mainly applies to stopwords. Currently only English and French are supported – but it’s easy to add others.

Extrastops

You can add other non-value add words of your own – perhaps your domain contains specific words that are often used but won’t help with matching differentiation.

Scores

These are used by the algorithm to weight the results.

Most of these processes are provided as standalone namespaces in cUseful – for example if you just want to do stemming for some other purpose, you can use the Stemming workspace

Matching across sheets.

Here’s a full example – where the first and last name are provided in fairly poor shape, but get roughly matched as follows.

The code

For more like this see Google Apps Scripts Snippets
For help and more information join our community, follow the blog or follow me on Twitter