Solutions to Fuzzy Matching in NLP

Newt Tan
3 min read · Nov 22, 2020

There is always a lot happening in NLP, especially in the industrial world.

I ran into a problem when I was trying to match a phrase in a sentence. The scenario looks like this:

'tom and alice' --- > 'tom is the soulmate of alice, ...'

If you want to match, or extract, the core information or entity in that sentence, this is where fuzzy matching plays a role. In this article, I will introduce two methods, based on spaczz and fuzzywuzzy.

In the NLP world, the most popular open tools I found are textacy and spaCy. Of course, textacy is built on top of spaCy. I will use them to tokenize the sentence rather than the basic nltk modules.

1. Spaczz Match

spaczz is a companion library for spaCy, which is quite convenient for fuzzy matching. The principle is to add patterns to the SpaczzRuler, so you do not need to worry about the length of your tokens.

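A minimal sketch of that setup, assuming spaczz's SpaczzRuler with the spaCy 2.x add_pipe API (the label COUPLE and the sentence are just made up from the example above):

import spacy
from spaczz.pipeline import SpaczzRuler

nlp = spacy.load("en_core_web_sm")
ruler = SpaczzRuler(nlp)
# a "fuzzy" pattern tolerates inexact surface forms of the phrase
ruler.add_patterns([{"label": "COUPLE", "pattern": "tom and alice", "type": "fuzzy"}])
nlp.add_pipe(ruler, before="ner")

doc = nlp("tom is the soulmate of alice")
for ent in doc.ents:
    print(ent.text, ent.label_)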

If you do not care too much about accuracy or speed, you can definitely choose it; it is already good enough for many cases.

2. FuzzyWuzzy Match

When it comes to third-party fuzzy matching libraries, fuzzywuzzy is the most common one I found. But a lot of problems came up when I used it.

The trickiest one is what to match: spans, tokens, or noun_chunks.

Because the common units when you process any sentence are spans and tokens. In addition, we should not ignore noun_chunks.

Here I want to give a little explanation for these concepts:

  • a Token represents a single word, punctuation symbol, whitespace, etc. from a document, while a Span is a slice of the document. In other words, a Span is an ordered sequence of Tokens.
  • noun_chunks are more like sequences of tokens (nouns) that carry a complete meaning, which means a noun_chunk is usually composed of more than just one token (see the sketch right after this list).
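A quick illustration of these three units in spaCy, assuming en_core_web_sm is installed and reusing the example sentence:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("tom is the soulmate of alice")

print([token.text for token in doc])              # Tokens: single words/symbols
print(doc[0:3].text)                              # a Span: a slice of the doc
print([chunk.text for chunk in doc.noun_chunks])  # noun_chunks: base noun phrases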

In the fuzzy matching process, the question that came to my mind was whether to match a given word against the set of tokens or of spans (which are composed of ents + noun_chunks). They are generated in slightly different ways:

For tokens, with the textacy tool:

import textacy
from textacy.spacier.doc_extensions import to_tokenized_text

spacy_lang = textacy.load_spacy_lang("en_core_web_sm")
docx_textacy = spacy_lang(sentence)
# nested lists of token texts, one list per sentence
tokens = to_tokenized_text(docx_textacy)

Spans can be generated this way:

spans = list(docx_textacy.ents) + list(docx_textacy.noun_chunks)

Of course, I did not know the answer up front. What I want to achieve is to find the matched token/span that carries relatively rich related information.

For fuzzy matching, I used the fuzzywuzzy library. The function I used in this test is extractOne. Basically, this function takes two parameters, (word, word_list). It computes a similarity ratio for each candidate in word_list and returns the candidate with the highest ratio for the given word.
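A small sketch of that call, reusing the tokens and spans from the snippets above (the query word here is hypothetical; note that to_tokenized_text nests tokens per sentence and that spaCy Span objects need .text before they can be compared as strings):

from fuzzywuzzy import process

word = "tom and alice"                              # the phrase we want to locate
tokens_flat = [t for sent in tokens for t in sent]  # flatten the per-sentence lists
span_texts = [span.text for span in spans]          # Span objects -> plain strings

print(process.extractOne(word, tokens_flat))  # (best candidate, its ratio)
print(process.extractOne(word, span_texts))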

In the first two tests, we can clearly see that the spans give more information than the tokens.

But when it comes to the third one, the tokens seem to contain more of the crucial information. Meanwhile, if we check every token in the tokens list, any token that fully matches one word of the query gets a ratio of 90. (I verified it as well.)
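That constant 90 is no accident, assuming extractOne is running with its default scorer fuzz.WRatio: when one string is much shorter than the other, WRatio falls back to a partial ratio scaled by 0.9, so a token fully contained in the query tops out at 90:

from fuzzywuzzy import fuzz

# partial_ratio("alice", "tom and alice") is 100; WRatio scales it by 0.9
print(fuzz.WRatio("alice", "tom and alice"))  # 90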

The difference becomes more representative at scale. For matching, the best way is to define a ratio threshold. From this small volume of tests, it would be 90 for tokens and 86 for spans.
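fuzzywuzzy can enforce such a threshold directly through extractOne's score_cutoff parameter; a sketch, reusing the lists from above with the thresholds just mentioned:

from fuzzywuzzy import process

# extractOne returns None when no candidate reaches the cutoff
token_match = process.extractOne(word, tokens_flat, score_cutoff=90)
span_match = process.extractOne(word, span_texts, score_cutoff=86)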

In conclusion, the result is quite tricky. If we choose tokens, we get a precise result but less information. With spans we get more information, but also a higher chance of being misled about the crucial meaning.

If anyone has more useful information or solutions, leave a comment. I am happy to talk and discuss.

Thanks for your time and reading. Have a nice day.
