
What is Entity (Fuzzy) Matching?



No, it’s not an online dating strategy. Entity Matching is the ability to link records that refer to the same real-world entity but may be spelled or formatted differently across the data sets being compared. Fuzzy matching refers to the techniques, or algorithms, that make entity matching possible.

Usually, records can be linked using a shared unique identifier, or primary key. Let’s start with how a primary key helps you link records and why linking records is important.

[Image: fuzzy_matching_link_merge.PNG – two tables, names and account balances, linked and merged on a shared Account Number]

In the example above, a simple query can link and merge the two tables into one. This type of basic data management can be useful for performing analysis and generating reports.

Here, the “Account Number” is the primary key and allows you to match the table of names with the table of balances so you can create a more informative view.
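
As a rough illustration of that primary-key join, here is a minimal Python (pandas) sketch. The table contents and column names are invented stand-ins for the tables pictured above, not the article’s actual data.

    import pandas as pd

    # Illustrative stand-in tables; the names, account numbers, and balances are invented.
    names = pd.DataFrame({
        "Account Number": [1001, 1002],
        "Name": ["Jane Doe", "Yao Ming"],
    })
    balances = pd.DataFrame({
        "Account Number": [1001, 1002],
        "Balance": [250_000, 410_000],
    })

    # The equivalent of a simple SQL inner join on the shared primary key.
    merged = pd.merge(names, balances, on="Account Number")
    print(merged)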

Now, let’s look at another couple of data sets where the account number is missing from one of them, making it impossible to merge the tables using this field.

Here we have a simple table of names and their mortgage account numbers.

Now let’s say a mortgage lender was conducting due diligence on credit risk. The respondents did not provide account numbers, and the names are spelled differently for whatever reason. Some are typos, and others are legitimate variations of a name.

A simple query will not be able to join these two tables because:

  1. The second table does not have the “Account Number” field to link the two tables.

  2. The “Name” field contains different values, and the two tables share no matching values to link them.

While we (humans) can recognize that Yao Ming and Ming Yao are very likely the same person, the computer doesn’t have this intuition, or the ability to assess the likelihood of a match, without some instructions.

This is where Entity Matching algorithms come in: to teach the computer to match names that are not identical. We will get back to algorithms below, but let’s first explore the Entity Matching problem a little deeper.

So far, we have only looked at names, and only a handful of variations. Now imagine many more variations, plus phone numbers, addresses, and other fields. The Entity Matching problem can quickly become complex and challenging to solve.

What Kind of Data Commonly Needs Fuzzy Matching?

Names: Different spellings of the same name are one of the most common reasons for Fuzzy Matching. A sales rep could have introduced typos at data entry, or a data extraction technique such as Optical Character Recognition (OCR) could have misread some characters.

Don’t forget, it does not have to be human names. Here are a couple of examples involving businesses and locations:

“McCafe” vs. “McCafé” and “Mississipi” vs. “Mississippi”.

Addresses: Another field prone to variations is the address. Here’s a made-up address:

50-50 First Street, Fort Totten Boulevard, 11358

Now look at the same address, written a different way:

50-50 1st St. Ft Totten Blvd, 11358

The variations here are:

First vs. 1st
Street vs. St
A comma vs. no comma, but a period after the abbreviation
Fort vs. Ft
Boulevard vs. Blvd

You can start to imagine the permutations of variation types, which add up quickly when there are hundreds or thousands of names and addresses.
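
One common way to tame these variations is to normalize addresses before comparing them. Here is a minimal Python sketch; the abbreviation map is a tiny, made-up sample, whereas a real system would use a much larger, locale-aware list (for example, the USPS street-suffix abbreviations) and handle ambiguous tokens more carefully.

    import re

    # Tiny illustrative abbreviation map; "st" can also mean "saint", for instance,
    # so a production system would need more context-aware rules.
    EXPANSIONS = {"1st": "first", "st": "street", "ft": "fort", "blvd": "boulevard"}

    def normalize_address(address: str) -> str:
        """Lowercase, strip punctuation, and expand known abbreviations."""
        tokens = re.sub(r"[^\w\s-]", " ", address.lower()).split()
        return " ".join(EXPANSIONS.get(token, token) for token in tokens)

    a = "50-50 First Street, Fort Totten Boulevard, 11358"
    b = "50-50 1st St. Ft Totten Blvd, 11358"
    print(normalize_address(a) == normalize_address(b))  # True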

Phone Numbers: If you have filled out your number on enough applications and account profiles, you have probably noticed that the format is always different! For example:

+1(234)4567890  vs. 234-456-7890

Again, there can be many variations of the same number, so if you are trying to match records by phone number, Fuzzy Matching is an indispensable technique.

With phone numbers, the variations are most likely in the format, not in the actual digits. We intuitively understand that a difference of just one digit likely means a different phone number, not a typo.
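
Because the differences are usually cosmetic, phone numbers can often be matched by normalizing them first rather than by fuzzy comparison. Here is a minimal sketch, assuming North American numbers; real systems typically rely on a dedicated library such as phonenumbers.

    import re

    def normalize_phone(raw: str, country_code: str = "1") -> str:
        """Keep only the digits, then drop a leading country code so that
        '+1(234)4567890' and '234-456-7890' compare equal."""
        digits = re.sub(r"\D", "", raw)
        if len(digits) == 11 and digits.startswith(country_code):
            digits = digits[1:]
        return digits

    print(normalize_phone("+1(234)4567890") == normalize_phone("234-456-7890"))  # True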

What are Some Use Cases for Entity (Fuzzy) Matching?

Entity (Fuzzy) Matching can enable the linking of records when you don’t have a primary key but do have some identifiers that vary between files. But why would you need to link these records in the first place?

  1. Creating a single view (illustrated above)

    a) Merging tables with different data points to create a view that will provide more insight.

  2. Removing duplicates

    a) A salesperson adds a name from the “Prospect” list to the “Current Client” list but forgets to remove it from the “Prospect” list. Entity (Fuzzy) Matching can help you identify these duplicates.

  3. Improving techniques such as OCR (mentioned above) or Natural Language Processing (NLP)

    a) Let’s say you use OCR software to digitize handwritten notes. Entity (Fuzzy) Matching can cross-reference an existing dictionary to make sure the extracted data consists of real words, or even recommend the correct spelling, as in the sketch below.
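
As a rough sketch of that dictionary cross-reference, Python’s standard difflib module can suggest the closest dictionary entry for an OCR-extracted token. The mini dictionary and the misread word below are invented for the example.

    import difflib

    # Invented mini dictionary; a real deployment would load a full word list.
    dictionary = ["mortgage", "balance", "account", "boulevard"]

    def suggest(token):
        """Return the closest dictionary entry for an OCR-extracted token."""
        return difflib.get_close_matches(token.lower(), dictionary, n=1, cutoff=0.8)

    print(suggest("m0rtgage"))  # ['mortgage'] -- a classic OCR '0'/'o' confusion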

How is Entity (Fuzzy) Matching Being Implemented?

The numerous potential variations in names, addresses, and phone numbers mean that there are many ways to approach the Entity Matching problem. As a result, many algorithms have been created to tackle it.

Here’s a short list of these algorithms posted on Medium:

  • Hamming Distance

  • Levenshtein Distance

  • Damerau-Levenshtein Distance

  • N-Gram based string matching

  • BK Tree

  • Bitap algorithm

Why do we have so many algorithms? As noted above, there are many different variations to account for, and different scenarios call for different approaches. In practice, the data manager usually uses a combination of these algorithms.
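
To make the list a bit more concrete, here is a minimal, unoptimized Python sketch of one entry, the Levenshtein distance, which counts the single-character edits needed to turn one string into another.

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions, and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("Mississipi", "Mississippi"))  # 1 -- one missing 'p'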

The data manager and the business unit can decide what the threshold for a match will be. For example, if you set the threshold at 80% and the algorithms generate a 75% score for a pair of values, that pair would be considered DIFFERENT. However, if the threshold is 70%, the same 75% score would tell the database that the pair is the SAME.
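
In code, the threshold is just a cut-off applied to whatever score the chosen algorithm produces. A minimal sketch, mirroring the 75% example above:

    def classify(score: float, threshold: float) -> str:
        """Apply the agreed matching threshold to an algorithm's score (0-100)."""
        return "SAME" if score >= threshold else "DIFFERENT"

    score = 75  # e.g. the score one of the algorithms above produced for a pair
    print(classify(score, threshold=80))  # DIFFERENT
    print(classify(score, threshold=70))  # SAME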

Try it yourself!

https://asecuritysite.com/forensics/simstring

Click on the link above and type in “Jennifer” and “Jen”. You will see that the Smith-Waterman algorithm generates a score of 100. This algorithm is great at matching strings that are subsets of another string, and therefore great at matching shortened names or even abbreviations such as St for Street (also a score of 100).

Now, try Mount Everest and Mt. Everest, or Fort Knox and Ft. Knox. You will see that the Jaro-Winkler algorithm is best at matching these types of abbreviations (both pairs getting scores of 90).

The implementation of Entity (Fuzzy) Matching will involve:

  1. Understanding your data

    a) Are you dealing with names, addresses, or phone numbers?

    b) Is the data coming from manual data-entry, in which case typos can be expected?

  2. Understanding the algorithms

    a) You (or your data engineers) also need to understand which algorithms will work best with your data

    b) Most likely, you will end up implementing a combination of these algorithms

  3. Establishing thresholds

    a) The lower the threshold, the more likely you are to get false positives – different values that get linked.

    b) The higher the threshold, the more likely you are to get false negatives – equivalent values that do NOT get linked.

    c) Note that false negatives are harder to remediate because they are never identified, whereas false positives are already identified, and a secondary check can call out potential false positives (see the sketch below).
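
One way to put point 3c into practice is to use two cut-offs instead of one: link confident matches automatically and send borderline scores to a manual review queue. A minimal sketch; the cut-off values here are illustrative, not recommendations.

    def route(score: float, auto_link_at: float = 90, review_at: float = 70) -> str:
        """Route a candidate pair based on its match score (0-100)."""
        if score >= auto_link_at:
            return "auto-link"        # confident match
        if score >= review_at:
            return "manual review"    # possible false positive -- secondary check
        return "no link"              # treated as different entities

    for s in (95, 78, 40):
        print(s, "->", route(s))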

Dealing with False Positives and False Negatives

False Positive – the situation where the algorithm generates a relatively high match score for a pair of values that in fact represent different entities.

Example: George W. Bush vs. George H. W. Bush

These are two different individuals, but some algorithms will assess the pair with a relatively high matching score.

False Negative – the situation where the algorithm generates a relatively low match score, but the values in fact represent the same entity.

Example: Bill Clinton vs. William Jefferson Clinton

These two names belong to the same person, but most algorithms will generate a low matching score for these.

Manual intervention is required to identify false positives and false negatives. It makes sense to run the matching at various thresholds to see how many false positives and negatives are generated. For example, a high threshold, e.g. 99%, would likely generate a lot of false negatives. On the other hand, if the threshold is low, e.g. 60%, there will be more false positives.
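
A small, hypothetical experiment along those lines: score a handful of hand-labeled pairs (here with Python’s difflib standing in for the matching algorithm) and count the false positives and false negatives at each candidate threshold.

    from difflib import SequenceMatcher

    # Hand-labeled pairs: (value A, value B, do they refer to the same entity?)
    labeled_pairs = [
        ("Bill Clinton", "William Jefferson Clinton", True),
        ("Yao Ming", "Ming Yao", True),
        ("George W. Bush", "George H. W. Bush", False),
        ("Jennifer Smith", "Jon Smith", False),
    ]

    def score(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    for threshold in (0.99, 0.80, 0.60):
        fp = sum(score(a, b) >= threshold and not same for a, b, same in labeled_pairs)
        fn = sum(score(a, b) < threshold and same for a, b, same in labeled_pairs)
        print(f"threshold={threshold:.2f}: false positives={fp}, false negatives={fn}")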

You will most likely handle false positives better than false negatives, because false positives are matched up and “identified”, whereas false negatives are not matched up and therefore “hidden”.

It also helps to have clean data in the first place. Dataladder noted that “The most effective method to minimize both false positives and negatives is to profile and clean the data sources separately before you conduct matching.”

About BaseCap

BaseCap helps organizations clean the data they rely on so they can confidently use their data to drive business decisions, demonstrate regulatory compliance, provide a seamless customer experience, and more.

BaseCap’s Data Quality Manager empowers teams to efficiently clean and validate data, which as mentioned above, reduces false positives and false negatives when performing Entity (Fuzzy) Matching.

Contact us now to see how our end-to-end automated data quality platform can improve your data quality.
