An Interview with Charles Jekal, CTO of Data Surge
In a world awash in data records that are incomplete and constantly changing, entity resolution (ER) has become a foundational tool for making sense of it all. But doing it right is harder than it sounds. Below is an interview with entity resolution expert Charles Jekal, CTO of Data Surge, who explains how machine learning improves ER results and lowers costs, and why off-the-shelf systems aren’t all they’re cracked up to be.
Welcome Charles, could you please introduce yourself to our readers?
Hello. I’m the CTO for Data Surge and am mainly responsible for setting the technical vision for the company. I have also been responsible for running our person-centric entity resolution projects for quite a while now.
What is person-centric entity resolution?
Person-centric entity resolution involves taking loads and loads of data on people and then de-duplicating the records so that you get a clearer view of who each person is and what their most up-to-date information is. This could include things like address and phone number, but many other attributes as well. The goal is to make sure you have as complete a view of a person as possible. I have a lot of intimate knowledge about how to solve these types of problems from a technical perspective.
To that point, what happens if you get it wrong? I imagine things can get ugly quickly.
In general, there are two things that can happen.
The first scenario is that you end up combining data from two people who are not the same person, but you think they are.
The second is that you have one person represented with two different records. You would then treat them as two different people when they are really the same person.
Neither of these is good. As an example of the second scenario, if you’re an insurance agent and you have duplicate records on the same person, you risk paying them for the same claim multiple times, thinking it’s two different people when it’s only one. And then, on the flip side, in the first scenario, if you’re a police officer with a search warrant, you don’t want to kick down the wrong door and arrest someone because you’ve confused their identity with someone else’s. You can imagine that both scenarios can be a little scary.
While most entity resolution projects require the creation of a fixed set of rules to define the parameters for classification, I understand you’re also using some machine learning techniques. What are the advantages of this approach?
Rules-based ER products become complicated very quickly because you tend to add layers and layers of new rules, and then suddenly the rule set is too hard to maintain. New rules have repercussions on your entity resolution that you didn’t expect. Once your rule set grows that complex, if you need to make changes, you don’t know how to untangle it all. That can become a huge problem, and everyone’s very aware of this issue within the ER space.
Adding machine learning (ML) for the matching element of entity resolution is a real game changer. Because machine learning models learn from data, your system can keep up with changes by retraining the model over time, instead of needing a programmer to re-code a complex set of rules every time things change.
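As an illustration of what a learned matcher can look like, here is a minimal Python sketch. It is not Data Surge's implementation: the field names, the sample records, and the choice of a hand-rolled logistic regression over string-similarity features are all assumptions made for the example. The point it shows is that the weight given to each field comes from labeled pairs rather than from hand-written rules, so adapting to new data means retraining rather than re-coding.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("name", "phone", "email")  # hypothetical schema

def features(a, b):
    """Per-field similarity in [0, 1]; a value missing on either side scores 0.5."""
    out = []
    for field in FIELDS:
        x, y = a.get(field), b.get(field)
        if not x or not y:
            out.append(0.5)  # unknown: neither evidence for nor against a match
        else:
            out.append(SequenceMatcher(None, x.lower(), y.lower()).ratio())
    return out

def predict(weights, bias, x):
    """Probability that the pair described by feature vector x is a match."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=500, lr=0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(FIELDS), 0.0
    for _ in range(epochs):
        for x, label in examples:
            error = predict(weights, bias, x) - label
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical labeled pairs: 1 = same person, 0 = different people.
labeled_pairs = [
    ({"name": "Jon Smith", "phone": "555-0101"},
     {"name": "John Smith", "phone": "555-0101"}, 1),
    ({"name": "Ana Lopez", "email": "ana@example.com"},
     {"name": "Anna Lopez", "email": "ana@example.com"}, 1),
    ({"name": "Jon Smith", "phone": "555-0101"},
     {"name": "Mary Jones", "phone": "555-0199"}, 0),
    ({"name": "Ana Lopez", "email": "ana@example.com"},
     {"name": "Raj Patel", "email": "raj@example.com"}, 0),
]
weights, bias = train([(features(a, b), y) for a, b, y in labeled_pairs])

# Score two unseen pairs with the trained model.
kate = {"name": "Kate Brown", "phone": "555-0123"}
match_prob = predict(weights, bias, features(kate, {"name": "Katie Brown", "phone": "555-0123"}))
nonmatch_prob = predict(weights, bias, features(kate, {"name": "Omar Haddad", "phone": "555-0188"}))
```

A real system would use far more labeled pairs, richer features, and an established ML library; the sketch only shows where the learning replaces the rules.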
What makes entity resolution so challenging?
I think first and foremost, not having a clean data set to train on is going to be most people’s problem. Whether you’re implementing a rules-based solution or an ML solution, ultimately you need a way to evaluate how well you’re doing. It’s a needle-in-a-haystack problem. There may be 50 records out there that contain important information about a person, but how do you know you’ve found all 50 records?
And if you have a labeled data set where some of the answers are already tagged for you, then you can measure how well your rules are doing, at least on the macro scale. But without that, it’s hard to even proceed with creating rules. So, then you’re required to hire experts that understand the data.
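One common way to measure this against a labeled data set is pairwise precision and recall, sketched below in Python. This is an illustrative assumption about how such an evaluation could be done, not a description of Data Surge's tooling, and the record IDs and entity labels are made up. Low precision corresponds to the first error described earlier (records from different people merged together); low recall corresponds to the second (one person left split across several records).

```python
from itertools import combinations

def matched_pairs(assignment):
    """Return the set of record-ID pairs that were given the same entity ID."""
    by_entity = {}
    for record_id, entity_id in assignment.items():
        by_entity.setdefault(entity_id, []).append(record_id)
    pairs = set()
    for records in by_entity.values():
        pairs.update(combinations(sorted(records), 2))
    return pairs

def pairwise_scores(predicted, truth):
    """Pairwise precision and recall of a predicted resolution against labels."""
    pred_pairs = matched_pairs(predicted)
    true_pairs = matched_pairs(truth)
    correct = pred_pairs & true_pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 1.0
    recall = len(correct) / len(true_pairs) if true_pairs else 1.0
    return precision, recall

# Hypothetical labels: r1, r2, r3 are one person; r4, r5 are another.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B"}
# Hypothetical system output: r4 wrongly merged with r1 and r2; r3 and r5 left on their own.
predicted = {"r1": "x", "r2": "x", "r4": "x", "r3": "y", "r5": "z"}

precision, recall = pairwise_scores(predicted, truth)
```

Here only one of the three predicted pairs is correct (precision of one third), and only one of the four true pairs was found (recall of one quarter). Without the `truth` labels, neither number can be computed, which is the difficulty described above.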
The second challenge is the variety of data and how incorrect any individual piece of data might be. Maybe you’ve bought data from somebody, and the middle name is incorrect, but all the other pieces of information are good. Or maybe they didn’t provide the phone number. Another data set might be missing names but have good phone numbers or email addresses. Taking all this incomplete information and piecing together a complete picture is tough. It’s just a technically difficult problem to solve.
And getting harder, right? In terms of the size and scope of it all.
It’s getting easier to acquire data, but it’s becoming much harder to manage it.
I understand that you’re doing entity resolution in real time. Is that something that’s useful for your customers?
Traditionally, entity resolution has been a very time-consuming, expensive process to run. Getting it down to five minutes or so is a pretty big undertaking, and we’re able to do it in a streaming environment. We’ve done a lot of work in real-time entity resolution, but of course, it should fit the use case.
For things like fraud detection, people typically like ‘fresh’ analytics. For other use cases it may not be as critical. Aside from the data being more current, though, another benefit of real time is that it also becomes more affordable to run.
I wouldn’t have thought that. Doesn’t real-time processing limit how large your data sets can get?
No, it’s the inverse. It allows you to go larger and larger without more and more cost. It’s not as expensive because you’re not recomputing the same data over and over again, as you would with typical batch processing. The method we use only processes new or changed data, so it’s much more efficient and cost-effective.
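The cost argument can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method Data Surge uses: the class name, the blocking keys (exact phone or email), and the placeholder match rule are all hypothetical, and the sketch only handles newly arriving records, not changes to existing ones. Each new record is compared only with stored records that share a lookup key, and the resulting matches are tracked with a union-find structure, so earlier work is never repeated.

```python
from collections import defaultdict

class IncrementalResolver:
    """Keeps resolved entities up to date as records arrive one at a time."""

    def __init__(self, is_match, blocking_keys):
        self.is_match = is_match            # pairwise match function (rules or a model)
        self.blocking_keys = blocking_keys  # record -> iterable of cheap lookup keys
        self.records = {}                   # record_id -> record
        self.parent = {}                    # union-find forest over record IDs
        self.index = defaultdict(set)       # blocking key -> record IDs seen so far
        self.comparisons = 0                # number of pairwise comparisons performed

    def entity_of(self, rid):
        """Return the representative record ID of the entity that rid belongs to."""
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def add(self, rid, record):
        """Resolve one new record against candidates that share a blocking key."""
        self.records[rid] = record
        self.parent[rid] = rid
        keys = list(self.blocking_keys(record))
        candidates = set()
        for key in keys:
            candidates |= self.index[key]
        for other in candidates:
            self.comparisons += 1
            if self.is_match(record, self.records[other]):
                self.parent[self.entity_of(rid)] = self.entity_of(other)
        for key in keys:
            self.index[key].add(rid)
        return self.entity_of(rid)

def keys(record):
    """Hypothetical blocking keys: exact phone number and lowercased email."""
    if record.get("phone"):
        yield "phone:" + record["phone"]
    if record.get("email"):
        yield "email:" + record["email"].lower()

def same_person(a, b):
    """Placeholder rule: any shared phone or email counts as a match."""
    return bool(set(keys(a)) & set(keys(b)))

resolver = IncrementalResolver(same_person, keys)
resolver.add("r1", {"name": "Jon Smith", "phone": "555-0101"})
resolver.add("r2", {"name": "Mary Jones", "phone": "555-0199"})
resolver.add("r3", {"name": "John Smith", "phone": "555-0101", "email": "js@example.com"})
resolver.add("r4", {"name": "J. Smith", "email": "JS@example.com"})
```

In this run the four records are resolved with two pairwise comparisons, whereas comparing every record with every other would take six, and a batch job would repeat those six each time it reran. The gap widens as the data set grows, which is the effect described above.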
What about the age-old question of build vs. buy? Does Data Surge have an off-the-shelf entity resolution solution, or are you in favor of building custom solutions for your clients?
What you’ll find is that most off-the-shelf ER products are rules-based. They’re supposed to be fast to implement, but a typical organization will easily spend up to a year or so fine-tuning all the rules to get it working properly, and in less time we could custom build them a better solution.
This is because of our expertise. Most teams that undertake an entity resolution project will quickly train a model and deploy it and think everything is fine. But if they don’t have the sophistication and the maturity to know what to look out for, they’ll make a lot of mistakes and get bad resolutions.
Also, developing ER as a streaming data system is technically hard. But we save our customers significant amounts of money over the long term by incorporating a real-time streaming approach. As I mentioned earlier, when you use this approach you only process the records that are new or changed. If only 10 records have changed or are new, you don’t need to run all your records again.
Because we custom build, we can design your system to do specialized things that you wouldn’t be able to do by buying something off the shelf. One of our clients reached out because their off-the-shelf ER solution lacked the flexibility to do what they needed it to do. These are just the realities of using a system that you buy off the shelf versus having a company like Data Surge design and build it to your specifications.