AI at Davinci: extracting information from documents with machine learning
We have trained machine learning algorithms to automatically extract important
data from business documents like salary slips, purchase agreements and identity cards. So
far, this extraction has mostly been done with manually written rules (time-consuming) or
even by hand (even more time-consuming). The machine learning system builds a statistical
model of what kind of information it expects where, by analyzing a large set of examples. It
does this without any human intervention. It can then use this model to extract the desired
information from new documents. We will use interactive visualizations to
explain how this works.
Imagine you have a salary slip that contains information
that needs to be extracted and stored in a database, like the name and birthdate of the
employee, who the employer is and how much money she makes. And now imagine there is also an
identity card, a bank statement and an employer statement, all containing information you
need to extract and store. And not just one of each, but hundreds of them, every day.

A first approach to tackle this problem could be to manually write a number of rules:
something along the lines of 'if you need the last name, look for the words "last name", and
then take the words below it', or 'if you need the cumulative salary, look for a number
after an €, and then take the highest one'. The problem is that these documents come in a
seemingly endless number of different layouts: sometimes the last name is to the left of the
words "employee name" instead of below "last name", and sometimes the cumulative salary has
no € at all. On top of this, the structure of these documents changes over time: a rule we
spend a lot of manual work on today might be useless next month.
While you can try to solve all of these obstacles with even more rules (like we have been
doing for years), this feels an awful lot like Tantalus' torture; you try to improve the
rules until eternity, always having the feeling you're almost there, but never reaching that
goal of 100% of all important data correctly extracted. With the recent rise of highly
successful machine learning algorithms, we have therefore been experimenting with a new
approach, where we only show the machine learning algorithm a large number of example
documents in which humans have indicated the important information. The task of the
algorithm is then to figure out how it can extract the same information from unseen
documents, fully automatically.
There are many different machine learning algorithms and ways to use them. After various
experiments we have settled on a system that combines two approaches: recognizing content
and recognizing position.
Approach 1: recognizing content
The first approach, and perhaps the most obvious one, classifies every item (= word, number,
date, etc.) on a document based on its characters. During the training phase, we provided the
algorithm with several hundred examples of items of the type we're interested in, and
several thousand examples of other items. This is a small part of the input used to
teach the system to recognize birthdates:
Birthdate? | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
y | 1 | 5 | / | 0 | 2 | / | 1 | 9 | 8 | 6
y | 3 | 0 | - | 0 | 3 | - | 1 | 9 | 5 | 8
y | 2 | 5 |   | 0 | 5 |   | 1 | 9 | 8 | 9
y | 0 | 8 |   | 1 | 2 |   | 1 | 9 | 7 | 6
y | 0 | 4 | . | 1 | 0 | . | 1 | 9 | 6 | 3
n | 1 | . | 4 | 8 | 5 | , | 6 | 0 |   |
n | f | u | l | l | t | i | m | e |   |
n | d | a | t | u | m |   |   |   |   |
n | s | a | l | a | r | i | s | s | p | e
n | l | o | o | n | h | e | f | f | i | n
The task of the machine learning algorithm is to reproduce whether there should be a 'y' or
an 'n' in the first column, based on the other columns. To do this, it will have to figure
out which characters in which positions are good indicators of birthdates. The system could
for example learn that if you find '19' or '20' at positions 7 and 8, it is highly likely
you are looking at a date.
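To make this concrete, here is a minimal sketch of such a character-position classifier: a toy k nearest neighbours with Hamming distance over fixed-width character features. The helper names and the tiny training set mirror the table above but are purely illustrative, not our production code:

```python
from collections import Counter

def featurize(item, width=10):
    """Pad/truncate an item to a fixed width so every example has the
    same number of character-position features."""
    return tuple(item[:width].ljust(width))

def hamming(a, b):
    """Distance = number of positions where the characters differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_confidence(item, examples, k=3):
    """Fraction of the k nearest training examples labeled 'y' (birthdate).
    `examples` is a list of (text, label) pairs, as in the table above."""
    feats = featurize(item)
    nearest = sorted(examples, key=lambda ex: hamming(featurize(ex[0]), feats))[:k]
    votes = Counter(label for _, label in nearest)
    return votes["y"] / k

# Toy training set mirroring the table above (illustrative only).
examples = [
    ("15/02/1986", "y"), ("30-03-1958", "y"), ("25 05 1989", "y"),
    ("08 12 1976", "y"), ("04.10.1963", "y"),
    ("1.485,60", "n"), ("fulltime", "n"), ("datum", "n"), ("loonheffin", "n"),
]

print(knn_confidence("12-06-1971", examples))  # high: digits line up with the date examples
print(knn_confidence("parttime", examples))    # low: nearest neighbours are all non-dates
```

The fixed-width featurization is what lets the classifier pick up positional cues like '19' at positions 7 and 8; a bag-of-characters representation would lose exactly that signal.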
The best performing algorithm for this task turned out to be k nearest neighbours. A
statistical model of birthdates was built in your browser while you were reading the first
paragraphs, so you can now play with the results. The percentage indicates how confident the
algorithm is that the text in the textbox is a birthdate.
Note how this system is not a simple date recognizer but is really tuned to identify
birthdates:
01-02-1999 is perfectly possible as a date; it's just not a very
likely birthdate for an employee (at least not in our dataset), and it thus gets a low
score. This makes it possible to separate birthdates from other dates on documents, like the
current date and start date of employment. You can clearly see this if we look at the
confidence values for all items on a number of example documents in the interactive
visualization below. Besides the k nearest neighbours algorithm you can also see the results
of two other algorithms.
Algorithm confidence for recognizing birthdates on salary slips, using
only the letters, numbers and punctuation in the item itself. Brighter colors mean higher
confidence.
In most cases kNN is the most confident of the correct answer, whereas the Multilayer
Perceptron (a neural network) also triggers on other things that contain numbers.
Interestingly, when the division between birthdates and employee start dates gets blurry,
all algorithms start to make mistakes. A few examples of false positives and false negatives of
this system are
this system are
12-06-1995,
22/02/1991,
12 JAN 1997 and
05 09
1988. These are all dates that could be either birthdates of young employees or
start dates of slightly older, loyal employees. To separate these, we had to use a second
approach that looks at the positions and relations between items instead of just their
content.
Approach 2: recognizing position
Our second approach is to identify useful labels in a document and look around them. This is
most likely how humans do it too: if you want to know which date on a salary slip is the
birthdate, it's probably more convenient to look for the label 'birthdate' and use the date
next to it (instead of first scanning the whole document for dates). For this approach, we
need a way to automatically figure out what labels are useful. We have settled on a system
that looks in four directions around the item of interest; these are some examples of what
you find if you look around birthdates:
Direction | Found labels
Top | geb. geb. geboortedatum geboortedatum 170454319 bijz.tarief geb.datum geb.datum
Left | geboortedatum geboortedatum geboortedatum geboortedatum geboortedatum geb.datum geboorte-datum geboorte-datum geboortedatum: geboortedatum:
Bottom | dienst dienst dienst dienst dienst dienst dienst ongehuwd ongehuwd ongehuwd
Right | ln ln ln dagen dagen recht recht recht stam stam
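The four-direction lookup itself can be sketched roughly like this, assuming every OCR'd item comes with (text, x, y) coordinates; the function name, the tolerance value and the example coordinates are illustrative assumptions, not our actual implementation:

```python
def labels_around(target, items, tolerance=5):
    """Return the nearest item in each of the four directions around
    `target`. Items roughly on the same row (within `tolerance` units)
    count as 'left'/'right'; same column counts as 'top'/'bottom'."""
    tx, ty = target[1], target[2]
    found = {}
    for direction, keep, dist in [
        ("left",   lambda x, y: abs(y - ty) <= tolerance and x < tx, lambda x, y: tx - x),
        ("right",  lambda x, y: abs(y - ty) <= tolerance and x > tx, lambda x, y: x - tx),
        ("top",    lambda x, y: abs(x - tx) <= tolerance and y < ty, lambda x, y: ty - y),
        ("bottom", lambda x, y: abs(x - tx) <= tolerance and y > ty, lambda x, y: y - ty),
    ]:
        candidates = [it for it in items if it is not target and keep(it[1], it[2])]
        if candidates:
            found[direction] = min(candidates, key=lambda it: dist(it[1], it[2]))[0]
    return found

# Hypothetical salary-slip fragment: y grows downward, as on a page.
items = [
    ("geboortedatum", 10, 50),   # label to the left of the birthdate
    ("15/02/1986",    80, 50),   # the birthdate itself
    ("in dienst",     80, 70),   # label below it
    ("datum",         80, 10),   # label above it
]
print(labels_around(items[1], items))
# {'left': 'geboortedatum', 'top': 'datum', 'bottom': 'in dienst'}
```

Running this over every known birthdate in the training set and collecting the results per direction is what produces label lists like the table above.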
You can see there are words in this list that are clearly useful (like 'geboortedatum' and
'geb.datum', meaning 'birthdate'), but also words we did not expect, like the fact that
apparently the words 'dienst' ('employment') and 'ongehuwd' ('unmarried') are often below
birthdates. While this is not something that we would typically use in a handwritten rule to
identify birthdates, it might still be a useful hint for a machine learning algorithm.
Another thing we noticed is that there are a lot of duplicate items and items that are very
similar but not identical. To group similar items together, we used the clustering algorithm
DBSCAN. This is an example of what the result of such a clustering can be, where larger
circles indicate items that occur more often:
You see that a lot of items with similar meanings get grouped together, like all ways to
write 'geboortedatum'. Note that it is not flawless, however: the unrelated words 'recht'
('law') and 'geslacht' ('gender') are grouped together because their final three letters are
identical.
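A tiny DBSCAN over label strings could look like the sketch below. The distance function (Jaccard distance between character trigrams) and the parameter values are assumptions chosen for illustration, not our actual configuration; with a deliberately loose `eps`, this toy version even reproduces the 'recht'/'geslacht' mix-up, since those words share their final letters:

```python
def trigrams(word):
    """Character trigrams, padded so short words still get features."""
    w = f"  {word} "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def distance(a, b):
    """Jaccard distance between the trigram sets of two labels."""
    ta, tb = trigrams(a), trigrams(b)
    return 1 - len(ta & tb) / len(ta | tb)

def dbscan(words, eps=0.85, min_pts=2):
    """Minimal DBSCAN: labels within `eps` of a core point end up in the
    same cluster; isolated labels get cluster -1 (noise)."""
    labels = {w: None for w in words}
    cluster = -1
    for w in words:
        if labels[w] is not None:
            continue
        neighbours = [v for v in words if distance(w, v) <= eps]
        if len(neighbours) < min_pts:
            labels[w] = -1          # noise, may be claimed by a cluster later
            continue
        cluster += 1
        labels[w] = cluster
        queue = list(neighbours)
        while queue:                 # expand the cluster from each core point
            v = queue.pop()
            if labels[v] in (None, -1):
                labels[v] = cluster
                more = [u for u in words if distance(v, u) <= eps]
                if len(more) >= min_pts:
                    queue.extend(more)
    return labels

words = ["geboortedatum", "geboortedatum:", "geb.datum", "recht", "geslacht"]
print(dbscan(words))
# the three 'geboortedatum' variants share one cluster; 'recht' and
# 'geslacht' wrongly share another, as in the article
```

In practice one would use a library implementation (e.g. scikit-learn's `DBSCAN` with a precomputed distance matrix); the point here is only the mechanism: density-based grouping with no need to choose the number of clusters in advance.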
The next step is to see if these grouped labels can be used to find the original entity we
are interested in. We tested the same set of algorithms as with the content based approach.
This time, the Multilayer Perceptron (a neural network) was the most successful.
Algorithm confidence for recognizing birthdates, using labels identified
as potentially useful earlier. Useful labels are indicated in yellow; brighter colors
mean higher confidence.
Like the content based approach, this position based approach is not perfect. A common
mistake is that it often does not know whether it should look to the left of or below a label
(like in documents 4 and 5)... which of course makes sense, because with only information on
positions, there is no way the algorithm could have known that the item below 'birthdate'
looks nothing like a birthdate.
Combining the two approaches
Summarizing, the content based approach makes mistakes that could be solved by looking at the
labels around it, and the position based approach makes mistakes that could be solved by
looking at the content. A logical next step is to combine both approaches. This is less
trivial than it may sound: a confidence score of 0.9 may be normal for one algorithm and
really high for the next. In the end, we settled on a ranking-based normalization that seems
to return the correct entity 98% of the time. Remaining errors are mostly due to (1)
unusual document layouts that only occur once in the dataset and (2) image-to-text (OCR)
mistakes already introduced during preprocessing.
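The idea behind a ranking-based normalization can be sketched as follows (an assumed, simplified shape of the technique, not the exact production logic): within each algorithm, raw confidences are replaced by their rank among all candidate items on the document, so a score of 0.9 no longer means different things for different models, and the ranks are then summed per candidate:

```python
def rank_normalize(scores):
    """Map each candidate to its rank (0 = most confident) within one
    algorithm's scores, discarding the raw confidence scale."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {item: rank for rank, item in enumerate(order)}

def combine(per_algorithm_scores):
    """Sum the ranks per candidate over all algorithms and return the
    candidate with the best (lowest) total rank."""
    totals = {}
    for scores in per_algorithm_scores:
        for item, rank in rank_normalize(scores).items():
            totals[item] = totals.get(item, 0) + rank
    return min(totals, key=totals.get)

# Hypothetical scores: the content model is very sure, the position model
# is lukewarm, but both prefer the same candidate.
content  = {"15/02/1986": 0.95, "01/05/2019": 0.40, "fulltime": 0.01}
position = {"15/02/1986": 0.55, "01/05/2019": 0.50, "fulltime": 0.05}
print(combine([content, position]))   # '15/02/1986'
```

Because only the ordering within each algorithm matters, a systematically over- or under-confident model cannot drown out the others, which is exactly the calibration problem described above.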
A major remaining challenge we are currently working on is that this combination of
approaches will not work for every entity type. For example, account numbers on salary slips
are often not accompanied by a label, while the job description does not have clearly
recognizable content.
What's next
Besides the challenge explained above, we are currently working to get this system running in
the cloud (using Amazon SageMaker). There it will run in parallel to our main rule-based
system, so we can test how it performs on real-time production data in the cloud. An
interesting advantage of this data is that it is manually corrected by humans and gets
updated continuously, with thousands of new documents each and every day. This means that
with regular retraining we can ensure that the system automatically stays up to date, both
in terms of document layout and in terms of what our customers expect. This way, we are
another step closer to our ultimate end goal: a system where our customers can manually
start labeling some new piece of information they are interested in, and our AI system
slowly takes over.
For more information, contact
Wessel Stoop.
Interested in working with us? Check our
open positions.