AI at Davinci: extracting information from documents with machine learning
We have trained machine learning algorithms to automatically extract important
data from business documents like salary slips, purchase agreements and identity cards. So
far, this extraction has mostly been done with manually written rules (time-consuming) or
even by hand (even more time-consuming). The machine learning system builds a statistical
model of what kind of information it expects where, by analyzing a large set of examples. It
does this without any human intervention. It can then use this model to extract the desired
information from new documents. We will use interactive visualizations to
explain how this works.
Imagine you have a salary slip that contains information
that needs to be extracted and stored in a database, like the name and birthdate of the
employee, who the employer is and how much money she makes. And now imagine there is also an
identity card, a bank statement and an employer statement, all containing information you
need to extract and store. And not just one of each, but hundreds of them, every day.

A first approach to tackle this problem could be to manually write a number of rules:
something along the lines of 'if you need the last name, look for the words "last name", and
then take the words below it', or 'if you need the cumulative salary, look for a number
after an €, and then take the highest one'. The problem is that these documents come in a
seemingly endless number of different layouts: sometimes the last name is to the left of the
words "employee name" instead of below "last name", and sometimes the cumulative salary has
no € at all. On top of this, the structure of these documents changes over time: a rule we
spend a lot of manual work on today might be useless next month.
While you can try to solve all of these obstacles with even more rules (like we have been
doing for years), this feels an awful lot like Tantalus' torture; you try to improve the
rules until eternity, always having the feeling you're almost there, but never reaching that
goal of 100% of all important data correctly extracted. With the recent rise of highly
successful machine learning algorithms, we have therefore been experimenting with a new
approach, where we only show the machine learning algorithm a large number of example
documents in which humans have indicated the important information. The task of the
algorithm is then to figure out how it can extract the same information from unseen
documents, fully automatically.
There are many different machine learning algorithms and ways to use them. After various
experiments we have settled on a system that combines two approaches: recognizing content
and recognizing position.
Approach 1: recognizing content
The first approach, and perhaps the most obvious one, classifies every item (= word, number,
date, etc.) on a document based on its characters. During the training phase, we provided the
algorithm with several hundred examples of items of the type we're interested in, and
several thousand examples of other items. This is a small part of the input used to
teach the system to recognize birthdates:
Birthdate? | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
y | 1 | 5 | / | 0 | 2 | / | 1 | 9 | 8 | 6
y | 3 | 0 | - | 0 | 3 | - | 1 | 9 | 5 | 8
y | 2 | 5 |   | 0 | 5 |   | 1 | 9 | 8 | 9
y | 0 | 8 |   | 1 | 2 |   | 1 | 9 | 7 | 6
y | 0 | 4 | . | 1 | 0 | . | 1 | 9 | 6 | 3
n | 1 | . | 4 | 8 | 5 | , | 6 | 0 |   |
n | f | u | l | l | t | i | m | e |   |
n | d | a | t | u | m |   |   |   |   |
n | s | a | l | a | r | i | s | s | p | e
n | l | o | o | n | h | e | f | f | i | n
The task of the machine learning algorithm is to reproduce whether there should be a 'y' or
an 'n' in the first column, based on the other columns. To do this, it will have to figure
out which characters in which positions are good indicators of birthdates. The system could
for example learn that if you find '19' or '20' at positions 7 and 8, it is highly likely
you are looking at a date.
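To make this concrete, here is a minimal sketch of such a character-position classifier: a toy k nearest neighbours with Hamming distance over fixed-width character features. The helper names and the tiny training set mirror the table above but are purely illustrative, not our production code:

```python
from collections import Counter

def featurize(item, width=10):
    """Pad/truncate an item to a fixed width so every example has the
    same number of character-position features."""
    return tuple(item[:width].ljust(width))

def hamming(a, b):
    """Distance = number of positions where the characters differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_confidence(item, examples, k=3):
    """Fraction of the k nearest training examples labeled 'y' (birthdate).
    `examples` is a list of (text, label) pairs, as in the table above."""
    feats = featurize(item)
    nearest = sorted(examples, key=lambda ex: hamming(featurize(ex[0]), feats))[:k]
    votes = Counter(label for _, label in nearest)
    return votes["y"] / k

# Toy training set mirroring the table above (illustrative only).
examples = [
    ("15/02/1986", "y"), ("30-03-1958", "y"), ("25 05 1989", "y"),
    ("08 12 1976", "y"), ("04.10.1963", "y"),
    ("1.485,60", "n"), ("fulltime", "n"), ("datum", "n"), ("loonheffin", "n"),
]

print(knn_confidence("12-06-1971", examples))  # high: digits line up with the date examples
print(knn_confidence("parttime", examples))    # low: nearest neighbours are all non-dates
```

The fixed-width featurization is what lets the classifier pick up positional cues like '19' at positions 7 and 8; a bag-of-characters representation would lose exactly that signal.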
The best performing algorithm for this task turned out to be k nearest neighbours. A
statistical model of birthdates was built in your browser while you were reading the first
paragraphs, so you can now play with the results. The percentage indicates how confident the
algorithm is that the text in the textbox is a birthdate.
Note how this system is not a simple date recognizer but is really tuned to identify
birthdates:
01-02-1999 is perfectly possible as a date; it's just not a very
likely birthdate for an employee (at least not in our dataset), and it thus gets a low
score. This makes it possible to separate birthdates from other dates on documents, like the
current date and start date of employment. You can clearly see this if we look at the
confidence values for all items on a number of example documents in the interactive
visualization below. Besides the k nearest neighbours algorithm you can also see the results
of two other algorithms.
Algorithm confidence for recognizing birthdates on salary slips, using
only the letters, numbers and punctuation in the item itself. Brighter colors mean higher
confidence.
In most cases kNN is the most confident of the correct answer, whereas the Multilayer
Perceptron (a neural network) also triggers on other things that contain numbers.
Interestingly, when the division between birthdates and employee start dates gets blurry,
all algorithms start to make mistakes. A few examples of false positives and false negatives of
this system are
this system are
12-06-1995,
22/02/1991,
12 JAN 1997 and
05 09
1988. These are all dates that could be either birthdates of young employees or
start dates of slightly older, loyal employees. To separate these, we had to use a second
approach that looks at the positions and relations between items instead of just their
content.
Approach 2: recognizing position
Our second approach is to identify useful labels in a document and look around them. This is
most likely how humans do it too: if you want to know which date on a salary slip is the
birthdate, it's probably more convenient to look for the label 'birthdate' and use the date
next to it (instead of first scanning the whole document for dates). For this approach, we
need a way to automatically figure out what labels are useful. We have settled on a system
that looks in four directions around the item of interest; these are some examples of what
you find if you look around birthdates:
Direction | Found labels
Top | geb. geb. geboortedatum geboortedatum 170454319 bijz.tarief geb.datum geb.datum
Left | geboortedatum geboortedatum geboortedatum geboortedatum geboortedatum geb.datum geboorte-datum geboorte-datum geboortedatum: geboortedatum:
Bottom | dienst dienst dienst dienst dienst dienst dienst ongehuwd ongehuwd ongehuwd
Right | ln ln ln dagen dagen recht recht recht stam stam
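The four-direction lookup itself can be sketched roughly like this, assuming every OCR'd item comes with (text, x, y) coordinates; the function name, the tolerance value and the example coordinates are illustrative assumptions, not our actual implementation:

```python
def labels_around(target, items, tolerance=5):
    """Return the nearest item in each of the four directions around
    `target`. Items roughly on the same row (within `tolerance` units)
    count as 'left'/'right'; same column counts as 'top'/'bottom'."""
    tx, ty = target[1], target[2]
    found = {}
    for direction, keep, dist in [
        ("left",   lambda x, y: abs(y - ty) <= tolerance and x < tx, lambda x, y: tx - x),
        ("right",  lambda x, y: abs(y - ty) <= tolerance and x > tx, lambda x, y: x - tx),
        ("top",    lambda x, y: abs(x - tx) <= tolerance and y < ty, lambda x, y: ty - y),
        ("bottom", lambda x, y: abs(x - tx) <= tolerance and y > ty, lambda x, y: y - ty),
    ]:
        candidates = [it for it in items if it is not target and keep(it[1], it[2])]
        if candidates:
            found[direction] = min(candidates, key=lambda it: dist(it[1], it[2]))[0]
    return found

# Hypothetical salary-slip fragment: y grows downward, as on a page.
items = [
    ("geboortedatum", 10, 50),   # label to the left of the birthdate
    ("15/02/1986",    80, 50),   # the birthdate itself
    ("in dienst",     80, 70),   # label below it
    ("datum",         80, 10),   # label above it
]
print(labels_around(items[1], items))
# {'left': 'geboortedatum', 'top': 'datum', 'bottom': 'in dienst'}
```

Running this over every known birthdate in the training set and collecting the results per direction is what produces label lists like the table above.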
You can see there are words in this list that are clearly useful (like 'geboortedatum' and
'geb.datum', meaning 'birthdate'), but also words we did not expect, like the fact that
apparently the words 'dienst' ('employment') and 'ongehuwd' ('unmarried') are often below
birthdates. While this is not something that we would typically use in a handwritten rule to
identify birthdates, it might still be a useful hint for a machine learning algorithm.
Another thing we noticed is that there are a lot of duplicate items and items that are very
similar but not identical. To group similar items together, we used the clustering algorithm
DBSCAN. This is an example of what the result of such a clustering can be, where larger
circles indicate items that occur more often:
You see that a lot of items with similar meanings get grouped together, like all ways to
write 'geboortedatum'. Note that it is not flawless, however: the unrelated words 'recht'
('law') and 'geslacht' ('gender') are grouped together because their final three letters are
identical.
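A tiny DBSCAN over label strings could look like the sketch below. The distance function (Jaccard distance between character trigrams) and the parameter values are assumptions chosen for illustration, not our actual configuration; with a deliberately loose `eps`, this toy version even reproduces the 'recht'/'geslacht' mix-up, since those words share their final letters:

```python
def trigrams(word):
    """Character trigrams, padded so short words still get features."""
    w = f"  {word} "
    return {w[i:i + 3] for i in range(len(w) - 2)}

def distance(a, b):
    """Jaccard distance between the trigram sets of two labels."""
    ta, tb = trigrams(a), trigrams(b)
    return 1 - len(ta & tb) / len(ta | tb)

def dbscan(words, eps=0.85, min_pts=2):
    """Minimal DBSCAN: labels within `eps` of a core point end up in the
    same cluster; isolated labels get cluster -1 (noise)."""
    labels = {w: None for w in words}
    cluster = -1
    for w in words:
        if labels[w] is not None:
            continue
        neighbours = [v for v in words if distance(w, v) <= eps]
        if len(neighbours) < min_pts:
            labels[w] = -1          # noise, may be claimed by a cluster later
            continue
        cluster += 1
        labels[w] = cluster
        queue = list(neighbours)
        while queue:                 # expand the cluster from each core point
            v = queue.pop()
            if labels[v] in (None, -1):
                labels[v] = cluster
                more = [u for u in words if distance(v, u) <= eps]
                if len(more) >= min_pts:
                    queue.extend(more)
    return labels

words = ["geboortedatum", "geboortedatum:", "geb.datum", "recht", "geslacht"]
print(dbscan(words))
# the three 'geboortedatum' variants share one cluster; 'recht' and
# 'geslacht' wrongly share another, as in the article
```

In practice one would use a library implementation (e.g. scikit-learn's `DBSCAN` with a precomputed distance matrix); the point here is only the mechanism: density-based grouping with no need to choose the number of clusters in advance.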
The next step is to see if these grouped labels can be used to find the original entity we
are interested in. We tested the same set of algorithms as with the content based approach.
This time, the Multilayer Perceptron (a neural network) was the most successful.
Algorithm confidence for recognizing birthdates, using labels identified
as potentially useful earlier. Useful labels are indicated in yellow; brighter colors
mean higher confidence.
Like the content based approach, this position based approach is not perfect. A common
mistake is that it often does not know whether it should look to the left of or below a label
(like in documents 4 and 5)... which of course makes sense, because with only information on
positions, there is no way the algorithm could have known that the item below 'birthdate'
looks nothing like a birthdate.
Combining the two approaches
Summarizing, the content based approach makes mistakes that could be solved by looking at the
labels around it, and the position based approach makes mistakes that could be solved by
looking at the content. A logical next step is to combine both approaches. This is less
trivial than it may sound: a confidence score of 0.9 may be normal for one algorithm and
really high for the next. In the end, we settled on a ranking-based normalization that seems
to return the correct entity 98% of the time. Remaining errors are mostly due to (1)
unusual document layouts that only occur once in the dataset and (2) image-to-text (OCR)
mistakes already introduced during preprocessing.
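The idea behind a ranking-based normalization can be sketched as follows (an assumed, simplified shape of the technique, not the exact production logic): within each algorithm, raw confidences are replaced by their rank among all candidate items on the document, so a score of 0.9 no longer means different things for different models, and the ranks are then summed per candidate:

```python
def rank_normalize(scores):
    """Map each candidate to its rank (0 = most confident) within one
    algorithm's scores, discarding the raw confidence scale."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {item: rank for rank, item in enumerate(order)}

def combine(per_algorithm_scores):
    """Sum the ranks per candidate over all algorithms and return the
    candidate with the best (lowest) total rank."""
    totals = {}
    for scores in per_algorithm_scores:
        for item, rank in rank_normalize(scores).items():
            totals[item] = totals.get(item, 0) + rank
    return min(totals, key=totals.get)

# Hypothetical scores: the content model is very sure, the position model
# is lukewarm, but both prefer the same candidate.
content  = {"15/02/1986": 0.95, "01/05/2019": 0.40, "fulltime": 0.01}
position = {"15/02/1986": 0.55, "01/05/2019": 0.50, "fulltime": 0.05}
print(combine([content, position]))   # '15/02/1986'
```

Because only the ordering within each algorithm matters, a systematically over- or under-confident model cannot drown out the others, which is exactly the calibration problem described above.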
A major remaining challenge we are currently working on is that this combination of
approaches will not work for every entity type. For example, account numbers on salary slips
are often not accompanied by a label, while the job description does not have clearly
recognizable content.
What's next
Besides the challenge explained above, we are currently working to get this system running in
the cloud (using Amazon SageMaker). There it will run in parallel to our main rule-based
system, so we can test how it performs on real-time production data in the cloud. An
interesting advantage of this data is that it is manually corrected by humans and gets
updated continuously, with thousands of new documents each and every day. This means that
with regular retraining we can ensure that the system automatically stays up to date, both
in terms of document layout and in terms of what our customers expect. This way, we are
another step closer to our ultimate end goal: a system where our customers can manually
start labeling some new piece of information they are interested in, and our AI system
slowly takes over.
For more information, contact
Wessel Stoop.
Interested in working with us? Check our
open positions.