This is the second step in Making sense of Ocr – an orchestration wrapper for the APIs we’ll need. In Google Vision and OCR I showed how to use the Vision API to turn a PDF into a series of pages, blocks, paragraphs, and symbols, all held together by bounding boxes. In other words, the output is a map of the image below.

Reading the content

The Vision output will be a series of files in cloud storage. The objective is to assign approximate rows and columns to all of that content and to analyze what type of data is held in each cell. The structure of this app is


and it is executed like this

where the path is the one specified at OCR time, and the country is used for default localization (for example, adding the international country code to phone numbers)
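To illustrate what a default country buys us, here is a minimal sketch of normalizing a phone number as printed on a page into international form. The `DIALING_CODES` table and the `toInternational` function are hypothetical names made up for this example, not part of the app, and the rules are deliberately simplistic; a real implementation would more likely lean on a maintained phone-number library.

```javascript
// Hypothetical, deliberately tiny table of dialing codes (an assumption of
// this sketch, not taken from the app).
const DIALING_CODES = { GB: '44', FR: '33', US: '1' };

// Convert a phone number as printed on the page into international form,
// falling back to the default country when no international prefix is present.
function toInternational(raw, defaultCountry) {
  const trimmed = raw.trim();
  const digits = trimmed.replace(/\D/g, '');
  // Already international: keep the digits and normalise the prefix to "+".
  if (trimmed.startsWith('+')) return `+${digits}`;
  if (trimmed.startsWith('00')) return `+${digits.slice(2)}`;
  const code = DIALING_CODES[defaultCountry];
  if (!code) return null; // unknown country: don't guess
  // Drop a single leading trunk "0" (used in GB and FR; harmless for US
  // numbers, which do not start with 0).
  const national = digits.replace(/^0/, '');
  return `+${code}${national}`;
}
```

Without the default country, a number printed in national format could not be turned into an unambiguous international one.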

index.js

languageserver.js

languageorchestrate.js

This process orchestrates a number of APIs to try to assign meaning (a purpose) to each text item identified, with each step enriching what we know about that item.
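The idea of each step enriching the previous result can be sketched as a small asynchronous pipeline. This is only an illustration of the pattern: the `steps` array and the `enrich` function are hypothetical, and the two steps shown are stand-ins for the real API calls.

```javascript
// Hypothetical enrichment steps: each receives the item so far and returns
// extra facts about it. Real steps would call out to external APIs.
const steps = [
  async (item) => ({ length: item.text.length }),
  async (item) => ({ looksNumeric: /^[\d\s+()-]+$/.test(item.text) }),
];

// Run the steps in order, merging each step's findings into the item so
// that later steps can see what earlier ones discovered.
async function enrich(item, enrichers) {
  let current = { ...item };
  for (const step of enrichers) {
    current = { ...current, ...(await step(current)) };
  }
  return current;
}
```

Running the steps sequentially rather than in parallel is a design choice of the sketch: it lets a later step use what an earlier one found.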

languageText.js

Organizes the data through proximity. It also uses a number of APIs to identify common formats such as phone numbers and email addresses, and looks up my own GraphQL API to recognize application-specific entities.
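As a rough illustration of organizing by proximity, the sketch below groups text items into approximate rows by comparing the vertical centres of their bounding boxes. The `{ text, box }` item shape, the function names, and the `tolerance` parameter are assumptions made for this example; `box` is a simplified stand-in for the list of vertices Vision attaches to each bounding box.

```javascript
// Vertical centre and left edge of a bounding box given as {x, y} vertices.
const centerY = (box) => box.reduce((sum, v) => sum + v.y, 0) / box.length;
const minX = (box) => Math.min(...box.map((v) => v.x));

// Group items into approximate rows: an item joins the current row when its
// vertical centre lies within `tolerance` of that row's first item.
function groupIntoRows(items, tolerance) {
  const sorted = [...items].sort((a, b) => centerY(a.box) - centerY(b.box));
  const rows = [];
  for (const item of sorted) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(centerY(item.box) - row.y) <= tolerance) {
      row.items.push(item);
    } else {
      rows.push({ y: centerY(item.box), items: [item] });
    }
  }
  // Within each row, order the items left to right.
  return rows.map((row) => row.items.sort((a, b) => minX(a.box) - minX(b.box)));
}
```

The tolerance matters because scanned text is rarely perfectly aligned, so exact coordinate matches cannot be relied on.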
The difference between breaks in words and breaks in entities.
Telling the difference between “Mike London” and “Mike” from “London” is a typical problem here. We can’t use ‘columns’ because that concept doesn’t exist except through fuzzy matching of bounding box coordinates, and in any case a document may well contain multiple formats, which would confuse any attempt to columnize. Luckily, Vision provides the concept of ‘breaks’- these are

Each word consists of symbols (the letters of the word) and sometimes a break property, which gives a clue as to whether this is a phrase containing a space, or whether a bigger space indicates a different entity.
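A minimal sketch of using breaks to separate entities follows. Vision reports the break type on a symbol's `property.detectedBreak.type`, with values such as `SPACE`, `SURE_SPACE`, `EOL_SURE_SPACE`, `HYPHEN` and `LINE_BREAK`. The decision about which of those end an entity rather than just a word, and the function names used, are assumptions of this sketch rather than the app's actual rules; `HYPHEN` is not handled specially here.

```javascript
// Break types treated here as ending an entity, not just a word.
// This choice is an assumption of the sketch.
const ENTITY_BREAKS = new Set(['SURE_SPACE', 'EOL_SURE_SPACE', 'LINE_BREAK']);

// The text of a word is the concatenation of its symbols.
const wordText = (word) => word.symbols.map((s) => s.text).join('');

// The break, if any, is read from the last symbol of the word.
const wordBreak = (word) => {
  const last = word.symbols[word.symbols.length - 1];
  return last && last.property && last.property.detectedBreak
    ? last.property.detectedBreak.type
    : null;
};

// Join words into entities: an ordinary SPACE keeps words together,
// while a bigger break closes the current entity.
function wordsToEntities(words) {
  const entities = [];
  let current = [];
  for (const word of words) {
    current.push(wordText(word));
    if (ENTITY_BREAKS.has(wordBreak(word))) {
      entities.push(current.join(' '));
      current = [];
    }
  }
  if (current.length) entities.push(current.join(' '));
  return entities;
}
```

With this rule, “Mike” followed by a `SPACE` and then “London” becomes one entity, while “Mike” followed by a `SURE_SPACE` and then “London” becomes two.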

The results of these APIs are ‘scored’ to come up with the most likely role for an entity – for example
a phone number

a name

a place

a profession
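The scoring idea can be sketched as a set of detectors that each return a score for one role, with the highest score winning. The detectors, their regular expressions, the score values, and the `minScore` threshold below are all hypothetical and much cruder than API-backed checks would be; they only illustrate the mechanism.

```javascript
// Hypothetical detectors: each returns a score between 0 and 1 for one role.
const detectors = {
  phone: (text) => (/^\+?[\d\s()-]{7,}$/.test(text) ? 0.9 : 0),
  email: (text) => (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(text) ? 0.95 : 0),
  // Two or more capitalised words is only weak evidence of a name.
  name: (text) => (/^[A-Z][a-z]+(\s[A-Z][a-z]+)+$/.test(text) ? 0.5 : 0),
};

// Score the text against every detector and keep the best role, unless
// even the best score falls below the threshold.
function mostLikelyRole(text, minScore = 0.3) {
  const scored = Object.entries(detectors)
    .map(([role, detect]) => ({ role, score: detect(text) }))
    .sort((a, b) => b.score - a.score);
  const best = scored[0];
  return best.score >= minScore ? best : { role: 'unknown', score: 0 };
}
```

Keeping the score alongside the role means a later step can still override a weak guess, such as a name that turns out to be a place.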

languageContent – uses the Google Natural Language API to do some entity analysis on text content. This can return an entity ‘mid’, which is a lookup key to the Google Knowledge Graph if an entity is known (for example, “Sean Connery” has an entry in the Knowledge Graph).
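For illustration, here is a minimal sketch of an entity analysis call using the `@google-cloud/language` Node.js client, keeping only the entities that carry a `mid`. The function names `entitiesWithMid` and `analyze` are hypothetical, and the sketch assumes the package is installed and credentials are configured; it is not the app's actual languageContent code.

```javascript
// Keep only the entities that carry a Knowledge Graph id ("mid").
function entitiesWithMid(entities) {
  return entities
    .filter((e) => e.metadata && e.metadata.mid)
    .map((e) => ({ name: e.name, type: e.type, mid: e.metadata.mid }));
}

// Assumes @google-cloud/language is installed and credentials are set up.
// The package is required lazily so the helper above works without it.
async function analyze(content) {
  const language = require('@google-cloud/language');
  const client = new language.LanguageServiceClient();
  const [result] = await client.analyzeEntities({
    document: { content, type: 'PLAIN_TEXT' },
  });
  return entitiesWithMid(result.entities);
}
```

Entities without a `mid` are still returned by the API; they are simply not known to the Knowledge Graph, so other steps have to decide what they are.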

See Making sense of Ocr for next steps

Related

This example is actually part of a much larger workflow which makes use of a range of APIs. If you found this useful, you may also like some of the pages here.