Making sense of OCR

In the project I’m using to illustrate some of the capabilities of GCP, I need to make sense of a variety of documents, some very complex and some less so. The basic problem is that these are documents of unknown structure (some with no structure to speak of), and from them I need to extract phone numbers, company names, people’s names, email addresses, professions, addresses and various other bits and pieces and, worse, find a way of associating them. To simplify the discussion, I’ll use this handwritten document as a very small example of the problem.

APIs and platform

I’ll be using a collection of APIs throughout this series of articles, including:
  • Google Vision – for OCR and simple classification
  • Google Natural Language – for entity analysis
  • Google Knowledge Graph – for getting more info about each entity
  • Google Cloud Storage – for storing and tracking intermediate results, building on the More cloud streaming article, where I covered how to stream efficiently to and from storage from Node
  • GraphQL – to query my own industry-specific API and mutate the results to a CockroachDB database, all running in a Kubernetes cluster
  • Google Sheets API – for examining and validating results, along with Service account impersonation for Google APIS with Nodejs client, to be able to do all that from Node
  • Fuse.js for fuzzy matching
  • Google libphonenumber for phone number validation
  • Google Cloud Functions to run the
  • Lots and lots of code and some other minor APIs that I’ll mention when I get to them
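To give a feel for the extraction half of the problem, here is a minimal sketch in plain Node of a naive first pass over OCR text. It is not the project’s code: the function name `extractCandidates`, both regular expressions and the sample text are assumptions made up for illustration. It only nominates candidate email addresses and phone-like digit runs, and does nothing about associating them with people or companies.

```javascript
// Hypothetical helper, not taken from the project itself.
// Nominates candidate emails and phone-like digit runs from raw OCR text.
function extractCandidates(text) {
  // Deliberately loose patterns: OCR output is noisy, so these only propose
  // candidates that a later validation step would need to confirm or reject.
  const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const phonePattern = /\+?\d[\d\s().-]{6,}\d/g;
  return {
    emails: text.match(emailPattern) || [],
    phones: (text.match(phonePattern) || []).map((p) => p.trim()),
  };
}

// Made-up sample standing in for the text an OCR step might return.
const sample = 'Jane Doe, plumber\njane.doe@example.com\ntel +44 20 7946 0958';
const found = extractCandidates(sample);
console.log(found);
```

Patterns like these over-match (dates and reference numbers can look like phone numbers) and miss anything the OCR has mangled, which is why validation and fuzzy matching libraries appear in the list above.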
This is a fairly involved project, to be written up over a period of time, with the steps detailed below.

Steps

Since G+ is closed, you can now star and follow post announcements and discussions on GitHub, here