Using AWS services with OCRs (Textract and Tesseract.js) in Node.js

Noam
6 min read · Oct 29, 2020


How to use OCRs in an AWS lambda function and upload the results to S3.

In a recent job assignment, I needed to work with OCR services. I wanted to offload as much of the work as possible to external services, to keep our own server as light as possible.
While trying to make it all work, I realized that most of the pieces exist on the internet, but they cannot be found together in one place. The research took most of my time, and I wanted to help other people in the same situation.

All the code that you will see here can be found on my GitHub.

Prerequisites:

1. An AWS account (if you don’t have one, click here).
2. Basic knowledge of Node.js.

Now let’s get started with installing the modules we will use:
npm install aws-sdk tesseract.js dotenv file-type node-fetch lodash serverless serverless-dotenv-plugin download
I will explain the basic use of each module as we reach the code snippet that uses it.

AWS Textract

Textract is an OCR service provided by Amazon AWS. Textract returns its results in a unique structure: one big object containing many different blocks, and each block can reference other blocks.
Most of the code in my Textract integration came from another Medium post: “Extract text and data from any document using Amazon Textract in Node.js”. Hatem Alimam did an amazing job explaining everything, but I took only what I needed from his code.

In this code, I used the power of Textract to recognize key-value pairs and extract them. I used process.env.* for configuration; more on that later on.
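
Here is a condensed sketch of that key-value extraction, based on the approach from that post. It assumes the file arrives as a Buffer and the region comes from the .env file; the function names (extractKeyValues, getText) are mine, not part of the Textract API.

```javascript
// textract.js - a minimal sketch of key-value extraction with aws-sdk v2
require('dotenv').config();
const AWS = require('aws-sdk');

const textract = new AWS.Textract({ region: process.env.AWS_REGION });

// Collect the text of all WORD / SELECTION_ELEMENT children of a block
const getText = (block, blockMap) => {
  let text = '';
  (block.Relationships || []).forEach((rel) => {
    if (rel.Type !== 'CHILD') return;
    rel.Ids.forEach((id) => {
      const child = blockMap[id];
      if (child.BlockType === 'WORD') text += `${child.Text} `;
      if (child.BlockType === 'SELECTION_ELEMENT' && child.SelectionStatus === 'SELECTED') text += 'X ';
    });
  });
  return text.trim();
};

const extractKeyValues = async (fileBuffer) => {
  // FORMS is the feature type that returns KEY_VALUE_SET blocks
  const { Blocks } = await textract
    .analyzeDocument({ Document: { Bytes: fileBuffer }, FeatureTypes: ['FORMS'] })
    .promise();

  // Index every block by its Id so relationships can be resolved
  const blockMap = {};
  Blocks.forEach((block) => { blockMap[block.Id] = block; });

  // For every KEY block, find its linked VALUE block and read both texts
  const result = {};
  Blocks.filter((b) => b.BlockType === 'KEY_VALUE_SET' && b.EntityTypes.includes('KEY'))
    .forEach((keyBlock) => {
      const valueRel = (keyBlock.Relationships || []).find((r) => r.Type === 'VALUE');
      const valueBlock = valueRel ? blockMap[valueRel.Ids[0]] : null;
      result[getText(keyBlock, blockMap)] = valueBlock ? getText(valueBlock, blockMap) : '';
    });
  return result;
};

module.exports = { extractKeyValues };
```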

Tesseract.js OCR

Tesseract is a well-known open-source OCR engine, and Tesseract.js is a great, pure-JavaScript port of it. I chose to add Tesseract because it supports 100+ languages and because of the big community supporting and using this OCR.
I used Tesseract.js for its raw power: it finds a lot more words than other OCRs, but its accuracy is not always the best.

I used one OCR for the key-value pairs and one for the whole text so I could compare both results.

I’m using the basic recognize function from the Tesseract.js examples, but with a little twist: the cache is saved in the /tmp folder. This is important for the Lambda function, since /tmp is the only writable directory there.
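
Here is a minimal sketch of that function, using the tesseract.js v2 worker API that was current when this post was written; recognizeText is my own name for it, and the only real twist is the cachePath option.

```javascript
// tesseract.js OCR - a minimal sketch of full-text recognition
const { createWorker } = require('tesseract.js');

const recognizeText = async (filePath, lang = 'eng') => {
  const worker = createWorker({
    cachePath: '/tmp', // Lambda only allows writes under /tmp
  });

  await worker.load();
  await worker.loadLanguage(lang);
  await worker.initialize(lang);

  const { data: { text } } = await worker.recognize(filePath);
  await worker.terminate();

  return text;
};

module.exports = { recognizeText };
```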

The controller

I tried to build my code in a way that makes it easy to add a new OCR, disable an existing one, or change the code without affecting everything else.
First of all, I check whether I am working with a local file or an external link. The difference is that an external file needs to be downloaded first, while a local one does not.

I used the fetch module to get the response from the external link,
the file-type module to detect the file’s extension,
the download module to (surprise) download the file,
and finally fs to save the file in /tmp (again, for the Lambda function).

After getting the file ready for the OCRs, I sent the file path to each OCR (Textract got the file contents rather than a path) and uploaded the results to S3.
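
Here is a rough sketch of that controller flow. The helpers (extractKeyValues, recognizeText, uploadToS3) are the ones sketched in the other snippets, and all of the names are assumptions rather than anything from the original repository.

```javascript
// controller.js - a minimal sketch of the flow described above
const fs = require('fs');
const fetch = require('node-fetch');
const FileType = require('file-type');
const download = require('download');

const { extractKeyValues } = require('./textract');
const { recognizeText } = require('./tesseract');
const { uploadToS3 } = require('./s3');

const runOcr = async (source, resultName) => {
  let filePath = source;

  // External link: download it into /tmp with the right extension
  if (source.startsWith('http')) {
    const response = await fetch(source);                        // get the external link response
    const buffer = await response.buffer();
    const { ext } = (await FileType.fromBuffer(buffer)) || { ext: 'png' }; // fallback extension

    filePath = `/tmp/${resultName}.${ext}`;
    fs.writeFileSync(filePath, await download(source));          // save the file in /tmp
  }

  // Textract gets the file contents, Tesseract.js gets the path
  const [keyValues, fullText] = await Promise.all([
    extractKeyValues(fs.readFileSync(filePath)),
    recognizeText(filePath),
  ]);

  // Upload both results to S3 as one JSON file
  await uploadToS3(resultName, { keyValues, fullText });
};

module.exports = { runOcr };
```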

Uploading to S3

Amazon Simple Storage Service is a cloud object storage service: easy to use, and very simple to upload to and retrieve from. The code for uploading is straightforward, so it needs little explaining.
The S3.upload function takes a few basic parameters:
Bucket: the name of the bucket in which we want to save the results,
Key: the name of the file,
ContentType: makes sure the file is saved as a JSON file,
Body: the content of the file.
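
A minimal sketch of that upload, assuming the bucket name comes from the .env file and aws-sdk v2 (the version installed above):

```javascript
// s3.js - a minimal sketch of uploading the OCR results as JSON
require('dotenv').config();
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

const uploadToS3 = (fileName, results) =>
  s3.upload({
    Bucket: process.env.BUCKET_NAME,   // the bucket in which we save the results
    Key: `${fileName}.json`,           // the name of the file
    ContentType: 'application/json',   // make sure it is saved as a JSON file
    Body: JSON.stringify(results),     // the content of the file
  }).promise();

module.exports = { uploadToS3 };
```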

The .env file

We have used it in different places until now, mostly for the same purpose: the .env file lets us reuse the same variables in different places around the project.
Here is an example of the one I used:
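
(The variable names below are the ones assumed by the sketches in this post; replace the values with your own.)

```
# .env - illustrative values only
AWS_REGION=us-east-1
BUCKET_NAME=my-ocr-results-bucket
QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/ocr-queue.fifo
```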

Serverless

I used Serverless in this project to deploy faster every time I changed the code that goes into the Lambda function. Serverless works perfectly fine with AWS and makes your life easier.
To use Serverless we need to install it, which we did at the beginning of this post. Now open a terminal in the main folder and run

serverless create --template aws-nodejs

Serverless will scaffold the project and generate an example handler file and a serverless.yaml file.
The serverless.yaml file has a lot of possible configuration, but here is a breakdown of what we need, followed by a small sketch:

service: The name of the project; we will also use it as the name of the Lambda function we create.
plugins: Additional functionality, which can be custom-made by the community.
We use serverless-dotenv-plugin to create environment variables inside the Lambda function from our own .env file.
frameworkVersion: The Serverless version.
provider: The cloud provider we want Serverless to deploy to (AWS in our case).
functions: The main function’s name and location.
It will be used as the Lambda’s entry point.
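
Putting those keys together, here is a minimal serverless.yaml sketch. The service name, runtime, and region are placeholders, assuming the Node.js 12 runtime that was current at the time of writing:

```yaml
# serverless.yaml - a minimal sketch matching the keys described above
service: ocr-service               # also used for the Lambda function name

plugins:
  - serverless-dotenv-plugin       # loads our .env into the Lambda environment

frameworkVersion: '2'

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  ocr:
    handler: handler.ocr           # main function and its location (handler.js -> ocr)
```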

And lastly, we add our own handler function to wire everything together and activate our service.

We want to start our OCR service through an SQS queue, which we will create shortly. We use the MessageDeduplicationId that SQS is given to create the name of the result file.
Ideally, every file should have a unique name.
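
A minimal sketch of that handler, assuming the runOcr controller from the earlier snippet. With a FIFO queue, the MessageDeduplicationId shows up under each record’s attributes:

```javascript
// handler.js - a minimal sketch of the Lambda entry point triggered by SQS
'use strict';
const { runOcr } = require('./controller');

module.exports.ocr = async (event) => {
  // With a batch size of 1 there is only one record, but loop to be safe
  for (const record of event.Records) {
    const url = record.body;                                   // MessageBody: the file URL
    const fileName = record.attributes.MessageDeduplicationId; // unique name for the result
    await runOcr(url, fileName);
  }

  return { statusCode: 200 };
};
```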

SQS

We want to work with a Simple Queue Service to help us keep everything in order. First of all, we send a message to SQS so it will activate the OCR service:
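
A minimal sketch of the sender, assuming the queue URL lives in the .env file and the queue is a FIFO queue (MessageDeduplicationId and MessageGroupId only exist on FIFO queues):

```javascript
// sendMessage.js - a minimal sketch of pushing an OCR job onto the queue
require('dotenv').config();
const AWS = require('aws-sdk');

const sqs = new AWS.SQS({ region: process.env.AWS_REGION });

const sendOcrJob = (fileUrl, fileName) =>
  sqs.sendMessage({
    QueueUrl: process.env.QUEUE_URL,
    MessageBody: fileUrl,              // the URL we want to send to the OCR
    MessageDeduplicationId: fileName,  // becomes the name of the file in S3
    MessageGroupId: 'ocr',             // required for FIFO queues
  }).promise();

module.exports = { sendOcrJob };
```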

As you can see, MessageBody contains the URL we want to send to the OCR, MessageDeduplicationId will become the name of the file in S3, and MessageGroupId is less relevant for what I use SQS for (it is simply required on FIFO queues).
On GitHub, this file contains 3 more lines that can be used for testing.

The next file generates a link from S3 so we can view the result.
We generate a pre-signed link that is valid for 3 minutes.
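
A minimal sketch of that, using aws-sdk v2’s getSignedUrlPromise; the bucket name again comes from the .env file:

```javascript
// getLink.js - a minimal sketch of generating a pre-signed link to a result
require('dotenv').config();
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

const getResultLink = (fileName) =>
  s3.getSignedUrlPromise('getObject', {
    Bucket: process.env.BUCKET_NAME,
    Key: `${fileName}.json`,
    Expires: 60 * 3, // the link is valid for 3 minutes
  });

module.exports = { getResultLink };
```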

Connect everything we did!

First, let’s deploy our code into the Lambda function.
Open the project folder and run the command serverless deploy.
It can take a few minutes, and when it’s done it will have created and configured the Lambda function for you!

Second, let’s connect the SQS queue as a trigger to the Lambda function.
Go to your AWS Management Console: Services > Lambda.
Go to your newly created Lambda function.
Press Add trigger:

Search for SQS:

Change the batch size to 1 (we want to run this service for every new file!), and press Add:

Wait a few minutes and we are done!

Summary

We used AWS, two different OCRs, and Serverless, all without making your server work hard!

Hope this post helped :)
