Extracting text from images with Tesseract OCR, OpenCV, and Python

As we will see by the end of this post, Tesseract works well for scanning clean documents, and the text it extracts from an image can easily be converted to Word, PDF, or any other required format.

It offers fairly high accuracy and handles a wide variety of fonts. This is very useful for institutions that deal with a lot of documentation, such as government offices, hospitals, and educational institutes. Since release 4.0, Tesseract supports deep-learning-based OCR that is significantly more accurate. You can access the code file and input image here to create your own OCR task. Try replicating this task and see whether you get the desired results. Happy coding!

Coding

Here, I will use the following sample receipt image:



The first step is image thresholding; the code for it is part of the script further below, after the setup notes. Before anything else, point pytesseract at the Tesseract executable:


pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'  # your path may be different

For Windows only

1 - Install Tesseract OCR on your computer.

Get the installer from https://github.com/UB-Mannheim/tesseract/wiki

Download the version that suits your system.

2 - Add the Tesseract install path to your system environment (Edit the system environment variables).

3 - Run pip install pytesseract. The script below also imports cv2, which comes from pip install opencv-python. The Tesseract engine itself comes from the installer in step 1, not from pip.

4 - Add this line to every Python script that uses pytesseract:

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'  # your path may be different

5 - Run the code.
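
If you are not sure whether the path in step 4 is right, a quick check like the one below can help. It is only a sketch: the helper name find_tesseract is made up for this post, and the candidate path is the same example path used above, so adjust it to your own install location.

```python
import os
import shutil


def find_tesseract(candidates):
    """Return the first usable Tesseract executable path, or None."""
    # Try the explicit install locations first (step 4).
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Otherwise fall back to whatever is on the PATH (step 2).
    return shutil.which("tesseract")


# Example path from this post; yours may be different.
tesseract_path = find_tesseract(["C:/Program Files/Tesseract-OCR/tesseract.exe"])
if tesseract_path is None:
    print("Tesseract not found; check the install path and the PATH variable")
else:
    print("Using Tesseract at", tesseract_path)
```

Once a path is found, you can assign it to pytesseract.pytesseract.tesseract_cmd exactly as in step 4.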


===================================================


By default, Tesseract treats the input image as a full page of text and segments it automatically. If you only want to capture a small region of text from the image, you can change the segmentation behaviour by passing the --psm option. In the default mode, Tesseract fully automates page segmentation but does not perform orientation and script detection (OSD). The main configuration parameters for Tesseract are described below:

Page Segmentation Mode (--psm): This setting tells Tesseract how it should split an image into regions of text. The command-line help lists 14 modes (0 to 13). Choose the one that works best for your requirement from the table below:

Mode   Description
0      Orientation and script detection (OSD) only
1      Automatic page segmentation with OSD
2      Automatic page segmentation, but no OSD or OCR
3      Fully automatic page segmentation, but no OSD (default)
4      Assume a single column of text of variable sizes
5      Assume a single uniform block of vertically aligned text
6      Assume a single uniform block of text
7      Treat the image as a single text line
8      Treat the image as a single word
9      Treat the image as a single word in a circle
10     Treat the image as a single character
11     Sparse text. Find as much text as possible in no particular order
12     Sparse text with OSD
13     Raw line. Treat the image as a single text line, bypassing Tesseract-specific hacks

Engine Mode (--oem): Tesseract has several engine modes that differ in accuracy and speed. Tesseract 4 introduced an additional LSTM neural net mode that generally works best. The table below lists the OCR engine modes:

OCR engine mode   Description
0                 Legacy engine only
1                 Neural net LSTM engine only
2                 Legacy + LSTM engines
3                 Default, based on what is currently available
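
To see how the two options combine, the sketch below builds the string that is later passed to pytesseract through its config argument. The helper name build_config is hypothetical (it is not part of pytesseract), and the valid ranges are simply the ones listed in the two tables above.

```python
PSM_MODES = range(0, 14)  # page segmentation modes 0 to 13
OEM_MODES = range(0, 4)   # OCR engine modes 0 to 3


def build_config(oem=3, psm=3):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    if oem not in OEM_MODES:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in PSM_MODES:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


# Default engine, single uniform block of text (a reasonable guess for a receipt)
print(build_config(oem=3, psm=6))
```

Validating the numbers up front is just a convenience here; an invalid mode would otherwise only surface as an error when Tesseract itself runs.
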
# Import required packages
import csv

import cv2
import pytesseract
from pytesseract import Output

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'  # your path may be different

# Read the image from which text needs to be extracted
img = cv2.imread(r'G:\PARAS\anuradha.jpeg')
# cv2.imread returns None instead of raising if the file cannot be read
if img is None:
    raise FileNotFoundError('Could not read the input image; check the path')

# Preprocessing of the image starts here

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Convert it to a binary image by thresholding.
# This step is required for a coloured image: if you skip it,
# Tesseract may not detect the text correctly and can give incorrect results.
# Perform Otsu thresholding
threshold_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('threshold image', threshold_img)
# Keep the output window open until the user presses a key
cv2.waitKey(0)
# Close the windows currently on screen
cv2.destroyAllWindows()
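
To give a feel for what the Otsu flag does, here is a minimal pure-Python sketch of the idea: try every possible threshold and keep the one that maximizes the between-class variance of the two pixel groups. The function name otsu_threshold and the sample pixel values are made up for illustration, and this is not OpenCV's actual implementation.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that best separates dark and bright pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # no bright pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor)
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Made-up grayscale values: dark ink around 10, bright paper around 200
pixels = [10, 12, 11, 13, 200, 205, 210, 198]
t = otsu_threshold(pixels)
# Same rule as cv2.THRESH_BINARY: values above the threshold become 255
binary = [255 if p > t else 0 for p in pixels]
print(t, binary)
```

On a real image, cv2.threshold does this work for you; the sketch only shows why a clean two-tone scan is split so cleanly.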








# Now you can see the difference between the original image and the thresholded image.
# The thresholded image shows a clear separation between white pixels and black pixels,
# so if you pass this image to Tesseract, it will detect the text regions more easily
# and give more accurate results.
# Configure the parameters for Tesseract:
custom_config = r'--oem 3 --psm 6'
# now feeding image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())

total_boxes = len(details['text'])
for sequence_number in range(total_boxes):
    # conf can be a float-like value in newer pytesseract versions, so parse it with float()
    if float(details['conf'][sequence_number]) > 30:
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number],
                        details['width'][sequence_number], details['height'][sequence_number])
        # threshold_img has a single channel, so only the first colour component (0) is used
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# Display the image
cv2.imshow('captured text', threshold_img)

# Keep the output window open until the user presses a key
cv2.waitKey(0)
# Close the windows currently on screen
cv2.destroyAllWindows()


Note: The confidence check in the loop above keeps only those detected words whose confidence score is greater than 30. This value was chosen by manually inspecting the text and confidence scores in the details dictionary. After this, verify that all the text results are correct, even those whose confidence score is between 30 and 40. You need to verify this because the image contains a mixture of digits, other characters, and text, and Tesseract is not told whether a field contains only text or only digits. The whole document is passed to Tesseract as it is, and Tesseract decides for each value whether it is text or digits.
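
To inspect which words pass the confidence check, a small filter like the one below can help. It is a sketch that works on any dictionary shaped like the output of image_to_data; the function name confident_words and the sample values are made up for illustration, and it assumes the confidence may arrive either as a number or as a numeric string.

```python
def confident_words(details, min_conf=30.0):
    """Return (text, confidence, box) for words above the confidence threshold."""
    results = []
    for i, text in enumerate(details["text"]):
        conf = float(details["conf"][i])  # handles 38, "96.4" and "-1" alike
        if text.strip() and conf > min_conf:
            box = (details["left"][i], details["top"][i],
                   details["width"][i], details["height"][i])
            results.append((text, conf, box))
    return results


# Made-up sample in the same layout as the details dictionary
sample = {
    "text": ["", "TOTAL", "12.50", "~"],
    "conf": ["-1", "96.4", 38, "12"],
    "left": [0, 10, 80, 120],
    "top": [0, 5, 5, 5],
    "width": [200, 60, 35, 8],
    "height": [40, 18, 18, 18],
}

for text, conf, box in confident_words(sample):
    print(f"{text!r} conf={conf} box={box}")
```

Printing the words next to their scores makes it easier to judge whether 30 is a sensible cut-off for your own images.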




# Group the detected words into lines: an empty string marks the end of a line
parse_text = []
word_list = []
last_word = ''
for word in details['text']:
    if word != '':
        word_list.append(word)
        last_word = word
    if (last_word != '' and word == '') or (word == details['text'][-1]):
        parse_text.append(word_list)
        word_list = []

# The following code writes the resulting text to a file:

with open('result_text.txt', 'w', newline='') as file:
    csv.writer(file, delimiter=' ').writerows(parse_text)


The output file is result_text.txt (it will be inside the project folder).
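
As an alternative to splitting on empty strings, the words can also be grouped with the page_num, block_num, par_num, and line_num fields that image_to_data returns alongside the text. The sketch below assumes those keys are present in the dictionary; the function name group_words_by_line and the sample data are made up for illustration.

```python
def group_words_by_line(details):
    """Join words that share the same page, block, paragraph and line number."""
    lines = {}  # dicts keep insertion order, so lines stay in reading order
    for i, word in enumerate(details["text"]):
        if not word.strip():
            continue  # skip the empty entries used for non-word boxes
        key = (details["page_num"][i], details["block_num"][i],
               details["par_num"][i], details["line_num"][i])
        lines.setdefault(key, []).append(word)
    return [" ".join(words) for words in lines.values()]


# Made-up sample in the same layout as the details dictionary
sample_details = {
    "text": ["", "SUPER", "MARKET", "", "TOTAL", "12.50"],
    "page_num": [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 2, 2, 2],
}

for line in group_words_by_line(sample_details):
    print(line)
```

This avoids the blank rows that the empty-string approach can produce when several empty entries follow each other.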

Limitations of Tesseract

  • Its OCR accuracy is not as high as that of some currently available commercial solutions.
  • It is not capable of recognizing handwritten text.
  • If a document contains languages that are not supported by Tesseract then results will be poor.
  • It requires a clear image as input. A poor quality scan may produce poor results in OCR.
  • It does not give accurate results for images affected by artifacts such as partial occlusion, distorted perspective, or a complex background.
  • It is not good at analyzing the normal reading order of documents. For example, it might fail to recognize that a document contains two columns and try to join the text across those columns.
  • It does not expose information about the font family of the text.
