Extracting text from images with Tesseract OCR, OpenCV, and Python
In short, Tesseract is well suited to scanning clean documents, and you can easily convert the recognized text from an image to Word, from PDF to Word, or to any other required format.
It offers fairly high accuracy and handles a wide variety of fonts. This is very useful for institutions that deal with a lot of documentation, such as government offices, hospitals, and educational institutes. Since release 4.0, Tesseract has supported deep-learning-based (LSTM) OCR, which is significantly more accurate. You can access the code file and input image here to create your own OCR task. Try replicating this task to achieve the desired results. Happy coding!
Coding
Here, I will use the following sample receipt image:
For Windows Only
1 - You need to have Tesseract OCR installed on your computer. Get it from https://github.com/UB-Mannheim/tesseract/wiki and download the version that suits your system.
2 - Add the Tesseract install folder to your system PATH (Edit the system environment variables).
3 - Run pip install pytesseract and pip install opencv-python (the script below imports cv2). The Tesseract engine itself comes from the installer in step 1, so a separate pip install tesseract is not needed.
4 - Add this line near the top of every Python script that uses pytesseract:
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'  # your path may be different
5 - Run the code.
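Before running the script, it can help to confirm that Python can actually find the Tesseract executable. The following is a minimal sketch using only the standard library. The helper name find_tesseract and its fallback path are illustrative assumptions and are not part of pytesseract.

```python
import os
import shutil


def find_tesseract(fallback=r"C:/Program Files/Tesseract-OCR/tesseract.exe"):
    """Return a path to the Tesseract executable, or None if it cannot be found.

    The fallback path is an assumption (a common Windows install location);
    adjust it to match your own installation.
    """
    # First look on the PATH (this succeeds when step 2 was done correctly)
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the explicit install location
    if os.path.isfile(fallback):
        return fallback
    return None


path = find_tesseract()
if path is None:
    print("Tesseract was not found; check the installation and PATH.")
else:
    print("Tesseract found at:", path)
```

If a path is found, it could be assigned to pytesseract.pytesseract.tesseract_cmd in place of the hard-coded string from step 4.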
By default, Tesseract treats the input image as a page of text and segments it automatically. If you are only interested in capturing a small region of text from the image, you can select a different segmentation with the --psm option. In the default mode, Tesseract fully automates page segmentation but does not perform orientation and script detection (OSD). The configuration parameters for Tesseract are described below:
Page Segmentation Mode (--psm): This setting tells Tesseract how to split an image into regions of text. The command-line help lists 14 modes (0 to 13). Choose the one that best fits your requirement from the table below:
Mode | Working description |
0 | Orientation and script detection (OSD) only |
1 | Automatic page segmentation with OSD |
2 | Automatic page segmentation, but no OSD or OCR |
3 | Fully automatic page segmentation, but no OSD (Default) |
4 | Assume a single column of text of variable sizes |
5 | Assume a single uniform block of vertically aligned text |
6 | Assume a single uniform block of text |
7 | Treat the image as a single text line |
8 | Treat the image as a single word |
9 | Treat the image as a single word in a circle |
10 | Treat the image as a single character |
11 | Sparse text. Find as much text as possible in no particular order |
12 | Sparse text with OSD |
13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific |
Engine Mode (--oem): Tesseract has several engine modes that differ in accuracy and speed. Tesseract 4 introduced an additional LSTM neural net mode, which generally works best. The table below lists the OCR engine modes:
OCR engine mode | Working description |
0 | Legacy engine only |
1 | Neural net LSTM only |
2 | Legacy + LSTM engines |
3 | Default, based on what is available |
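The two tables above can be combined into the config string that pytesseract accepts through its config argument. Here is a small sketch of a helper that builds and validates such a string. The function name build_tesseract_config is hypothetical, and the valid ranges are taken from the tables above.

```python
def build_tesseract_config(psm=3, oem=3):
    """Build a config string such as '--oem 3 --psm 6' for pytesseract.

    Defaults follow the tables above: psm 3 and oem 3 are the default modes.
    """
    if psm not in range(0, 14):
        raise ValueError("psm must be between 0 and 13")
    if oem not in range(0, 4):
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


# A single uniform block of text, default engine
print(build_tesseract_config(psm=6))  # --oem 3 --psm 6
```

Validating the numbers up front gives a clear Python error instead of a less obvious failure from the Tesseract command line.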
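# Import required packages
import csv

import cv2
import pytesseract
from pytesseract import Output

# Windows only: point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'  # your path may be different

# Read image from which text needs to be extracted
img = cv2.imread(r'G:\PARAS\anuradha.jpeg')
# cv2.imread returns None instead of raising when the file cannot be read
if img is None:
    raise FileNotFoundError('Could not read the input image; check the path')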
# Preprocessing the image starts
# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Convert it to a binary image by thresholding.
# This step is required for colored images: if you skip it,
# Tesseract may not detect the text correctly and can give incorrect results.
# Perform Otsu thresholding
threshold_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('threshold image', threshold_img)
# Maintain output window until user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()
# Now you can see the difference between the original image and the thresholded image.
# The thresholded image shows a clear separation between white pixels and black pixels,
# so Tesseract can detect the text regions more easily and give more accurate results.
# Configure the parameters for Tesseract
custom_config = r'--oem 3 --psm 6'
# now feeding image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())
total_boxes = len(details['text'])
for sequence_number in range(total_boxes):
    # conf may be returned as a string such as '96.5', so parse it as a float
    if float(details['conf'][sequence_number]) > 30:
        (x, y, w, h) = (
            details['left'][sequence_number], details['top'][sequence_number],
            details['width'][sequence_number], details['height'][sequence_number])
        # threshold_img is single-channel, so only the first color component (0, black) is used
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# display image
cv2.imshow('captured text', threshold_img)
# Maintain output window until user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()
Note: The confidence check in the loop above keeps only those text boxes whose confidence score is greater than 30. Choose this value by manually inspecting the text and confidence scores in the details dictionary. After this, verify that the text results are correct even when their confidence score is between 30 and 40. This check is needed because images contain a mixture of digits, other characters, and text, and Tesseract is not told whether a field contains only text or only digits. The whole document is passed to Tesseract as it is, and Tesseract decides for each value whether it is text or digits.
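To make the threshold easier to tune, the confidence check can be pulled out into a small function and tried on sample data first. This is only a sketch: the function name confident_words is hypothetical, and the sample dictionary is made up to mimic the shape of the image_to_data output (it is not real OCR output). It assumes that confidence values may arrive as integers, floats, or strings, and that rows without a word carry an empty text.

```python
def confident_words(details, min_conf=30.0):
    """Return (word, confidence) pairs whose confidence exceeds min_conf."""
    kept = []
    for text, conf in zip(details["text"], details["conf"]):
        score = float(conf)  # conf may arrive as int, float, or string
        # Skip empty entries (layout rows) and low-confidence words
        if text.strip() and score > min_conf:
            kept.append((text, score))
    return kept


# Hypothetical sample shaped like image_to_data output (not real OCR results)
sample = {
    "text": ["", "TOTAL", "12.50", "~"],
    "conf": ["-1", "96.5", 35, "12"],
}
print(confident_words(sample))  # [('TOTAL', 96.5), ('12.50', 35.0)]
```

Raising min_conf to 40 on real output and comparing the two results is one way to review the borderline words mentioned in the note.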
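# Group the recognized words into lines of text
parse_text = []
word_list = []
last_index = len(details['text']) - 1
for index, word in enumerate(details['text']):
    if word.strip() != '':
        word_list.append(word)
    # An empty entry marks a boundary between lines; the last entry closes the final line
    if word_list and (word.strip() == '' or index == last_index):
        parse_text.append(word_list)
        word_list = []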
# The next code writes the result text to a file:
with open('result_text.txt', 'w', newline="") as file:
    csv.writer(file, delimiter=" ").writerows(parse_text)
The output file is result_text.txt (it will be inside the project folder).
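The grouping above relies on empty text entries to separate lines. As an alternative sketch, the dictionary returned by image_to_data also carries block_num, par_num, and line_num keys, which can be used to group words explicitly. The function name group_words_by_line is hypothetical, the sample data is made up for illustration, and a single-page image is assumed (page_num is ignored).

```python
from collections import defaultdict


def group_words_by_line(details):
    """Group recognized words into lines using Tesseract's layout indices."""
    lines = defaultdict(list)
    rows = zip(details["block_num"], details["par_num"],
               details["line_num"], details["text"])
    for block, par, line, text in rows:
        if text.strip():  # skip layout rows that carry no word
            lines[(block, par, line)].append(text)
    # Sort by (block, paragraph, line) so rows keep Tesseract's numbering order
    return [lines[key] for key in sorted(lines)]


# Hypothetical sample shaped like image_to_data output (not real OCR results)
sample = {
    "block_num": [1, 1, 1, 1, 1],
    "par_num":   [1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 2, 2],
    "text":      ["", "Coffee", "2.50", "Total", "2.50"],
}
print(group_words_by_line(sample))  # [['Coffee', '2.50'], ['Total', '2.50']]
```

The returned list of lists has the same shape as parse_text, so it could be passed to csv.writer(...).writerows in the same way.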
Limitations of Tesseract
- Its OCR accuracy is lower than that of some currently available commercial solutions.
- It is not capable of recognizing handwritten text.
- If a document contains languages that are not supported by Tesseract, the results will be poor.
- It requires a clear image as input. A poor-quality scan may produce poor OCR results.
- It does not give accurate results for images affected by artifacts such as partial occlusion, distorted perspective, or a complex background.
- It is not good at analyzing the normal reading order of documents. For example, it might fail to recognize that a document contains two columns and try to join the text across those columns.
- It does not expose information about the font family of the text.