18 Apr 2018
Relative Study of Document Layout Analysis Algorithms for Printed Document Images
This survey studies algorithms that can be used for document layout analysis and compares their results. For removal of the image mask, we describe Bloomberg’s algorithm and CRLA. For text segmentation, we study the recursive XY cut algorithm, RLSA and RLSO.
Physical layout analysis of printed document images is the first step of OCR conversion. For the OCR to work effectively, its input must contain only text, i.e. every image has to be removed from the document first. If this is not done properly, the OCR returns garbage values. To avoid this, we discuss two algorithms, Bloomberg’s algorithm and CRLA, that can be used to remove images from document images.
The next step is text segmentation, in which we find the text blocks inside the document. The coordinates of these text blocks are then passed as input to the OCR. To perform this segmentation, we discuss the recursive XY cut algorithm, RLSA and RLSO.
The first step in document layout analysis is to remove the images present in the original document. We discuss Bloomberg’s algorithm, along with its variations, and CRLA for image removal.
Bloomberg’s algorithm is primarily used to find the image mask of halftone images. Its implementation uses basic morphological operations. The algorithm has the following steps:
The main issue with Bloomberg’s algorithm is that it is unable to distinguish between text and sketches (i.e. line drawings) in a printed document image.
CRLA stands for Constrained Run Length Algorithm. In this algorithm we apply horizontal and vertical smoothing to the document image to get a clear separation between text and images in the document.
Enhanced CRLA smooths only the text part of the document image and leaves the non-textual part unsmoothed.
At step 9 we extract the text part of the document image and at step 15 we extract the non-text part of the document image.
The main advantage of CRLA is that it clearly separates the text and non-text parts of the document image. It also works effectively for sketches as well as halftones. Because smoothing is applied selectively, its complexity is considerably reduced.
However, after the removal of the non-textual part of the document image, some stray pixels remain in the image. The algorithm assumes that any connected component in the halftone image whose height is less than 1 cm is a text element. As a result, unwanted components are present in the final image.
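To make the height rule concrete, here is a minimal sketch. It assumes that the connected components have already been extracted as bounding boxes and that the scan resolution is known. The function names, the `(top, left, bottom, right)` box format and the 300 dpi figure are illustrative assumptions, not part of the algorithm as described.

```python
def cm_to_pixels(cm, dpi):
    """Convert a length in centimetres to pixels at the given resolution."""
    return cm / 2.54 * dpi


def split_by_height(boxes, dpi, max_text_height_cm=1.0):
    """Label each bounding box as text or non-text using only its height.

    boxes: list of (top, left, bottom, right) tuples in pixels (assumed format).
    Components shorter than the limit are treated as text, as in the rule above.
    """
    limit = cm_to_pixels(max_text_height_cm, dpi)
    text, non_text = [], []
    for box in boxes:
        top, _left, bottom, _right = box
        if (bottom - top) < limit:
            text.append(box)
        else:
            non_text.append(box)
    return text, non_text


# Hypothetical components from a 300 dpi scan (1 cm is about 118 pixels).
boxes = [
    (0, 0, 40, 30),      # a text character, 40 px tall
    (0, 0, 200, 300),    # a large halftone region, 200 px tall
    (10, 10, 15, 14),    # a 5 px speck left over from a halftone
]
text, non_text = split_by_height(boxes, dpi=300)
```

In this example the 5-pixel speck ends up in `text` alongside the real character, which is the kind of stray component the paragraph above describes.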
The next step in the document layout analysis is the segmentation of text into text blocks that could be provided as input to the OCR. The following algorithms have been studied for this:
The recursive XY cut algorithm is used to obtain text blocks from a document image from which the images of the original printed document have already been removed. The XY cut algorithm works in the following way:
One of the problems with the XY cut algorithm is that there is no method to find a threshold that works for all documents. Instead, a new threshold needs to be determined for each document, and this cannot be done without manual intervention.
Another major issue with the recursive XY cut algorithm is its time complexity: it takes a long time to complete execution. Despite these disadvantages, the algorithm successfully separates the text blocks, provided that a suitable threshold is supplied manually.
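The recursive XY cut can be sketched as follows. This is a minimal illustration of the usual projection-profile formulation, not necessarily the exact variant evaluated here. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background, and it uses a single hypothetical `min_gap` threshold for both directions, which is a simplification.

```python
def split_on_gaps(profile, min_gap):
    """Split a projection profile into (start, end) ink segments (end exclusive),
    cutting only where an empty run is at least `min_gap` long."""
    segments = []
    start = None   # start index of the current ink segment
    gap = 0        # length of the current empty run
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            elif gap >= min_gap:
                segments.append((start, i - gap))
                start = i
            gap = 0
        else:
            gap += 1
    if start is not None:
        segments.append((start, len(profile) - gap))
    return segments


def xy_cut(img, min_gap, top=0, bottom=None, left=0, right=None):
    """Return text blocks as (top, left, bottom, right), bottom/right exclusive."""
    if bottom is None:
        bottom = len(img)
    if right is None:
        right = len(img[0]) if img else 0
    # Projection profiles: ink count per row and per column of the region.
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    row_segs = split_on_gaps(rows, min_gap)
    col_segs = split_on_gaps(cols, min_gap)
    if not row_segs or not col_segs:
        return []                      # region contains no ink
    if len(row_segs) == 1 and len(col_segs) == 1:
        (r0, r1), (c0, c1) = row_segs[0], col_segs[0]
        return [(top + r0, left + c0, top + r1, left + c1)]
    blocks = []
    if len(row_segs) > 1:              # cut along horizontal white gaps first
        for r0, r1 in row_segs:
            blocks += xy_cut(img, min_gap, top + r0, top + r1, left, right)
    else:                              # otherwise cut along vertical white gaps
        for c0, c1 in col_segs:
            blocks += xy_cut(img, min_gap, top, bottom, left + c0, left + c1)
    return blocks


page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page, min_gap=2)
```

With `min_gap=2` the toy page splits into two upper blocks and one lower block. Raising the threshold to 3 merges everything into a single block, which illustrates how sensitive the result is to the threshold.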
The Run-Length Smoothing Algorithm (RLSA) works on black & white scanned images of documents. It finds runs of white pixels and converts them into black pixels whenever they are less than a given threshold. The RLSA works in four steps:
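The smoothing rule described above can be sketched for a single row of pixels. This is a minimal illustration, assuming 1 denotes a black pixel and 0 a white pixel. The function name is hypothetical, and leaving white runs that touch the row border untouched is an assumption of this sketch.

```python
def smooth_run(row, threshold):
    """Turn white runs (0) shorter than `threshold` into black (1).

    Only runs enclosed by black pixels on both sides are filled; runs that
    touch either end of the row are left as they are (assumption).
    """
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1                  # row[i:j] is one white run
            if i > 0 and j < n and (j - i) < threshold:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out


smoothed = smooth_run([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=3)
```

Here the white run of length 2 is filled, while the run of length 4 is kept because it is not shorter than the threshold.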
A simplified version of the RLSA, RLSO (Run-Length Smoothing with OR) works as follows:
RLSA returns rectangular frames for documents with Manhattan layouts. RLSO, on the other hand, also works well with non-Manhattan layouts. The problem with both RLSA and RLSO is that the smoothing threshold has to be set manually. The required threshold also differs for each document image, so choosing it by hand for every document is almost impossible.
We have compared the algorithms described above for document layout analysis. During our research we found that, while Bloomberg’s algorithm faces problems with images that contain sketches, CRLA faces problems with images that contain extremely small non-textual elements.
We also observed that neither the recursive XY cut algorithm nor RLSA works on printed documents with non-Manhattan layouts. The RLSO algorithm, on the other hand, gives comparatively better results for Manhattan as well as non-Manhattan layouts. However, all three algorithms share the problem that the threshold is document specific and must be determined manually.
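The difference between the two methods can be sketched in a few lines. The step lists are not reproduced in this text, so the sketch assumes the commonly cited formulations: RLSA combines horizontal and vertical smoothing with a logical AND and then applies one more horizontal pass, while RLSO combines the two smoothed images with a logical OR. All function and parameter names are hypothetical, and images are lists of rows with 1 for black.

```python
def smooth_run(line, threshold):
    """Fill enclosed white runs (0) shorter than `threshold` with black (1)."""
    out, n, i = list(line), len(line), 0
    while i < n:
        if line[i] == 0:
            j = i
            while j < n and line[j] == 0:
                j += 1
            if i > 0 and j < n and (j - i) < threshold:
                out[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return out


def smooth_rows(img, threshold):
    return [smooth_run(row, threshold) for row in img]


def smooth_cols(img, threshold):
    # Transpose, smooth each column as if it were a row, transpose back.
    cols = [smooth_run(col, threshold) for col in zip(*img)]
    return [list(row) for row in zip(*cols)]


def rlsa(img, t_h, t_v, t_final):
    """Assumed four steps: horizontal, vertical, AND, final horizontal pass."""
    h = smooth_rows(img, t_h)
    v = smooth_cols(img, t_v)
    both = [[a & b for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]
    return smooth_rows(both, t_final)


def rlso(img, t_h, t_v):
    """Assumed three steps: horizontal, vertical, OR."""
    h = smooth_rows(img, t_h)
    v = smooth_cols(img, t_v)
    return [[a | b for a, b in zip(rh, rv)] for rh, rv in zip(h, v)]


img = [
    [1, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
and_result = rlsa(img, t_h=2, t_v=2, t_final=1)
or_result = rlso(img, t_h=2, t_v=2)
```

On this toy image the AND keeps only pixels that both passes agree on, so the four corner pixels stay separate, whereas the OR joins them into a single frame. The thresholds are still supplied by hand in both cases.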