Wednesday, December 30, 2009

Optical Character Recognition (OCR) and Capture

So what is document capture software, and what does it have to do with OCR applications?  First, we need to differentiate between scanning software and capture software.  Here is a good blog post that goes over the differences with regard to SharePoint Scanning.  Scanning software just gives you the ability to convert paper to a digital form and then OCR it.  Capture software takes this a step further, and is really a catalyst for enhanced processing with your recognition engine.  Typical capture software will allow you to perform zone OCR, scan multiple documents in a single stack through separation, perform OCR-based separation, or even analyze the OCR text for expressions and then automatically extract the data.  Enhanced data extraction is one example of what document capture software provides, and it is offered by vendors like Kofax, AnyDoc, Captiva, etc.

So, I guess the whole point here is that OCR software in most cases just provides a basic framework for the conversion process.  You really need a capture application to harness the true power of any OCR or recognition engine.

Sunday, December 27, 2009

How do I pick the right OCR Software?

In the space of OCR Software, or Optical Character Recognition, it can be confusing, to say the least, to decide which option you should pick.  It really comes down to the use case, or how you will utilize the software.  Below are some great questions to ask yourself:

What do I need to convert with my OCR Software?
This question is very important, and it really comes down to what you are looking to output with your software.  Do you want a Word file that you can edit, or are you just looking to create a searchable PDF?  Some engines are tuned for accuracy and will give you the best formatted output; others are built for speed.  OmniPage is an excellent engine for creating nicely formatted output, but can be rather slow due to its focus on accuracy.  A production engine, like PSI:Capture, which offers multiple OCR choices, can give you great flexibility, no matter your output choice.

Are they pre-existing images, or ones that I will scan?  PDFs or TIFFs?
When you are choosing Optical Character Recognition Software, it is really important to make sure that you have all the functionality you require, whether you are scanning or just processing non-searchable PDFs from a directory.  Most OCR software will let you choose the file to perform recognition on, and some will also let you scan in paper for conversion.  If you are utilizing MFPs or scanning copiers, and want to perform OCR on the scanned documents, you may want to choose a product that performs auto-import, or one that is focused on MFP scanning.  You also want flexibility in the types of file you can process, so look for the ability to OCR any image type:  PDF, TIFF, JPG, GIF, BMP, etc.
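
To make the auto-import idea a bit more concrete, here is a minimal sketch of the first step such a product has to perform: looking in a drop folder and picking out the files it can OCR.  The function name and the extension list are assumptions made up for illustration, not taken from any particular product.

```python
from pathlib import Path

# Assumed list of extensions; adjust it to whatever your OCR engine accepts.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_importable_files(watch_dir):
    """Return the files directly inside watch_dir that have a supported
    extension (case-insensitive), sorted by path."""
    folder = Path(watch_dir)
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

A real auto-import would also need to wait until the MFP has finished writing a file before picking it up, which this sketch does not attempt.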

How fast can I do conversions?
So, some engines are built for OCR accuracy, others for speed in the OCR process.  Most of the desktop engines, like eCopy Desktop, provide a good mix of both.  Other engines, like Glyphreader or Docustar, provide the ability to choose whether you want speed or accuracy in your OCR results.  It is always good to choose a document capture option that gives you multiple OCR engine options to perform different recognition tasks.

How do I get the best accuracy in the OCR output?
All of the OCR Software mentioned within this post require a high-quality image for the best recognition accuracy.  With that said, high-quality scanning software with image processing options will lead to the best OCR accuracy when converting from image to text.  So what does image processing have to do with OCR Software?  The cleaner the image, the better the accuracy, and if you can deskew, despeckle, deshade and sharpen text, you will get better OCR results.
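
As a rough illustration of what one of these clean-up steps does, here is a minimal despeckle sketch.  It assumes the page has already been reduced to a black-and-white grid (1 for ink, 0 for background) and treats any ink pixel with no ink neighbours as noise.  Real image processing products use more sophisticated filters; this is only a simplified stand-in.

```python
def despeckle(image):
    """Remove isolated dark pixels from a binary image.

    image is a list of rows, each a list of 0/1 values (1 = ink).
    An ink pixel with no ink among its 8 neighbours is cleared.
    The input is not modified; a cleaned copy is returned.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned
```

The catch with a filter this simple is that it can also erase legitimate single-pixel marks at low scan resolutions, which is one reason products expose despeckle as a tunable option.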

Sunday, December 13, 2009

What is Zone OCR?

Zone OCR Software provides the ability to focus in on one or more sections (zones) of a scanned document or image.  Converting specific zones to text is an important optical character recognition feature, and one that can be applied in just about any business type.  Its main use is to harvest values from images and utilize them as index values, to provide search capability later.  Not all zone OCR engines are equal, and you typically need a very accurate engine to produce the required results.  Some accurate engines include Glyphreader, Recostar, Docustar and many others.

It is often imperative to "clean up" the zone prior to attempting the conversion to text.  Clean-up can include line removal, despeckle, deskew, etc., which are found in almost any product that provides OCR and Image Processing features.
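
Here is a minimal sketch of the zone idea: crop each named rectangle out of the page, hand only that region to the recognition engine, and keep the result under the zone's name so it can be used as an index value.  The `Zone` class and the `recognize` callable are hypothetical placeholders; in practice `recognize` would be a call into whatever OCR engine you use, and any clean-up would happen on the cropped region before that call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the page, in pixel coordinates (hypothetical type)."""
    name: str
    left: int
    top: int
    width: int
    height: int


def crop_zone(page, zone):
    """Return the part of the page covered by the zone.

    page is a sequence of rows; each row is a sliceable sequence of pixels.
    """
    rows = page[max(0, zone.top): zone.top + zone.height]
    return [row[max(0, zone.left): zone.left + zone.width] for row in rows]


def ocr_zones(page, zones, recognize):
    """Run recognize() on each cropped zone and return {zone name: text}."""
    return {
        zone.name: recognize(crop_zone(page, zone)).strip()
        for zone in zones
    }
```

Because only the cropped region reaches the engine, the amount of work per page depends on the size of the zones rather than the size of the page.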

Monday, December 7, 2009

Open Source OCR Software

The open source movement has created some great OCR Software / Optical Character Recognition Software.  Below are links and info:

OCRopus OCR Software
This is a project sponsored by Google, and is a state-of-the-art OCR application.  It is focused on high-volume OCR needs, and includes a conversion engine, layout analysis, modeling and multi-lingual capabilities.

OCRopus OCR Software Download

GOCR OCR Application
Developed under the GNU General Public License, it can be used with various front ends to convert images to text, and it accepts different image formats.

GOCR OCR Application Download

Tesseract OCR Engine
An engine originally developed by HP between 1985 and 1994, when OCR Software was in its infancy.  Google uses the engine in its OCRopus project.  Document Capture companies like PSIGEN have made the Tesseract Engine an option for advanced capture.

Tesseract OCR Engine Download
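
If you want to try Tesseract from a script, a minimal sketch might look like the following.  It assumes the `tesseract` command-line tool is installed and on your PATH, and that your release supports the `pdf` output config (older releases may only write plain text).  The two helper function names are made up for illustration.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, lang="eng", searchable_pdf=False):
    """Build the argument list for the tesseract command-line tool.

    tesseract appends the extension itself, so output_base has none.
    """
    command = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if searchable_pdf:
        # The 'pdf' config asks for output_base.pdf instead of output_base.txt.
        command.append("pdf")
    return command


def run_tesseract(image_path, output_base, **options):
    """Run tesseract if it is installed; raise a clear error otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, **options), check=True)
```

Calling `run_tesseract("scan.tif", "scan")` would then be expected to leave the recognized text in `scan.txt`.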

Saturday, December 5, 2009

What is OCR Software?

OCR Software and Application Features

We typically use the term OCR (Optical Character Recognition) to mean the conversion of an image to text. There are several OCR functions that are utilized in advanced document capture software, such as PSIGEN PSI:Capture, and your everyday scanning software, such as VizitSP.

For an in-depth definition, go here ->  Wikipedia OCR Software

So let’s talk a bit about the different OCR functionalities:

Full Text OCR

Full Text OCR takes the entire image and converts it to a text output. The OCR output can be in several formats, including: Searchable PDF, Microsoft Word, Plain Text, HTML, etc. The main goal of full text OCR is typically “searchability”, and the results are usually placed into a backend repository, such as Microsoft SharePoint (For more about SharePoint Scanning and Capture – SharePoint Scanning).

Zone OCR

Zone OCR only looks at a particular region, or zone, of the scanned page and converts just that portion to text. There are several reasons to use zone OCR rather than full text:

• You only want to search on the information in that zone.

• Full Text OCR takes much longer, so Zone OCR can speed up processing time.

• You want to extract the contents of the zone, and place it into an index field.

Most advanced document capture applications provide the ability to map the contents of a zone to an index field, which can then provide granular search capabilities based only on that field.
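
To show why index fields give you more granular search than full text, here is a minimal sketch of a field-based lookup.  It assumes each document already has a small dictionary of field names and zone OCR values; the function names and the sample field names in the usage note are hypothetical, and a real repository such as SharePoint would handle this for you.

```python
from collections import defaultdict


def build_field_index(documents):
    """Map (field, normalized value) to the ids of documents with that value.

    documents: {doc_id: {field name: text extracted from a zone}}
    """
    index = defaultdict(list)
    for doc_id, fields in documents.items():
        for field, value in fields.items():
            index[(field, value.strip().lower())].append(doc_id)
    return index


def search(index, field, value):
    """Return ids of documents whose given field equals value (case-insensitive)."""
    return index.get((field, value.strip().lower()), [])
```

A search for "acme" in a `vendor` field only matches documents where the vendor zone contained that value, not every page that happens to mention the word.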

OCR-Assisted Indexing

OCR-Assisted Indexing, or point-and-click indexing, provides the user with the capability to just click on words or segments of a document and convert that image portion to text. This capability exists in many different capture applications, and provides a simple, easy indexing function on documents.

Rubberband OCR

Rubberband OCR provides the ability to drag a box with the mouse over a portion of text, and automatically convert that segment into text, and even place it into an index field. It is similar to OCR-Assisted Indexing, but allows the capture of large portions of text on a scanned image.
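
One way a rubberband selection could work is sketched below.  It assumes the OCR engine has already returned word-level bounding boxes for the page (some products may instead crop the dragged region and OCR it on demand).  The `Word` class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A recognized word and its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def words_in_box(words, left, top, right, bottom):
    """Return the text of all words whose centre lies inside the dragged box.

    Words are joined top-to-bottom, then left-to-right. Sorting on the raw
    'top' value assumes words on one line share the same top coordinate.
    """
    selected = [
        w for w in words
        if left <= (w.left + w.right) / 2 <= right
        and top <= (w.top + w.bottom) / 2 <= bottom
    ]
    selected.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in selected)
```

The returned string is what would be dropped into the index field once the user releases the mouse.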

OCR Separation

One of the key challenges in document scanning and capture is the ability to easily split a stack of paper into individual documents. Advanced Document Capture Software can provide the ability to split whenever a key term or word is found on a page through the OCR process. Utilizing the Optical Character Recognition engine in this manner can save on document preparation time before scanning and capture.
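
The splitting logic itself can be sketched in a few lines.  This assumes you already have the OCR text for every page in the stack, in scan order, and that a page containing the key term is the first page of a new document.  The function name is made up for illustration.

```python
def split_on_key_term(page_texts, key_term):
    """Group page numbers into documents.

    page_texts: OCR text of each scanned page, in scan order.
    A page containing key_term (case-insensitive) starts a new document;
    any pages before the first match form their own leading document.
    Returns a list of documents, each a list of 0-based page numbers.
    """
    needle = key_term.lower()
    documents = []
    for page_number, text in enumerate(page_texts):
        starts_new_document = needle in text.lower()
        if starts_new_document or not documents:
            documents.append([])
        documents[-1].append(page_number)
    return documents
```

In practice you would pick a term that reliably appears only on first pages, since a continuation page that repeats it would trigger a false split.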

Advanced Data Extraction

Many of the Document Capture applications on the market today provide a means of extracting data through some type of expression matching, or extraction engine. OCR Software is utilized to do the text conversion prior to the extraction.  For an example of data extraction, see this YouTube Video - EOB Processing and Data Extraction.
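
A minimal sketch of expression matching over OCR text is shown below, using regular expressions.  The field names and patterns are hypothetical examples; real documents need expressions tuned to their own layouts, and commercial extraction engines go well beyond this.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\w+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b"),
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(ocr_text):
    """Return {field name: first matching value} for every pattern that matches."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group(1)
    return fields
```

Note that the quality of the match depends entirely on the quality of the OCR text: a misread character in "Invoice" or in the number itself means the expression will miss or return a wrong value.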

Forms Identification

Another key use of the OCR results can be the identification of documents. Optical Character Recognition can identify key elements on a document, and then determine how to process it based on those elements.
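
As a simplified illustration, forms identification can be sketched as a keyword lookup over the OCR text: each form type has a few key phrases, and the type with the most phrases found on the page wins.  The rule table and function name here are hypothetical, and real products typically combine text with layout and other cues.

```python
# Hypothetical rules: form type -> key phrases expected on that form.
FORM_RULES = {
    "invoice": ["invoice", "amount due"],
    "purchase_order": ["purchase order", "ship to"],
    "eob": ["explanation of benefits", "claim number"],
}


def identify_form(ocr_text, rules=FORM_RULES, min_hits=1):
    """Return the form type with the most key phrases found in the text.

    Returns None if no type reaches min_hits. Ties go to the type listed
    first in the rules.
    """
    text = ocr_text.lower()
    best_type, best_hits = None, 0
    for form_type, phrases in rules.items():
        hits = sum(1 for phrase in phrases if phrase in text)
        if hits > best_hits:
            best_type, best_hits = form_type, hits
    return best_type if best_hits >= min_hits else None
```

Once the form type is known, the capture application can pick the matching set of zones or extraction expressions for that document.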