Saturday, January 30, 2010

OCR Software versus Document Capture Software

So all OCR Software companies provide the ability to convert scanned files into text or searchable PDFs via the Optical Character Recognition process, but how do I capture/scan the images so the applications can do their conversion?

This is an interesting question.  Let's talk about Document Capture first.  This type of application is built from the ground up to scan/capture documents at a high rate of speed, collect information about the documents in a number of ways, and then export the documents and data to a back-end repository.  All document capture companies provide all types of OCR options, and usually OEM their OCR, ICR and OMR components from the major OCR application companies, such as ABBYY, OpenText, Nuance and ReadSoft.  Most of these OCR companies have diversified their offerings to include document capture, but in my opinion those offerings fall way short on the capture side...they are OCR companies.

The real goal here is to get the best OCR results possible through a powerful OCR engine, and also to minimize the time you spend scanning and processing by using the best document capture software.  So, if you are looking to do high volume OCR processing, I highly recommend choosing a capture application that utilizes your OCR engine of choice to get the best of both worlds.  I will write more on this topic in upcoming posts.  If you want some guidance on choosing an engine, see "How do I pick the right OCR Software?".

Tuesday, January 26, 2010

Microsoft SharePoint and OCR

Scanning with Microsoft SharePoint is an interesting endeavor, and typically the main reason for this undertaking is to have a searchable body of information.  So what type of Optical Character Recognition (OCR) Software can be utilized with SharePoint?   First of all, all the same rules apply in picking the right recognition software to do the conversion from image to text, as outlined in "How do I pick the right OCR Software?".  You need to evaluate what you are trying to accomplish and look at your business process and workflows to get a good idea of how to initiate the conversion process.  Below are some key questions when evaluating a SharePoint OCR Solution.

Are your paper images scanned en masse, through a centralized capture process?

If this is the case, you would typically do all of your OCR processing and recognition in front-end document capture software.  These applications provide the fastest OCR engines, and their recognition throughput can be anywhere from 100 to 600 pages per minute, depending on the types of pages you are scanning.
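To put that throughput range in perspective, here is a quick back-of-the-envelope sketch.  The batch size of 30,000 pages is a made-up figure for illustration, and the function name is my own; only the 100 and 600 pages-per-minute endpoints come from the range quoted above.

```python
def processing_minutes(page_count, pages_per_minute):
    """Estimate recognition time for a batch at a steady throughput."""
    return page_count / pages_per_minute


batch = 30_000  # hypothetical batch size, not a figure from any vendor

slow = processing_minutes(batch, 100)  # low end of the quoted range
fast = processing_minutes(batch, 600)  # high end of the quoted range

print(f"At 100 ppm: {slow:.0f} minutes")  # 300 minutes
print(f"At 600 ppm: {fast:.0f} minutes")  # 50 minutes
```

The same batch takes six times longer at the low end of the range, which is why the mix of page types matters when sizing a centralized capture process.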

Do you utilize MFPs / Copiers to scan documents to SharePoint?

Most companies are trying to leverage their investment in their copier hardware to provide end users a great scanning and capture onramp to SharePoint.  In this case, you typically want an OCR application that can provide recognition on the fly, and do the conversion process behind the scenes.  There are many MFP-integrated applications on the market that can provide the OCR engine: iCapture, NSI AutoStore and eCopy, to name a few.

Do the end users compile, combine and work with documents at their desktops?

In environments where end users are constantly working in their documents and need desktop scanning access, an OCR desktop application is typically the best solution.  These applications put control of the conversion process in the end user's hands, and provide OCR capability at the click of a mouse.  Some apps in this class are eCopy Paperworks, PaperPort and OmniPage.

Do you want OCR PDFs in SharePoint?

Knowing what format you want and how you want to search can be critical, and having OCR PDFs in SharePoint can allow for full text search.

All of the OCR Solutions on this page focus on doing the process before the documents hit SharePoint.  I will write an article later on solutions that can OCR documents within SharePoint Libraries.

Saturday, January 23, 2010

What is OCR, ICR and OMR?

In the area of text conversion, there is often confusion about the acronyms that surround the industry, and what each one designates.  Below are some quick overviews of each of the recognition technologies, and what they accomplish:

Optical Character Recognition (OCR) Software

OCR Software takes images and converts them to searchable text.  The output can be a plain text file or, as is the industry standard today, an image PDF with a hidden text layer.  OCR can also be utilized to extract data from scanned images, providing a means to either harvest information, or create index fields for later search.

Intelligent Character Recognition (ICR) Software

ICR Software provides the ability to recognize handwritten or hand-printed text.  This process can be extremely accurate when the printed text is bound by boxes, or combed form fields.  Handwriting is a little more complex, and typically requires many samples to be accurate.

Optical Mark Recognition (OMR) Software

OMR Software, sometimes called "mark sense", provides the ability to read checked boxes on forms or documents.  The software senses the difference between an unmarked and a marked box using a baseline reading, and then allows the recognition to take place.
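As a rough illustration of the baseline idea, here is a minimal sketch that assumes the checkbox region has already been cropped and binarized (1 = dark pixel, 0 = light).  The function names and the 0.15 margin are my own assumptions for the example, not how any particular OMR engine works.

```python
def fill_ratio(box):
    """Fraction of dark pixels (value 1) in a binarized box region."""
    total = sum(len(row) for row in box)
    dark = sum(sum(row) for row in box)
    return dark / total if total else 0.0


def is_marked(box, baseline_ratio, margin=0.15):
    """Call the box marked when its fill exceeds the unmarked baseline by `margin`.

    The margin is an assumed tolerance for scanner noise, not a standard value.
    """
    return fill_ratio(box) > baseline_ratio + margin


# An empty checkbox: only the printed border is dark.
empty_box = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

# The same checkbox with an "X" drawn inside it.
checked_box = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]

# Baseline reading taken from a known-blank form.
baseline = fill_ratio(empty_box)

print(is_marked(empty_box, baseline))    # False
print(is_marked(checked_box, baseline))  # True
```

Taking the baseline from a blank copy of the form means the printed border of the box does not count as a mark by itself.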

Many manufacturers combine all three into a single recognition engine that provides powerful analysis capabilities for scanned documents and forms.

Saturday, January 16, 2010

OCR Software and Image Processing

Why is image processing so important when utilizing Optical Character Recognition Software?

In order to get the highest possible accuracy with your OCR Application, the recognition process needs to have a clean image to examine.  The most important image processing functions are auto-orientation, deskew and despeckle.  The auto-orientation process examines the page, and makes sure it is oriented correctly for the whole recognition process.  Deskew examines the page for any skewing, which may occur during the scan process, and "rights" the page to make sure the text is in line throughout the page.  Despeckle takes away any speckles on the page that can be falsely identified as characters, or that can cause real characters to be misread.
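To make despeckle a little more concrete, here is a minimal sketch of one simple approach: on a binarized page (1 = dark, 0 = light), drop any dark pixel that has no dark neighbors.  This is an assumption-laden toy, not the filter a commercial engine uses; real products work on larger speckles and tune the size threshold.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated dark pixels removed."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Count dark pixels among the up-to-eight surrounding neighbors.
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors == 0:
                cleaned[y][x] = 0  # a lone pixel is treated as noise

    return cleaned


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # lone speckle at row 1, column 1
    [0, 0, 0, 0, 1, 0],  # vertical stroke in column 4 (part of a character)
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]

for row in despeckle(page):
    print(row)
```

The lone pixel is cleared while the three-pixel stroke survives, because each pixel in the stroke has at least one dark neighbor.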

Older documents may require other functions, such as font improvement and deshading, to ensure the highest possible accuracy in the overall OCR process.