OCR Software: Optical Character Recognition


Sunday, June 10, 2012

OCR and SharePoint: What features do I need?

As many organizations go down the road to place scanned documents into SharePoint, there are several areas of key focus. A little planning will help to leverage OCR technology, and pre-OCR documents before they are placed in a SharePoint library as PDFs. So what is the true value of OCR in any SharePoint deployment? It all depends on what you are trying to achieve. The Scanning with SharePoint BLOG has a great post on what to evaluate before you start the scanning process: How do you want to find your documents in SharePoint? Below are some ways to utilize OCR, and some definitions of key types:

Full Text OCR - Optical Character Recognition (OCR) for SharePoint is typically associated with converting an image to full text. When you scan a document, it is a pure image: the text within is not searchable, and you cannot copy and paste it. The OCR process can give you PDFs that can be indexed by SharePoint Search. Is Full Text OCR Necessary? Read the link for some thoughts.
Zone OCR - Zone OCR can be utilized to extract information from a specific location on a repeatable form. The information collected can be automatically entered into a SharePoint column. This is a huge time saver if you need to automatically collect information from a large volume of forms, and OCR by zone can really help speed up the process.
Advanced Data Extraction (ADE) - This is the ultimate in efficiency and automation, and only a few apps give you this OCR functionality without an exorbitant cost. In a nutshell, ADE provides pattern matching for information extraction. So if you are looking for a 6-digit number, it auto-extracts that information. During the OCR process, ADE adds to accuracy and speed by finding only what you need. inForm has a great product for SharePoint Capture and OCR that can provide a robust ADE engine.
Point and Click OCR - Point and Click OCR allows you to use the mouse to choose what you want to put into a SharePoint field. The images are pre-OCR'd, or the process is performed in real time, to give you the desired information.
Rubberband OCR - This method of OCR processing allows you to drag your mouse over an area of text and auto-enter the data into a SharePoint column. It is great for information that spans multiple lines, and can convert the text in the image quite easily.
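
To make the ADE idea of "looking for a 6-digit number" a little more concrete, here is a minimal Python sketch of pattern matching over text that has already been through OCR. The pattern, the function name and the sample text are all made up for illustration; this is not how any particular capture product implements ADE.

```python
import re

# Hypothetical pattern: a standalone 6-digit number (for example an invoice
# or account number). The lookarounds reject digits that are part of a
# longer number.
SIX_DIGIT = re.compile(r"(?<!\d)\d{6}(?!\d)")

def extract_six_digit_numbers(ocr_text: str) -> list[str]:
    """Return every standalone 6-digit number found in the OCR text."""
    return SIX_DIGIT.findall(ocr_text)

# Made-up OCR output for illustration only.
sample = "Invoice No. 482913 dated 06/10/2012, PO 1234567, total 150.00"
print(extract_six_digit_numbers(sample))  # ['482913']
```

The 7-digit PO number and the date are skipped because only an exact run of six digits matches, which is the "finding only what you need" part.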

Tuesday, May 15, 2012

What types of OCR Software are there?

In examining Optical Character Recognition (OCR) software, you need to examine your needs and determine what type you require.

Desktop OCR Software

For day to day use, most users will utilize Desktop OCR software. It is appropriate for converting scanned documents to Word format, copying and pasting sections from documents, etc. Apps that fall into this category are OmniPage, PaperPort, etc.

Batch OCR and Capture Software

If you are processing large volumes of documents, and need to enable a process or workflow with scanners in your company, typically you will utilize document capture software with enhanced OCR capabilities. This type of OCR Software takes processing to the next level and uses automation to extract information from the documents, as well as make them searchable PDF documents. An example of this type of software is inForm's iCapture.

Wednesday, May 9, 2012

OCR and PC Architecture

So just how important is your PC hardware when looking to use OCR Software? Many of the desktop products do not take advantage of multi-core CPUs, and can post lagging performance numbers when it comes to Optical Character Recognition, Intelligent Character Recognition and Optical Mark Recognition. I am currently playing with PSI:Capture, which offers a number of OCR options and has single, dual and quad core enablement in its licensing. Dual core runs about 1.7 times the speed, and quad core gives a 2.7x improvement.
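
As a quick back-of-the-envelope check on those numbers, the short Python sketch below works out how much of each extra core is actually turned into speed. It assumes the 1.7x and 2.7x figures are measured against a single core baseline; the function name is just for illustration.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / cores

# Figures quoted above, assumed to be relative to a single core run.
for cores, speedup in [(2, 1.7), (4, 2.7)]:
    eff = parallel_efficiency(speedup, cores)
    print(f"{cores} cores: {speedup}x speedup, {eff:.1%} efficiency")
```

That works out to roughly 85% efficiency on dual core and 67.5% on quad core, so the gains are real but taper off as cores are added.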

Tuesday, February 23, 2010

OCR Software and Character Correction

Optical Character Recognition and Character Correction

So what is character correction when associated with OCR? The OCR process recognizes images and converts them to text, and along the way many characters can be misidentified. Typically, document capture applications provide the ability to identify commonly misinterpreted characters through a table of correction mappings. So let's say a particular zone OCR field was designated as numbers only, and the engine read a "1" as an "l" (that is, a lowercase L in place of a one). The correction piece of the recognition engine can apply logic to the OCR process and make sure the text is properly interpreted. This can be really important, especially in SharePoint OCR environments where you need searchable PDFs in SharePoint.

This is just one of many ways to improve accuracy, but note you will need the right kind of OCR application that allows this feature to be enabled.
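
Here is a minimal sketch, in Python, of what a table of correction mappings for a numbers-only field could look like. The mapping itself is a made-up example of commonly confused characters, not the table used by any specific product.

```python
# Hypothetical correction table for a field designated as numbers only:
# letters that OCR engines commonly confuse with digits.
DIGIT_CORRECTIONS = str.maketrans({
    "l": "1", "I": "1",   # lowercase L and capital i read in place of one
    "O": "0", "o": "0",   # letter O read in place of zero
    "S": "5",
    "B": "8",
})

def correct_numeric_field(raw: str) -> str:
    """Apply the correction table to the raw OCR text of a numeric zone."""
    return raw.translate(DIGIT_CORRECTIONS)

print(correct_numeric_field("l0O45S"))  # 100455
```

A table like this only makes sense when the field type is known up front; applied to free text it would corrupt ordinary words.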

Tuesday, February 16, 2010

Zone OCR and Accuracy within Recognition Zones

Zone OCR Accuracy

So when doing zone OCR, or Optical Character Recognition on a portion of a page, what features do I need to ensure the best possible accuracy? They are listed below:

Utilize a document capture application that provides some type of page registration. The problem with using zone OCR is that most engines utilize a set template of coordinates on the page, and just repeat this "zone" on each page. If the scanner is off, or the page skewed, you can have erroneous readings. Page registration gives the recognition engine the ability to anchor to a page feature, always referencing the zone from the set coordinates of that feature.
Utilize a scanning application that provides the ability to perform image processing on the zone prior to running Optical Character Recognition. Line removal, deshading and despeckling can provide a cleaner zone, and thus improve overall accuracy.
Some advanced capture applications provide the ability to filter zones based on character sets. This allows you to interpret the characters within a zone as, say, all numbers, or perhaps a date, which gives the engine a narrower character set for the whole recognition process. iCapture, for example, not only allows character set mapping to zone OCR templates, but also provides auto-correction for the most commonly misinterpreted characters.
Finally, and highly recommended for the highest level of accuracy, is the ability to set a character matching filter for a zone. This technology, sometimes called ADE, provides the ability to utilize regular expressions to ensure a match, and lets you overdraw the recognition area / zone and filter to your liking.
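
The page registration idea in the first point can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Zone` class, the function name and the coordinates are hypothetical, the anchor feature is assumed to have been located already, and only a shift of the page is handled (a rotated or skewed page would also need deskewing).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    left: int
    top: int
    width: int
    height: int

def register_zone(zone: Zone, template_anchor: tuple[int, int],
                  found_anchor: tuple[int, int]) -> Zone:
    """Shift a template zone by the offset between where the anchor feature
    sits on the template and where it was found on the scanned page."""
    dx = found_anchor[0] - template_anchor[0]
    dy = found_anchor[1] - template_anchor[1]
    return Zone(zone.left + dx, zone.top + dy, zone.width, zone.height)

# Made-up numbers: the anchor (say, a logo corner) was at (50, 50) on the
# template but shows up at (62, 45) on this scan.
invoice_number_zone = Zone(left=400, top=120, width=300, height=40)
print(register_zone(invoice_number_zone, (50, 50), (62, 45)))
```

The zone keeps its size and simply follows the anchor, so a page that fed through the scanner a little off still gets read in the right place.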

Saturday, February 13, 2010

Why use OCR Software to perform full text conversion of images?

OCR Software

When we scan documents, they are just images, pictures of our paper. For many organizations, this scanned image is exactly what they need, and a little index information about the document is sufficient to provide them with retrieval capability.

So why take the time and spend the money to utilize OCR Software to convert the scanned document to a searchable format? Below are some reasons to always perform full text OCR of scanned documents:

Always provide every means possible for retrieval. Just using index fields to search for scanned documents may seem like a fantastic idea, but what if the document is misidentified? Or the indexer enters incorrect information? Performing a full text OCR of the document can provide an insurance policy that a document can always be found through full text search.
Document Capture software today provides fast reliable OCR. Most capture software on the market provides the ability to automatically convert the documents to searchable format for a small expense. Some of the engines on the market can do the conversion at 100+ pages per minute, so there is really not much time wasted in the OCR conversion / recognition process.
OCR to PDF for a format that contains both image and text in one container. Adobe provides the PDF image with hidden text option to give you a searchable file format that contains a pristine image.
Plan for the worst case. Audits...legal issues...sometimes you need to search beyond the index fields, and full text can give you the ability to find the needle in the haystack.

OCR applications give you the means and capabilities to convert images to searchable formats, and there are many reasons to do the full text conversion.

Tuesday, February 9, 2010

OCR Software - Distributed vs. Centralized


Ah, the centralized versus distributed question...it is one that is continually asked in the scanning, capture and document capture space. Most associate OCR Software with familiar desktop applications like eCopy Desktop, OmniPage, PaperPort, etc. These provide, in a way, distribution of the overall OCR process to end users.

There are applications on the market that can provide centralized and controlled OCR capabilities, through either a server or a workstation deployment. One example is PSI:Capture from PSIGEN, an advanced document capture application that allows centralized OCR processing. Why would you want to do this? Well, in most cases, this type of OCR deployment model is utilized in conjunction with a document capture system, for centralized capture, indexing, QA, OCR and migration to a centralized DM / ECM system. Typically, these systems give a broad and expansive feature set, providing all different types of OCR functionality.

Tuesday, February 2, 2010

How fast is OCR Software? OCR Performance Testing

OCR Performance Testing

So which Desktop Optical Character Recognition Software is the fastest? Has the best overall performance when converting images to Word? When converting images to PDF?

I ran some testing with 4 basic desktop OCR applications to see which would have the fastest conversion times. The OCR applications are:

-eCopy Desktop (Uses the ReadIRIS OCR Engine)

-Adobe 8

-PaperPort 11

-OmniPage 15

I ran all the tests on a 9 month old laptop, with a Dual Core 2 GHz processor, and 2 GB of memory. I utilized all the "out of the box" settings on the apps, with no performance tuning of settings, and I timed the speed of the applications to convert a 100 page TIFF image to Word and to Adobe Image and Text PDF.

Results of the OCR Speed Test in minutes and seconds (Word/PDF):

eCopy Desktop 4:25/2:58

Adobe 8 3:54/3:22

OmniPage 15 2:16/2:16*

PaperPort 11 2:35**

*With OmniPage you run the conversion process and then save to your preferred format.

**PaperPort just had text conversion capabilities.

I have to note that the eCopy Desktop test can be misleading in that it performs auto-orientation on all the pages before performing OCR. Also note that when evaluating an OCR application, speed is not the only factor. You need to decide up front whether you want speed, accuracy, both, or want to focus on formatting. I will write another article on formatting and which application is best in the near future.
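
For easier comparison, the timings above can be converted to pages per minute. The Python sketch below just does the arithmetic on the 100 page test; the helper names are for illustration, and the PaperPort entry is labeled "Text" since it only had text conversion.

```python
PAGES = 100  # size of the test TIFF described above

def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' timing such as '4:25' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def pages_per_minute(mmss: str, pages: int = PAGES) -> float:
    """Throughput for a run that converted `pages` pages in `mmss`."""
    return pages * 60 / to_seconds(mmss)

# Timings copied from the results above.
results = {
    "eCopy Desktop": {"Word": "4:25", "PDF": "2:58"},
    "Adobe 8": {"Word": "3:54", "PDF": "3:22"},
    "OmniPage 15": {"Word": "2:16", "PDF": "2:16"},
    "PaperPort 11": {"Text": "2:35"},
}

for app, timings in results.items():
    for fmt, mmss in timings.items():
        print(f"{app:14} {fmt:5} {mmss}  {pages_per_minute(mmss):5.1f} pages/min")
```

On these numbers the desktop apps land somewhere between roughly 23 and 44 pages per minute on the test laptop.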

Tuesday, January 26, 2010

Microsoft SharePoint and OCR


Scanning with Microsoft SharePoint is an interesting endeavor, and typically the main reason for this undertaking is to have a searchable body of information. So what type of Optical Character Recognition (OCR) Software can be utilized with SharePoint? First of all, all the same rules apply in picking the right recognition software to do the conversion from image to text, as outlined in "How do I pick the right OCR Software?". You need to evaluate what you are trying to accomplish and look at your business process and workflows to get a good idea of how to initiate the conversion process. Below are some key questions when evaluating a SharePoint OCR Solution.

Are your paper images scanned en masse, through a centralized capture process?

If this is the case, you would typically do all of your OCR processing and recognition in front end document capture software. These applications provide the fastest OCR engines, and their recognition processing speed can be anywhere from 100-600 pages per minute, depending on the types of pages you are scanning.

Do you utilize MFPs / Copiers to scan documents to SharePoint?

Most companies are trying to leverage their investment in their copier hardware to provide end users a great scanning and capture onramp to SharePoint. In this case, you typically want an OCR application that can provide recognition on the fly, and do the conversion process behind the scenes. There are many MFP integrated applications on the market that can provide the OCR engine: iCapture, NSI AutoStore and eCopy, to name a few.

Do the end users compile, combine and work with documents at their desktops?

In environments where end users are constantly working in their documents, and need desktop scanning access, typically an OCR Desktop application can be the best solution. These applications can put the control of the conversion process in the end user's hands, and can provide them OCR capability at the click of the mouse. Some apps in this class are eCopy Paperworks, PaperPort and OmniPage.

Do you want OCR'd PDFs in SharePoint?
Knowing what format you want and how you want to search can be critical, and having OCR'd PDFs in SharePoint can allow for full text search.

All of the OCR Solutions on this page focus on doing the process before the documents hit SharePoint. I will write an article later on solutions that can OCR documents within SharePoint Libraries.

Saturday, January 16, 2010

OCR Software and Image Processing


Why is image processing so important when utilizing Optical Character Recognition Software?

In order to get the highest possible accuracy with your OCR Application, the recognition process needs to have a clean image to examine. The most important image processing functions are auto-orientation, deskew and despeckle. The auto-orientation process examines the page, and makes sure it is oriented correctly for the whole recognition process. Deskew examines the page for any skewing, which may occur during the scan process, and "rights" the page to make sure the text is in line throughout the page. Despeckle takes away any speckles on the page that can be falsely identified as font characters or can contribute to misreads of real characters.

Older documents may require other functions, such as font improvement and deshading, to ensure the highest possible accuracy in the overall OCR process.
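
To show what despeckle is doing under the hood, here is a small pure Python sketch that treats a page as a grid of 0s (white) and 1s (black) and clears any connected group of black pixels at or below a size threshold. Real products work on actual image data with far more tuned filters; the function name, the threshold and the tiny sample grid are assumptions for illustration, and the grid is assumed to be non-empty.

```python
def despeckle(image: list[list[int]], max_speck_size: int = 2) -> list[list[int]]:
    """Return a copy of a 0/1 pixel grid with small isolated groups of
    black pixels (1s) cleared. Groups are found with 8-connectivity."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill to collect one connected group of black pixels.
            component, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            # Small groups are treated as specks and erased.
            if len(component) <= max_speck_size:
                for y, x in component:
                    cleaned[y][x] = 0
    return cleaned

# Tiny made-up "page": a 2x3 block standing in for a character, plus three
# stray single-pixel specks in the corners.
page = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
for row in despeckle(page):
    print("".join("#" if px else "." for px in row))
```

The threshold is the trade-off: set it too high and periods, commas and the dots on an "i" start disappearing along with the specks.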

Wednesday, December 30, 2009

Optical Character Recognition (OCR) and Capture


So what is document capture software, and what does it have to do with OCR applications? First, I think we need to differentiate between scanning software and capture software. Here is a good blog post that goes over the differences, with regards to SharePoint Scanning. Scanning Software just gives you the ability to convert paper to a digital form, and then OCR. Capture Software takes this a step further, and is really a catalyst for some enhanced processing with your recognition engine. Typical capture software will allow you to perform zone OCR, scan multiple documents in a single stack through separation, perform OCR based separation or even analyze the OCR text for expressions and then automatically extract the data. Document Capture software provides enhanced data extraction, as an example, as do other vendors like Kofax, AnyDoc, Captiva, etc.

So, I guess the whole point here is that OCR software in most cases just provides a basic framework for the conversion process. You really need a capture application to harness the true power of any OCR or recognition engine.

Sunday, December 27, 2009

How do I pick the right OCR Software?

In the space of OCR Software, or Optical Character Recognition, it can be confusing, to say the least, to decide which option you should pick. It really comes down to the use case, or how you will utilize the software. Below are some great questions to ask yourself:

What do I need to convert with my OCR Software?
This question is very important, and it really comes down to what you are looking to output with your software. Do you want a Word file that you can edit, or are you just looking to create a searchable PDF? Many engines are tuned for accuracy and will give you the best formatted output; others are built for speed. OmniPage is an excellent engine for creating nicely formatted output, but can be rather slow due to its focus on accuracy. A production engine, like PSI:Capture, which offers multiple OCR choices, can give you great flexibility, no matter your output choice.

Are they pre-existing images, or ones that I will scan? PDFs or TIFFs?
It is really important when you are choosing Optical Character Recognition Software, to make sure that you have all the functionality you require, whether you are scanning, or just processing non-searchable PDFs from a directory. Most of the OCR Software will let you choose the file that you perform recognition on, and others will let you scan in paper for conversion. If you are utilizing MFPs or Scanning copiers, and want to perform OCR on the scanned documents, you may want to choose a product that performs auto-import, or one that is focused on MFP Scanning. Also, you want flexibility in the types of file you can process, and want to be able to OCR any image type: PDF, TIFF, JPG, GIF, BMP, etc.
How fast can I do conversions?
So, some engines are built for OCR accuracy, others for speed in the OCR process. Most of the desktop engines, like eCopy Desktop, provide a good mix of both. Other engines, like Glyphreader or Docustar, provide the ability to choose whether you want speed or accuracy in your OCR results. It is always good to choose a document capture option that allows you multiple OCR engine options to perform different recognition tasks.

How do I get the best accuracy in the OCR output?
All of the OCR Software mentioned within this post requires a high quality image for the best recognition accuracy. With that said, high quality scanning software with image processing options will lead to the best OCR accuracy when converting from image to text. So what does image processing have to do with OCR Software? The cleaner the image, the better the accuracy, and if you can deskew, despeckle, deshade and sharpen text, you will get better OCR results.

Sunday, December 13, 2009

What is Zone OCR?


Zone OCR Software provides the ability to focus in on just a single, or multiple, sections (zones) of a scanned document or image. Converting specific zones to text is an important optical character recognition feature set, and one that can be applied in just about any business type. Its main use is to harvest values from images, and utilize them as index values, to provide search capability later. Not all zone OCR engines are equal, and you typically need a very accurate engine to produce the required results. Some accurate engines include Glyphreader, Recostar, Docustar and many others.

It is often imperative to "clean up" the zone prior to attempting the conversion to text. Clean up can include line removal, despeckle, deskew, etc., which are found in almost any product that provides OCR and Image Processing features.

Sunday, November 29, 2009

What is OCR? (Optical Character Recognition)

OCR Definition

OCR, or Optical Character Recognition, is a function of certain software applications that provides the means to convert images, or portions of images, to text. Scanned documents are almost always created as non-text image formats, such as TIFF, PDF, JPG, etc. The process of basic OCR makes them searchable, and thus more useful when you require the ability to search the contents of scanned documents. The core system uses a combination of pattern recognition and artificial intelligence to interpret the images, and create the most accurate output. Many of the more popular engines provide the ability to output not only to text, but also to word processor formats, HTML, PDF, etc.

OCR Software


This is a new blog dedicated to OCR Software, OCR Technologies and Optical Character Recognition Software review. It covers topics like SharePoint OCR.