Wednesday, February 24, 2010

OCR Software Post

OCR Software for Business


8 Things About OCR

Tuesday, February 23, 2010

OCR Software and Character Correction

Optical Character Recognition and Character Correction

So what is character correction in the context of OCR?  The OCR process recognizes the characters in an image and converts them to text, and along the way some characters can be misidentified.  Typically, document capture applications let you handle commonly misinterpreted characters through a table of correction mappings.  Let's say a particular zone OCR field was designated as numbers only, and the engine read a "1" as an "l" (that is, a lowercase L in place of the digit one).  The correction piece of the recognition engine can apply that logic to the OCR output and make sure the text is properly interpreted. This can be really important, especially in SharePoint OCR environments where you need searchable PDFs in SharePoint.

This is just one of many ways to improve accuracy, but note you will need the right kind of OCR application that allows this feature to be enabled.
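
To make the correction table idea concrete, here is a minimal sketch in Python. It is not taken from any particular capture product: the mapping table, the function names, and the sample value are all assumptions made up for illustration. It simply shows how a numbers-only field could be cleaned up with a lookup of commonly confused characters.

```python
# Hypothetical correction table for a numbers-only zone: each key is a
# character an OCR engine might wrongly emit, each value is the digit it
# most likely stands for. Real products ship their own tables.
NUMERIC_CORRECTIONS = {
    "l": "1", "I": "1", "|": "1",
    "O": "0", "o": "0",
    "S": "5", "B": "8", "Z": "2",
}


def correct_numeric_field(raw: str) -> str:
    """Replace commonly misread characters with the digits they resemble."""
    return "".join(NUMERIC_CORRECTIONS.get(ch, ch) for ch in raw)


def needs_review(corrected: str) -> bool:
    """Flag values that still contain non-digits after correction."""
    return not corrected.isdigit()


if __name__ == "__main__":
    value = correct_numeric_field("l0O45")
    print(value)                # 10045
    print(needs_review(value))  # False
```

Anything the table cannot fix is flagged rather than guessed at, so an operator can still look at the field during QA.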

Friday, February 19, 2010

Are index fields really necessary when you have the full text OCR?

Full text OCR

Ah, the old debate, do I just perform optical character recognition on all my scanned documents, make them searchable OCR PDFs, and rely on the OCR to retrieve documents?  Why use index fields when I already have all the converted text?

Index fields, and the indexing process that produces them, provide structured data about the documents.  This data can be utilized, especially when using document capture software, to populate columns and index fields in your document management system.  Index fields provide faster retrieval, especially if you want to retrieve by specifying several criteria.  Relying only on OCR, or the recognized text, can get you in trouble.  First of all, you are assuming that the document will always have recognized text, and that all the items you are searching for are in that text.  Secondly, depending on the type of OCR format you have, you may have to find the document first, and then open it and parse out what you are looking for.  This can also lead to false positives in retrieval if many documents have the same terms in their OCR text.
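
The difference between the two retrieval styles can be sketched in a few lines of Python. Everything here is hypothetical (the field names, the three sample documents, and both search functions); it is only meant to show why a multi-criteria index search and a keyword search over OCR text can return different results.

```python
from dataclasses import dataclass


@dataclass
class ScannedDocument:
    doc_id: str
    doc_type: str    # index field
    vendor: str      # index field
    invoice_no: str  # index field
    ocr_text: str    # full text produced by OCR (may be empty)


# Made-up sample data for illustration only.
DOCS = [
    ScannedDocument("A1", "invoice", "Acme", "1001",
                    "Acme Corp invoice 1001 total due 250.00"),
    ScannedDocument("A2", "contract", "Acme", "",
                    "Agreement between Acme Corp and buyer; see invoice terms"),
    ScannedDocument("A3", "invoice", "Bolt", "1002", ""),  # OCR produced no text
]


def search_index(docs, **criteria):
    """Return ids of documents whose index fields match every criterion."""
    return [d.doc_id for d in docs
            if all(getattr(d, field) == value for field, value in criteria.items())]


def search_full_text(docs, *terms):
    """Return ids of documents whose OCR text contains every term."""
    return [d.doc_id for d in docs
            if all(term.lower() in d.ocr_text.lower() for term in terms)]


if __name__ == "__main__":
    print(search_index(DOCS, doc_type="invoice", vendor="Acme"))  # ['A1']
    print(search_full_text(DOCS, "acme", "invoice"))              # ['A1', 'A2']
    print(search_index(DOCS, doc_type="invoice"))                 # ['A1', 'A3']
    print(search_full_text(DOCS, "invoice"))                      # ['A1', 'A2']
```

The full text search pulls in the contract (a false positive) and misses the invoice that has no recognized text, while the index search returns exactly the invoices.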

Tuesday, February 16, 2010

Zone OCR and Accuracy within Recognition Zones

Zone OCR Accuracy

So when doing zone OCR, or Optical Character Recognition on a portion of a page, what features do I need to ensure I have the best possible accuracy?  They are listed below:

  • Utilize a document capture application that provides some type of page registration.  The problem with using zone OCR is that most engines utilize a set template of coordinates on the page, and just repeat this "zone" on each page.  If the scanner is off, or the page is skewed, you can get erroneous readings.  Page registration gives the recognition engine the ability to anchor on a page feature, and always locate the zone relative to the coordinates of that feature.
  • Utilize a scanning application that provides the ability to perform image processing on the zone prior to running Optical Character Recognition.  Removing lines, deshading, and despeckling can provide a cleaner zone, and thus improve overall accuracy.
  • Some advanced capture applications provide the ability to filter zones based on character sets.  This allows you to interpret the characters within a zone as, say, all numbers, or perhaps a date, which gives the engine a narrower character set for the whole recognition process.  iCapture, for example, not only allows character set mapping to zone OCR templates, but also provides auto-correction for the most commonly misinterpreted characters.
  • Finally, and highly recommended for the highest level of accuracy, is the ability to set a character matching filter for a zone.  This technology, sometimes called ADE, provides the ability to utilize regular expressions to ensure a match, and lets you overdraw the recognition area / zone and then filter the result down to the value you want.
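
Two of the ideas in the list above, page registration and regular expression filtering, can be sketched in Python. This is only an illustration under stated assumptions: the Zone type, the function names, the coordinates, and the date pattern are all invented here, and the registration step only handles a simple shift of the page, not rotation or skew.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Zone(NamedTuple):
    left: int
    top: int
    width: int
    height: int


def register_zone(zone: Zone,
                  template_anchor: Tuple[int, int],
                  found_anchor: Tuple[int, int]) -> Zone:
    """Shift a template zone by the offset between where the anchor feature
    sits in the template and where it was found on the scanned page."""
    dx = found_anchor[0] - template_anchor[0]
    dy = found_anchor[1] - template_anchor[1]
    return Zone(zone.left + dx, zone.top + dy, zone.width, zone.height)


# Hypothetical filter: pull a MM/DD/YYYY date out of an overdrawn zone.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def extract_match(zone_text: str, pattern: "re.Pattern[str]") -> Optional[str]:
    """Return the first piece of the zone text that fits the pattern."""
    match = pattern.search(zone_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    template_zone = Zone(left=400, top=120, width=200, height=40)
    # The anchor was at (50, 50) in the template but found at (62, 45).
    print(register_zone(template_zone, (50, 50), (62, 45)))
    # Zone(left=412, top=115, width=200, height=40)

    # Text recognized from a zone drawn larger than the date itself.
    print(extract_match("Invoice Date: 02/16/2010 Page 1", DATE_PATTERN))
    # 02/16/2010
```

Because the pattern does the final filtering, the zone can be drawn generously around the field, which makes the template more forgiving when pages shift a little.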

Saturday, February 13, 2010

Why use OCR Software to perform full text conversion of images?

OCR Software

When we scan documents, they are just images, pictures of our paper.  For many organizations, this scanned image is exactly what they need, and a little index information about the document is sufficient to provide them with retrieval capability.

So why take the time and spend the money to utilize OCR Software to convert the scanned document to a searchable format?  Below are some reasons to always perform full text OCR of scanned documents:

  1. Always provide every means possible for retrieval.  Just using index fields to search for scanned documents may seem like a fantastic idea, but what if the document is misidentified?  Or the indexer enters incorrect information?  Performing a full text OCR of the document can provide an insurance policy that a document can always be found through full text search.
  2. Document Capture software today provides fast, reliable OCR.  Most capture software on the market provides the ability to automatically convert the documents to a searchable format for a small expense.  Some of the engines on the market can do the conversion at 100+ pages per minute, so there is really not much time wasted in the OCR conversion / recognition process.
  3. OCR to PDF for a format that contains both image and text in one container.  Adobe provides the PDF image with hidden text option to give you a searchable file format that contains a pristine image.
  4. Plan for the worst case.  Audits...legal issues...sometimes you need to search beyond the index fields, and full text can give you the ability to find the needle in the haystack.
OCR applications give you the means and capabilities to convert images to searchable formats, and there are many reasons to do the full text conversion.

Tuesday, February 9, 2010

OCR Software - Distributed vs. Centralized

OCR Software - Distributed vs. Centralized

Ah, the centralized versus distributed question...it is one that is continually asked in the scanning, capture and document capture space.  Most associate OCR Software with familiar desktop applications like eCopy Desktop, OmniPage, PaperPort, etc.  These provide, in a way, distribution of the overall OCR process to end users.

There are applications on the market that can provide centralized and controlled OCR capabilities, through either a server or a workstation deployment.  One example is PSI:Capture from PSIGEN, an advanced document capture application that allows centralized OCR processing.  Why would you want to do this?  Well, in most cases, this type of OCR deployment model is utilized in conjunction with a document capture system, for centralized capture, indexing, QA, OCR and migration to a centralized DM / ECM system.  Typically, these systems offer a broad feature set, covering many different types of OCR functionality.

Tuesday, February 2, 2010

How fast is OCR Software? OCR Performance Testing

OCR Performance Testing

So which Desktop Optical Character Recognition Software is the fastest? Has the best overall performance when converting images to Word? When converting images to PDF?




I ran some testing with 4 basic desktop OCR applications to see which would have the fastest conversion times. The OCR applications are:



-eCopy Desktop (Uses the ReadIRIS OCR Engine)

-Adobe 8

-PaperPort 11

-OmniPage 15



I ran all the tests on a 9-month-old laptop, with a Dual Core 2 GHz processor and 2 GB of memory. I utilized all the "out of the box" settings on the apps, with no performance tuning of settings, and I timed how long each application took to convert a 100-page TIFF image to Word and to Adobe Image and Text PDF.



Results of the OCR Speed Test in minutes and seconds (Word/PDF):



eCopy Desktop 4:25/2:58

Adobe 8 3:54/3:22

OmniPage 15 2:16/2:16*

PaperPort 11 2:35**



*With OmniPage you run the conversion process and then save to your preferred format.

**PaperPort just had text conversion capabilities.
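
If you would rather compare these numbers as pages per minute, the conversion is simple arithmetic. The short Python sketch below uses the timings listed above and the 100 page test file; the dictionary layout and function names are just my own choices for illustration, and the PaperPort entry is labeled "Text" because it only did text conversion.

```python
PAGES = 100  # size of the test TIFF

# Timings from the results above, as minutes:seconds.
RESULTS = {
    "eCopy Desktop": {"Word": "4:25", "PDF": "2:58"},
    "Adobe 8": {"Word": "3:54", "PDF": "3:22"},
    "OmniPage 15": {"Word": "2:16", "PDF": "2:16"},
    "PaperPort 11": {"Text": "2:35"},
}


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to total seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def pages_per_minute(mmss: str, pages: int = PAGES) -> float:
    """Throughput for a run that converted `pages` pages in `mmss`."""
    return pages * 60 / to_seconds(mmss)


if __name__ == "__main__":
    for app, runs in RESULTS.items():
        for output_format, timing in runs.items():
            rate = pages_per_minute(timing)
            print(f"{app:14s} {output_format:5s} {timing}  {rate:.1f} pages/min")
```

By this measure the fastest run in the test (OmniPage 15 at 2:16) works out to roughly 44 pages per minute.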



I have to note that the eCopy Desktop test can be misleading in that it performs auto-orientation on all the pages before performing OCR. Also note that when evaluating an OCR application, speed is not the only factor. You need to decide up front whether you want speed, accuracy, both, or want to focus on formatting. I will write another article on formatting and which application is best in the near future.