Friday, February 19, 2010

Are index fields really necessary when you have the full text OCR?

Full text OCR

Ah, the old debate: do I just perform optical character recognition on all my scanned documents, make them searchable OCR PDFs, and rely on the OCR text to retrieve documents?  Why use index fields when I already have all the converted text?

Index fields, or performing the indexing process, provide structured data about the documents.  This data can be utilized, especially when using document capture software, to map into the columns and index fields in your document management system.  Index fields provide faster retrieval, especially if you want to retrieve documents by specifying several criteria at once.  Relying on OCR, or the recognized text, alone can get you in trouble.  First of all, you are assuming that the document will always have recognized text, and that all the items you are searching for are in that text.  Secondly, depending on the type of OCR format you have, you may only be able to locate the document, and then have to open it and parse out what you are looking for.  Full text search can also lead to false positives in retrieval if many documents have the same terms in their OCR text.
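To make the difference concrete, here is a minimal sketch in Python.  It is not taken from any particular document management product; the ScannedDocument class, the field names (doc_type, vendor, year), and the three sample documents are all made up for illustration.  It compares a naive full text search over OCR text with an exact match on index fields.

```python
from dataclasses import dataclass


@dataclass
class ScannedDocument:
    """Hypothetical record: one scanned document with its metadata and OCR text."""
    doc_id: str
    index_fields: dict  # structured values keyed in (or extracted) at capture time
    ocr_text: str       # raw recognized text; empty if OCR produced nothing


# Made-up sample data, for illustration only.
DOCUMENTS = [
    ScannedDocument(
        "doc-1",
        {"doc_type": "invoice", "vendor": "Acme", "year": 2009},
        "Invoice from Acme Corp. Total due 1,200.00",
    ),
    ScannedDocument(
        "doc-2",
        {"doc_type": "contract", "vendor": "Globex", "year": 2009},
        "This contract supersedes the Acme invoice terms of 2008.",
    ),
    ScannedDocument(
        "doc-3",
        {"doc_type": "invoice", "vendor": "Acme", "year": 2010},
        "",  # poor scan: OCR recognized no text at all
    ),
]


def search_full_text(docs, *terms):
    """Return IDs of documents whose OCR text contains every term (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return [d.doc_id for d in docs if all(t in d.ocr_text.lower() for t in lowered)]


def search_index_fields(docs, **criteria):
    """Return IDs of documents whose index fields match every given criterion exactly."""
    return [
        d.doc_id
        for d in docs
        if all(d.index_fields.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Full text: picks up the contract (false positive) and misses the unreadable invoice.
    print(search_full_text(DOCUMENTS, "acme", "invoice"))
    # Index fields: returns only the two Acme invoices.
    print(search_index_fields(DOCUMENTS, doc_type="invoice", vendor="Acme"))
```

In this toy data set, the full text search for "acme" and "invoice" returns the Globex contract because it happens to mention both words, and it misses doc-3 because the scan produced no text.  The index field search returns exactly the two Acme invoices, and adding another criterion such as year=2010 narrows it further without opening any document.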

