OCR (Appendix)

By default, CROSSCAP Enterprise uses the open-source Tesseract engine for text recognition (Tesseract was originally developed by Hewlett-Packard; its development was later sponsored by Google). Optionally, you may use the commercially available Abbyy FineReader engine. This requires appropriate licensing (please contact our sales department for details).

Scanning considerations

You will achieve the best text recognition results with source documents that contain few or no illustrations. Ideally, such documents should be scanned at the highest feasible resolution and in bi-tonal (monochrome) mode.
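To illustrate what bi-tonal mode means for the image data, the following minimal sketch converts a greyscale page into a two-level image with a fixed threshold. It is not how CROSSCAP or the scanner performs the conversion; the function name to_bitonal, the threshold of 128 and the tiny sample page are assumptions made purely for illustration.

```python
from typing import List

GrayImage = List[List[int]]     # rows of grey values, 0 (black) to 255 (white)
BitonalImage = List[List[int]]  # rows of 0/1 values, 1 = ink, 0 = paper


def to_bitonal(image: GrayImage, threshold: int = 128) -> BitonalImage:
    """Map every pixel darker than the threshold to ink (1), all others to paper (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# Hypothetical 3 x 4 greyscale sample: light paper with a few dark strokes.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 251],
    [252, 249, 35, 247],
]
bitonal = to_bitonal(page)
```

A fixed threshold works well on clean, evenly lit originals; documents with coloured or uneven backgrounds usually need additional image processing before the conversion.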

OCR default settings

Before first use, you should configure all necessary OCR default settings (see chapter Main menu (server), section Administration toolbar).

Continuous text recognition (Full Text OCR)

Text recognition may be performed on the entire area of every scanned image, i.e. any recognizable text will be processed. In this case, all necessary OCR settings are made in the export settings of the desired output format.

The following export formats allow for full-text OCR:

PDF file

TXT file

Word file (available only when using the Abbyy FineReader OCR engine)

XML file

Localized text recognition (Zonal OCR)

You may also configure the text recognition engine to process specific areas within images. This is referred to as zonal OCR and may be applied in two different ways:

You may apply zonal OCR automatically, e.g. for creating index data. All required settings must be made before the project starts. Please find detailed information on this in chapter Templates (server), in the OCR section on Image processing.

Alternatively, you may perform zonal OCR manually during the course of a project. Recognized text will be placed in the Windows clipboard and may then be transferred to other applications for further processing. Please find detailed information on this in the CROSSCAP Scan-Client manual.
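The idea behind zonal OCR can be sketched as follows: a zone is a named rectangle on the page, and only the pixels inside that rectangle are handed to the recognition engine. The Zone class, its field names and the crop_zone function are hypothetical and do not reflect CROSSCAP's template format or internal implementation; the example stops before any OCR engine is called.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # rows of pixel values


@dataclass(frozen=True)
class Zone:
    """A named rectangular area on the page, measured in pixels (hypothetical)."""
    name: str
    left: int
    top: int
    width: int
    height: int


def crop_zone(image: Image, zone: Zone) -> Image:
    """Return only the pixels inside the zone, clipped at the image borders."""
    top = max(zone.top, 0)
    bottom = max(zone.top + zone.height, 0)
    left = max(zone.left, 0)
    right = max(zone.left + zone.width, 0)
    # Slices that reach past the image edge are shortened automatically.
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 4 x 6 bi-tonal page with some ink in the middle.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
invoice_zone = Zone(name="invoice_number", left=2, top=1, width=3, height=2)
zone_pixels = crop_zone(page, invoice_zone)
```

In an automatic setup, one such zone would typically be defined per index field, and the text recognized in each zone would be stored under the zone's name.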

Image processing functions affecting or supporting OCR

The following image processing functions will improve or affect OCR results.

Please refer to the respective sections for more information:

Color replacement - use to remove background coloring.

Deskew - use to straighten skewed scans, so that lines of text run truly horizontal.

Line removal - use to get rid of interfering lines or frames.

Punch hole removal - use to get rid of interfering punch-holes.

Despeckle - use to get rid of interfering smudges and speckles.
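As an illustration of why such clean-up helps OCR, the following sketch shows a very simple form of despeckling on a bi-tonal image: any ink pixel without a single ink pixel among its eight neighbours is treated as noise and cleared. This is an assumed, simplified technique for explanation only and is not CROSSCAP's despeckle filter; the function name despeckle and the sample page are hypothetical.

```python
from typing import List

BitonalImage = List[List[int]]  # rows of 0/1 values, 1 = ink, 0 = paper


def despeckle(image: BitonalImage) -> BitonalImage:
    """Clear every ink pixel that has no ink pixel among its 8 neighbours."""
    height = len(image)
    width = len(image[0]) if image else 0

    def has_ink_neighbour(y: int, x: int) -> bool:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue  # skip the pixel itself
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                    return True
        return False

    # Decisions are based on the original image, so the result does not
    # depend on the order in which pixels are visited.
    return [
        [1 if image[y][x] and has_ink_neighbour(y, x) else 0 for x in range(width)]
        for y in range(height)
    ]


# Hypothetical page: a small connected stroke plus two isolated specks.
page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = despeckle(page)
```

Practical despeckle filters usually work with a configurable speckle size rather than single pixels, so that small but genuine marks such as dots and punctuation are preserved.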