Archives in the Digital Age - Abderrazak Mkadmi

1.2.2.2. Document pre-processing


After a document has passed through a scanner, the result is always a file in an image format. The nature of these images depends on the original documents scanned and on the subsequent processing. Depending on requirements, the images can be in black and white (or converted to black and white), in grayscale or in color. Color images can use 8, 16, 24, 30 or 36 bits per pixel. Each increase in resolution increases both the clarity and the size of the image.
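The relationship between resolution, bit depth and file size can be made concrete with a minimal sketch (the function name and the approximate A4 pixel dimensions are illustrative assumptions, not from the source):

```python
def uncompressed_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed raster size: one sample of bits_per_pixel per pixel,
    rounded up to whole bytes."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8

# An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
print(uncompressed_size_bytes(2480, 3508, 1))    # bitonal (black and white)
print(uncompressed_size_bytes(2480, 3508, 8))    # grayscale
print(uncompressed_size_bytes(2480, 3508, 24))   # 24-bit color
```

The same page grows from about 1 MB in bitonal form to about 26 MB in 24-bit color, which is why the compression methods below matter for archiving.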

Several types of processing can be provided to be able to exploit the digitized documents:

 – Compression: this consists of reducing the size of files, thus reducing the space used on archiving media and facilitating their circulation on networks. Several compression methods exist, depending on the scanning method and the nature of the original documents:

- CCITT G3/G4 compression, also known as “G4” or “Modified Modified READ (MMR)”, is a lossless image compression method used in Group 4 facsimile machines, as defined in the ITU-T T.6 fax standard. It is only used for bitonal (black and white) images. Group 4 compression is available in many proprietary image file formats, as well as in standard formats such as TIFF (Tagged Image File Format), CALS (Computer-aided Acquisition and Logistics Support), CIT (Intergraph Raster Type 24) and PDF (Portable Document Format);

- JBIG (Joint Bi-level Image Experts Group) compression: a bi-level compression of an image, in which a single bit is used to express the color value of each pixel. The standard can also be used to code grayscale images and color images with a limited number of bits per pixel. JBIG is designed for images sent using facsimile coding and offers significantly higher compression than Group 3 and Group 4 facsimile coding;

- the JPEG (Joint Photographic Experts Group) algorithm is used to reduce the size of color images. This graphic file format allows very high compression rates, but the compression entails a loss of information that degrades the quality of the image;

 – Optical Character Recognition (OCR): the purpose of OCR is to convert text in image format into a computer-readable text format, by translating the groups of dots in a scanned image into characters with the associated formatting. It is carried out by dedicated software known as OCR engines. The challenge today is to find, among the many tools of this type, the most efficient OCR engine best suited to the application. Among the criteria for choosing a tool, effectiveness, that is, a high recognition rate, is often cited; the objective is a rate of 100%. However, the recognition rate does not depend solely on the recognition engine, but also on several other factors, such as the material preparation of the paper document upstream and the parameters used to adapt the OCR engine to the type of content, taking into account, among other things, the language, quality and layout of the document.
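The core idea behind the bitonal fax codecs above is run-length coding: a scanline of a black-and-white scan is long runs of identical pixels, so storing run lengths instead of pixels is far more compact. The following is a simplified sketch of that idea, not the actual T.6 codec (which additionally codes runs relative to the previous line and uses variable-length codewords):

```python
def rle_encode(row):
    """Run-length encode a bitonal scanline (sequence of 0s and 1s).
    Fax convention: runs alternate in color starting with white (0),
    so only the run lengths need to be stored, not the colors.
    A row starting with black gets an initial white run of length 0."""
    runs = []
    current, length = 0, 0
    for pixel in row:
        if pixel == current:
            length += 1
        else:
            runs.append(length)
            current, length = pixel, 1
    runs.append(length)
    return runs

# A mostly-white scanline with a short black stroke compresses to 3 numbers.
print(rle_encode([0] * 10 + [1] * 3 + [0] * 20))  # [10, 3, 20]
```

A 33-pixel row becomes three run lengths, which illustrates why G3/G4 works so well on scanned text pages that are overwhelmingly white.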

OCR can be applied within an ERM system in two ways:

1) Application to whole pages of text in order to index them in full text, with the help of spell checkers.

2) Application to certain areas within the pages (such as titles) in order to use them as an index. Technologies that build on OCR techniques to extract information from digitized documents and enrich their metadata (category, author, title, date, etc.) have existed for a long time:

- Automatic Document Recognition (ADR), which consists of distinguishing one type of document from another according to a few pre-defined parameters, making it possible to sort images electronically;

- automatic document reading: this technology uses artificial intelligence techniques to perform linguistic checks on recognized words and to interpret them using text-mining functions, for the purpose of pre-analysis and/or thematic classification of the scanned documents.

In addition, OCR technology remains limited: it depends on the quality of the text to be scanned (whether it is distorted, faded, stained, folded, contains handwritten annotations, etc.) and on the quality of the scan itself. It often generates interpretation errors that require human intervention to correct; without such correction, raw OCR output cannot be reliably read or indexed by search engines. This is why the correction work is generally outsourced to service providers who use low-cost labor or, in the absence of financial means, to Internet users. The latter alternative, increasingly used by library and archive services, is called crowdsourcing. Several OCR projects have been developed in this way, including the correction of digitized newspaper texts for the National Library of Australia, the correction of OCR through gamification for the National Library of Finland and the involuntary correction of OCR via reCAPTCHA for the Google Books service, among other projects [AND 17].
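The recognition rate discussed above is commonly measured as character accuracy against a hand-corrected reference: the fewer edits needed to turn the OCR output into the ground truth, the higher the rate. A minimal sketch, assuming accuracy is defined from the Levenshtein edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ground_truth, ocr_output):
    """1 - (errors / reference length), floored at 0."""
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / max(len(ground_truth), 1))

# One substituted character ("l" read as "1") over 16 characters.
print(character_accuracy("digital archives", "digita1 archives"))  # 0.9375
```

Even a seemingly high rate like 99% still means roughly one error every two lines of text, which is why human post-correction, including crowdsourced correction, remains necessary.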

