GLAM/Case studies/National Library of France

Wikimédia France signed an agreement with the Bibliothèque nationale de France (BnF) in 2010 to provide the French Wikisource with 1400 public domain texts that had been prepared for the library's Gallica web resource. As explained in the 7 April announcement, the automatic OCR process used to digitize the material is prone to frequent errors on such old texts, so the transcriptions stand to benefit from human proofreading by Wikisource volunteers.

A team of three volunteers from Wikimédia France then retrieved high-resolution image files (in the lossless but bulky TIFF format) and OCR files from the BnF, and produced DjVu files that were uploaded to Wikimedia Commons in July 2010. However, the heavy compression used in converting the image files to DjVu resulted in a substantial loss of quality. Subsequently, all of the original, high-resolution TIFF files were uploaded to Wikimedia Commons at the end of August.

The BnF's OCR files, which indicate the position of each word and of all graphical elements such as illustrations in the books, allowed the extraction of more than 22,000 image files. Many of them may be of limited interest (detection errors, mere black lines, stamps) or duplicates, and thus require human review before a mass upload to Wikimedia Commons. Nonetheless, many interesting images were obtained, such as educational diagrams, novel illustrations, scientific schematics, portraits, and maps. User:BnF import added the files to Commons, while human editors created pages for the works on Wikisource.
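
The extraction step described above can be illustrated with a minimal Python sketch. It is not the script used for the import: the `Region` layout, the function names, and both thresholds are hypothetical, and the actual structure of the BnF's OCR files is not described here. It assumes each graphical element has already been read from the OCR data as a pixel rectangle, and it only sorts candidates into likely illustrations and likely noise (such as ruled black lines), leaving the flagged ones for human review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A graphical element reported by OCR, in pixels (hypothetical layout)."""
    page: int
    x: int
    y: int
    width: int
    height: int


def crop_box(region):
    """Return (left, upper, right, lower), the box form that image
    libraries such as Pillow's Image.crop accept."""
    return (region.x, region.y,
            region.x + region.width, region.y + region.height)


def looks_like_noise(region, min_side=40, max_aspect=15.0):
    """Flag regions that are tiny or extremely elongated, e.g. black rules.

    Both thresholds are illustrative guesses, not values from the import.
    """
    short = min(region.width, region.height)
    long_side = max(region.width, region.height)
    if short == 0 or short < min_side:
        return True
    return long_side / short > max_aspect


def split_candidates(regions):
    """Split regions into (keep, review); flagged ones still need a human."""
    keep, review = [], []
    for region in regions:
        (review if looks_like_noise(region) else keep).append(region)
    return keep, review
```

For example, a region of 800 by 600 pixels would be kept, while one of 1200 by 4 pixels would be set aside for review as a probable rule line. A size filter of this kind cannot recognize stamps or duplicates, which is one reason the review by volunteers remains necessary.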

For a completed example, see the text Germaine, by Edmond About (images here).

Adapted from Wikipedia:Wikipedia Signpost/2010-04-12/News and notes and Wikipedia:Wikipedia Signpost/2010-09-13/News and notes.