Open Source OCR and PDF compression at the Internet Archive

May 17, 2024 from 1:40 pm to 2:10 pm

Speaker: Merlijn Wajer

Archive PDF tools is the software used at the Internet Archive to generate highly compressed PDFs (with selectable text layers) from digitized content, mostly books. The same techniques can easily be applied to any archives one might have at home. This talk will cover the creation of the Internet Archive PDF tools and the technical challenges encountered, and will discuss the underlying algorithms.