Earlier this year we covered the release of the Mueller Report on the Special Counsel's investigation into Russian interference in the 2016 US election. We highlighted how, in the digital age, the choices made in releasing that document made it rather a sad addition to the transcript of history.
There are, of course, many other investigations, and thus, other documents. Two newly-released PDF documents have also acquired a high profile, so it seems reasonable to take a look there as well. This article covers the transcript of Trump's call with the President of Ukraine.
This five-page PDF document posted on the White House's website was scanned from paper by a Savin MP C4503 multifunction device at 7:41 am on Wednesday, September 25. As anyone can see, the scanner was really dirty, generating a lot of noise. CLEAN YOUR SCANNER! It's not just to make the file prettier and quicker to download. Clean scans OCR much, much better, improving searchability, text extraction, compression and much more.
C'mon, people! The nation's security could depend on the accuracy of a search performed on a document scanned by one of these filthy things! 🙂
LESSON: Keep it digital and everything is easier for every user, forever.
As with the Mueller Report, from a technical point of view it was entirely unnecessary to scan the document. The PDF could have been "born digital" – created directly from the word-processor on which it was produced, making it automatically searchable, reusable, accessible and more. The subsequent additions to the original document (for example, the classification information at the top and bottom of page 1, and the red "Unclassified" stamps throughout) could have been handled digitally, without printing or scanning at all.
What did printing and scanning add to this document's workflow? It allowed the use of a felt pen to draw little black marks through "EYES ONLY" and "DO NOT COPY" and some other text.
This markup could also have been added to the PDF directly, using the digital markup tools available in any desktop PDF editor.
LESSON: OCR is always second-best. If imaging (scanning) is a must, the scanner must be clean!
Unlike the first edition of the Mueller Report, the PDF file posted by the White House was OCRed before it was posted, so it was text-searchable from the outset… sort of.
Due to the dirty scanner and the chosen workflow, involving rubber stamps and felt pens, the OCR results are quite poor. Here is a sample of the text embedded in the file:
him having your trust and y9ur .confidence and _ have persona·1
relations·with you so we c?n cooperate even ?ore so. I· wili.
personally tell you that one · of my assistants · spoke with Mr.
Giuliani just.recently and we are hoping very much that Mr.
G1uliani will be able to travel to Ukraine and. we will meet once
he co?es to Ukraine.
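One rough way to put a number on "quite poor" is to count the tokens that don't look like ordinary words. The sketch below is a heuristic under stated assumptions, not a real OCR-accuracy metric: the regular expression and the name `suspicious_tokens` are ours, chosen for illustration, and the check only catches visibly corrupted tokens.

```python
import re

# Excerpt of the embedded OCR text, copied from the sample above.
OCR_SAMPLE = """him having your trust and y9ur .confidence and _ have persona·1
relations·with you so we c?n cooperate even ?ore so. I· wili.
personally tell you that one · of my assistants · spoke with Mr.
Giuliani just.recently and we are hoping very much that Mr.
G1uliani will be able to travel to Ukraine and. we will meet once
he co?es to Ukraine."""

# Assumption: a "clean" token is letters only, optionally followed by an
# apostrophe suffix (e.g. "it's") and at most one trailing punctuation mark.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]+(?:'[a-z]+)?[.,;:!?]?$")


def suspicious_tokens(text):
    """Return the whitespace-separated tokens that fail the clean-token test."""
    return [tok for tok in text.split() if not CLEAN_TOKEN.match(tok)]


flagged = suspicious_tokens(OCR_SAMPLE)
total = len(OCR_SAMPLE.split())
print(f"{len(flagged)} of {total} tokens look damaged: {flagged}")
```

Note that this undercounts. A token such as "wili." passes the test even though it is an OCR error, so a search for "will" would still miss it. A proper measurement would compare the OCR text against the original word-processor text, which is exactly what a born-digital PDF would have preserved.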
NOTE: This scan was compressed with JBIG2 technology, which has the effect of smoothing burrs in the raw image. This is why the letters, seen close up, don't look like letters in raw scans but have unnaturally "smooth" edges.
Perhaps the government could consider creating such records directly from voice-recognition software, embedding the audio file and digitally signing the resulting PDF as an authoritative, archival record of the call.
It's been reported that this document was produced at least in part by voice-recognition software. Perhaps so, but it's also clear from the document's header and footer that it was produced in a word processor, and thus the voice-recognition output could have been edited prior to printing. There's nothing about this PDF to demonstrate that it is a complete or accurate rendering of the phone call it's intended to document.
As with the Mueller Report (first edition), this PDF document is untagged and thus inaccessible to users with disabilities, and therefore in violation of Section 508 of the Rehabilitation Act. As with the Mueller Report, this is a casualty of printing and scanning.
Nor does this PDF claim conformance with ISO 19005 (PDF/A), so it's not necessarily archival quality.
No other PDF features are used by this document. It has no metadata, no outlines, no annotations, no tags and no attached audio recording of the call. It's not digitally signed to protect against tampering. It can't deliver a good experience on a mobile phone.
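Readers who want a first look at such features in a file of their own can scan its raw bytes for the keys a PDF uses to declare them. The following is a minimal sketch, assuming only the Python standard library. The byte strings are real PDF and XMP names, but the labels and the function name `find_features` are illustrative. A loud caveat: PDF 1.5 and later files may store these keys inside compressed object streams, so a result of `False` means "not found in the raw bytes", not proof of absence. A definitive answer requires a real PDF parser or validator.

```python
# Heuristic feature check: look for well-known PDF keys in the raw file bytes.
# Assumption: the keys are not hidden inside compressed object streams.
MARKERS = {
    "tagged structure": b"/StructTreeRoot",    # root of the logical structure tree
    "outlines (bookmarks)": b"/Outlines",      # document outline entry
    "annotations": b"/Annots",                 # per-page annotation arrays
    "XMP metadata": b"/Metadata",              # metadata stream reference
    "document info dictionary": b"/Info",      # trailer entry for title, author, etc.
    "embedded files": b"/EmbeddedFiles",       # attachments such as an audio file
    "digital signature": b"/ByteRange",        # byte range covered by a signature
    "PDF/A claim": b"pdfaid:part",             # PDF/A identification in XMP
}


def find_features(pdf_bytes):
    """Map each feature label to True if its marker appears in the bytes."""
    return {label: marker in pdf_bytes for label, marker in MARKERS.items()}


# Usage (hypothetical file name):
#     with open("transcript.pdf", "rb") as fh:
#         for label, present in find_features(fh.read()).items():
#             print(f"{label}: {'found' if present else 'not found'}")
```

This approach trades accuracy for simplicity. It needs no third-party library and runs on any file, which makes it useful for a quick triage of a folder of scans before handing the interesting ones to a proper validator.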
In the digital age, PDF may be the best vehicle for distributing five images of printed pages, but there's a lot more that can be done with a high-quality digital document, preferably without printing and scanning it!
Do you want to learn more about PDF, the digital document format you use every day? PDF Association members are available worldwide to provide software, services and knowledge that help businesses leverage the world's preferred digital document format.
Join our newsletter, read up on PDF, come to a PDF Association event such as PDF Days Europe 2020 or ask questions in one of the LinkedIn groups discussing PDF technology. The PDF industry is looking forward to hearing from you!
Duff serves the PDF industry as ISO Project co-Leader and US TAG chair for both ISO 32000 (the PDF specification) and ISO 14289 (PDF/UA). As Executive Director of the PDF Association, Duff coordinates several working groups, speaks at a wide variety of industry events and promotes the advancement and adoption of PDF technology worldwide. An independent consultant, Duff Johnson is a veteran …