Most academic research is ultimately disseminated through documents in the PDF format. This format has advantages in flexibility and portability, but it presents accessibility challenges that have stubbornly resisted solutions despite decades of attempts. Tagging PDFs is hard to automate because tags are currently generated visually, not semantically, which makes the output cluttered and manual correction tedious and error-prone. Ironically, this semantic structure already exists during authoring but is discarded during PDF rendering. This raises an obvious question: can we use this lost semantic information to better automate tagging in PDFs? In this paper, we develop iTagPDF, a system that refines generated metadata using the semantics in the source documents of research papers. We demonstrate that the metadata generated by iTagPDF already surpasses what authors currently submit to ACM conferences on many criteria. Our approach represents a concrete step toward finally automating accessibility remediation in research paper PDFs.
ACM CHI Conference on Human Factors in Computing Systems