Instrumental music conveys rich affective experiences through acoustic cues, yet instrumental passages often remain inaccessible to Deaf and Hard-of-Hearing (DHH) audiences. Although captioning practices for vocal songs have expanded, instrumental music remains largely uncaptioned, with no established criteria for representing musical content in text. We propose 𝜇Cap (Music Captions), an automatic instrumental music captioning system that transforms instrumental audio into time-aligned, non-lexical textual renderings enhanced with simple visuals. Drawing on preliminary surveys with DHH individuals and expert group discussions, we developed a phonetic-like captioning schema grounded in music sound analysis and linguistics. We then implemented 𝜇Cap using audio feature extraction and a Retrieval-Augmented Generation (RAG) pipeline to produce expressive, sound-mimetic captions. Two user evaluations with DHH participants (n=20 and n=15) showed that 𝜇Cap enhanced music appreciation, immersion, and the perceived presence of acoustic detail. This work contributes empirical evidence and design insights for caption-based visual representations that make instrumental music more accessible.
ACM CHI Conference on Human Factors in Computing Systems