Finding the Conversation: A Method for Scoring Documents for Natural Conversation Content

Abstract

With generative AI, acquiring the right training data is a critical part of designing the user experience. Training large language models to talk like humans requires exposing them to the interaction patterns distinctive of natural conversation. Although models are typically fine-tuned on question-answer or instruction pairs, they are less often trained on real-time human conversations. Natural conversation data are hard to find, and "conversation" is used to mean very different kinds of interaction or content. We demonstrate a method for scoring language content using generic conversational phrase detection. We generate three scores: 1) the range of unique features, 2) the density of features within sections of the content, and 3) an overall score combining the two. Using our method, we score over 27,000 documents from 6 datasets, which vary widely in whether or not they contain conversation content. Our results show this approach is effective in distinguishing conversation content from non-conversation content and from conversation-like content.
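
To make the three scores concrete, here is a minimal Python sketch of how such scoring could work. It is an illustration under stated assumptions, not the authors' implementation: the phrase patterns, the fixed 50-word sections, the normalization of each score to the range 0 to 1, and the simple average used for the overall score are all hypothetical choices, since the abstract does not specify them.

```python
import re

# Hypothetical phrase inventory: the abstract does not list the actual
# conversational features, so these five patterns are illustrative only.
PHRASE_PATTERNS = {
    "greeting": r"\b(?:hi|hello|hey)\b",
    "repair_request": r"\b(?:what do you mean|say that again|pardon)\b",
    "acknowledgment": r"\b(?:okay|ok|got it|i see)\b",
    "appreciation": r"\b(?:thanks|thank you)\b",
    "closing": r"\b(?:bye|goodbye|see you)\b",
}
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in PHRASE_PATTERNS.items()
}


def split_sections(text, size=50):
    """Split text into consecutive sections of at most `size` words.

    Fixed-size word windows are an assumption made for this sketch.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score_document(text, section_size=50):
    """Return hypothetical range, density, and overall scores in [0, 1]."""
    sections = split_sections(text, section_size)
    found = set()
    sections_with_hit = 0
    for section in sections:
        hits = {name for name, rx in COMPILED.items() if rx.search(section)}
        if hits:
            sections_with_hit += 1
        found |= hits
    # Range: share of distinct feature types seen anywhere in the document.
    range_score = len(found) / len(COMPILED)
    # Density: share of sections containing at least one feature.
    density_score = sections_with_hit / len(sections) if sections else 0.0
    # Overall: a plain average, chosen here only for simplicity.
    overall = (range_score + density_score) / 2
    return {"range": range_score, "density": density_score, "overall": overall}


if __name__ == "__main__":
    print(score_document("Hi there! Okay, thanks a lot. Bye!"))
    print(score_document("The quarterly report shows revenue growth."))
```

In this sketch, a short chat-like exchange scores high on both range and density, while expository prose with no matching phrases scores zero on all three. A real feature inventory would need to be much broader to separate natural conversation from conversation-like content such as scripted dialogue.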

Authors
Robert Moore
IBM Research, San Jose, California, United States
Sungeun An
IBM Research, San Jose, California, United States
Jay Pankaj Gala
IBM Research, San Jose, California, United States
Divyesh Jadav
IBM Research, San Jose, California, United States
DOI

10.1145/3706598.3714401

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714401

Conference: CHI 2025

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)

Session: Fabrication and Interaction Tools

G402
7 presentations
2025-05-01 01:20:00
2025-05-01 02:50:00