Finding the Conversation: A Method for Scoring Documents for Natural Conversation Content

With generative AI, acquiring the right training data is a critical part of designing the user experience. Training large language models to talk like humans requires exposing them to the interaction patterns characteristic of natural conversation. Although models are typically fine-tuned on question-answer or instruction pairs, they are less often trained on real-time human conversations. Natural conversation data are hard to find, and "conversation" is used to mean very different kinds of interaction or content. We demonstrate a method for scoring language content using generic conversational phrase detection. We generate three scores: 1) the range of unique features, 2) the density of features within sections of the content, and 3) an overall score combining the two. Using our method, we score over 27,000 documents from 6 datasets, which vary widely in whether they contain conversation content. Our results show this approach is effective in distinguishing conversation content from non-conversation and from conversation-like content.
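The three scores named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the phrase inventory `PHRASES`, the fixed-size word windows used as "sections", the `section_size` parameter, and the simple average used for the overall score are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical phrase inventory (an assumption): the abstract does not list
# the generic conversational phrases the method actually detects.
PHRASES = {
    "greeting": r"\b(hi|hello|hey)\b",
    "thanks": r"\bthank(s| you)\b",
    "repair": r"\bwhat do you mean\b",
    "acknowledgment": r"\b(okay|ok|got it)\b",
    "closing": r"\b(bye|goodbye)\b",
}
PATTERNS = {name: re.compile(rx, re.IGNORECASE) for name, rx in PHRASES.items()}


def score_document(text, section_size=50):
    """Return range, density, and overall scores, each between 0 and 1.

    Sections are fixed windows of `section_size` words (an assumption).
    """
    words = text.split()
    sections = [
        " ".join(words[i:i + section_size])
        for i in range(0, len(words), section_size)
    ]

    found = set()      # distinct feature types seen anywhere in the document
    hit_sections = 0   # number of sections containing at least one feature
    for section in sections:
        hits = {name for name, pat in PATTERNS.items() if pat.search(section)}
        found |= hits
        if hits:
            hit_sections += 1

    # Range: share of the phrase inventory that appears at least once.
    range_score = len(found) / len(PATTERNS)
    # Density: share of sections that contain any conversational feature.
    density_score = hit_sections / len(sections) if sections else 0.0
    # Overall: plain average of the two (an assumed combination rule).
    overall = (range_score + density_score) / 2
    return {"range": range_score, "density": density_score, "overall": overall}


if __name__ == "__main__":
    print(score_document("Hi there! Okay, thanks for the help. Bye!"))
    print(score_document("The quarterly report shows revenue growth across all regions."))
```

Keeping range and density separate is what lets a scorer of this kind tell apart a document that uses one conversational phrase repeatedly from one that shows many different conversational features only in a short passage.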

IBM Research, San Jose, California, United States

10.1145/3706598.3714401

https://dl.acm.org/doi/10.1145/3706598.3714401

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)

G402

7 presentations

Start time: 2025-05-01 01:20:00

End time: 2025-05-01 02:50:00
