Programmers formulate prompts for code generation based on their understanding of the problem domain and of the LLM's ability to follow instructions. A deficiency in either understanding yields inadequate generated code, requiring programmers to revise their understanding based on the shortcomings they observe in that code. We propose ReFiQ (``Result-first Queries''), a tool that simultaneously presents concrete run-time execution results for multiple generated code variations of a single prompt. In an exploratory user study (n=8), we observed that programmers systematically compared these results to identify gaps in their understanding, deliberately formulated single prompts to explore a range of options for a domain concept, and more often made deliberate decisions informed by their observations instead of directly applying generated code. Participants required fewer iterations to arrive at a satisfying solution with ReFiQ than with our baseline, GitHub Copilot.