Motivation:
Group conversations are valuable for second language (L2) learners: they provide opportunities to practice listening and speaking, exercise complex turn-taking skills, and experience group social dynamics in a target language. However, most existing Augmented Reality (AR)-based conversational learning tools focus on dyadic interactions rather than group dialogues. Although research has shown that AR can help reduce speaking anxiety and create a comfortable space for practicing speaking skills in dyadic scenarios, especially with Large Language Model (LLM)-based conversational agents, the potential of these technologies for group language practice remains largely unexplored. We introduce ConversAR, a GPT-4o-powered AR application that enables L2 learners to practice contextualized group conversations. Our system features two embodied LLM agents with vision-based scene understanding and
live captions. In a system evaluation with 10 participants, users reported reduced speaking anxiety and increased learner autonomy compared with their perceptions of in-person practice with other learners.
System:
ConversAR is an AR application for the Meta Quest 3, developed in Unity, that enables L2 learners to engage in group conversations with two embodied LLM agents in a target language. The system incorporates voice recognition and response capabilities through integration with OpenAI's speech-to-text model (whisper-1) and chat model (gpt-4o). Before a conversation begins, the system takes a snapshot of the user's environment and performs object detection so that the agents can engage the user in contextually relevant dialogue. The agents take turns conversing with the user and each other, and the user can interject, answer, or ask questions to either agent at any point. We specifically chose an AR headset form factor because it provides the benefits of immersion and engagement observed in VR conversational systems [34, 40] while uniquely enabling us to interweave context from the user's physical world into the experience through scene understanding [22, 30]. Whereas fully virtual systems are limited to pre-determined 3D-rendered environments, our AR approach lets users hold conversations that reflect their actual surroundings, supporting more relevant L2 practice across diverse contexts. We iteratively developed and refined the system by conducting a pilot study with two adult university undergraduate students enrolled in an intermediate Spanish course. Each pilot tester had a 10-minute conversation using the system and was briefly interviewed, providing insights for improving conversation flow, agent behavior, and interaction design.
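The turn-taking behavior described above can be sketched as a simple orchestration loop. This is a minimal illustration, not the authors' implementation: the function names, the two-agent alternation scheme, and the stubbed LLM call (which a real system would replace with a gpt-4o request carrying each agent's persona prompt and the detected scene objects) are all our assumptions.

```python
# Hypothetical sketch of a two-agent turn-taking loop with user
# interjections, conditioned on objects detected in the user's scene.
# All names here are illustrative assumptions, not ConversAR's code.

def llm_reply(agent, history, scene_objects):
    """Stub standing in for a gpt-4o chat call; a real system would send
    the conversation history plus a persona prompt and scene context."""
    obj = scene_objects[len(history) % len(scene_objects)]
    return f"{agent} comments on the {obj}"

def run_conversation(turns, scene_objects, user_interjections=None):
    """Alternate two agents for `turns` turns. A user interjection at
    turn i is inserted before that turn's agent replies, so the agent
    whose turn it is responds to the user before alternation resumes."""
    user_interjections = user_interjections or {}
    agents = ["Agent A", "Agent B"]
    history = []  # list of (speaker, utterance) pairs
    for i in range(turns):
        if i in user_interjections:
            history.append(("User", user_interjections[i]))
        speaker = agents[i % 2]
        history.append((speaker, llm_reply(speaker, history, scene_objects)))
    return history
```

For example, `run_conversation(4, ["bookshelf", "lamp"], {2: "¿Qué libro recomiendan?"})` yields four agent turns in A/B alternation with the user's question inserted before the third agent reply.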
Approach:
To evaluate our system, we recruited 10 adult university undergraduate students (4 male, 6 female, aged 18–23) who had taken or were currently taking an intermediate or higher-level Spanish course (Appendix A.5). We recruited participants via email and word of mouth. Each participant was compensated $15 USD after the hour-long study. We conducted the study in a university library, with the agents prompted to discuss favorite books and literary genres. First, we showed participants a 5-minute walk-through video; then each participant engaged in a 10-minute group conversation using the system. We recorded each conversation. Next, participants completed a survey with 7-point Likert scale questions about system usability, their perceptions of the system and task, and their perceptions of in-person L2 group conversations (Appendix A.2). Finally, we conducted a ∼30-minute semi-structured interview about the overall experience, perception of the activity, perception of system features, engagement with system features, and open feedback (Appendix A.1). Following a prior study in L2 task design [29], we measured conversational engagement metrics across behavioral, cognitive, and social dimensions. Behavioral engagement refers to the degree of active involvement in the task, cognitive engagement refers to sustained attention and mental effort, and social engagement refers to the degree of reciprocity and mutual involvement during the task [29]. One researcher, a native Spanish speaker, collected these metrics from the conversation recordings to contextualize the interview responses (Appendix A.4). In addition, the researcher gave each participant a score based on the Advanced Placement (AP) interpersonal speaking rubric as a proxy for conversational depth [11] (Appendix A.5). Our university's Institutional Review Board reviewed and approved our study.