The paper introduces the problem of reference resolution, which is crucial for conversational AI and voice assistants to understand the context of user queries. This includes resolving references to entities on the user's screen, entities from previous conversational turns, and background entities such as an alarm or music playing in the background. The authors propose using large language models (LLMs) for this task, given their strong performance across a wide range of natural language processing (NLP) tasks.
Traditional reference resolution systems have focused on conversational and visual/deictic references, while on-screen references have received comparatively little attention. The authors argue that a single LLM-based model can handle all of these reference types end to end, rather than relying on separate, hand-crafted pipelines for each.
The authors propose a novel approach to encoding on-screen entities as text input for large language models. This involves:
- Parsing the screen to extract candidate entities along with their bounding boxes and surrounding text.
- Sorting the parsed elements top-to-bottom and left-to-right based on their bounding-box positions.
- Reconstructing a purely textual representation of the screen that preserves the relative spatial layout, with candidate entities tagged so the model can refer to them by index (a sketch of this kind of encoding follows below).
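To make this concrete, here is a minimal sketch of a layout-preserving screen-to-text encoding. It assumes each parsed element comes with a bounding box and an entity flag; the `ScreenElement` fields, the `[i]` tag format, and the line-grouping tolerance are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    text: str        # text content parsed from the screen
    top: float       # bounding-box top edge, in pixels
    left: float      # bounding-box left edge, in pixels
    is_entity: bool  # True if this element is a candidate entity


def encode_screen(elements: list[ScreenElement], line_tol: float = 10.0) -> str:
    """Render parsed screen elements as one text block that preserves their
    rough spatial order (top-to-bottom, left-to-right). Candidate entities
    are wrapped in numbered tags so the model can refer to them by index."""
    ordered = sorted(elements, key=lambda e: (e.top, e.left))

    # Group elements whose top edges are within `line_tol` pixels into one visual line.
    lines: list[list[ScreenElement]] = []
    for el in ordered:
        if lines and abs(el.top - lines[-1][0].top) <= line_tol:
            lines[-1].append(el)
        else:
            lines.append([el])

    index = 0
    rendered = []
    for line in lines:
        parts = []
        for el in sorted(line, key=lambda e: e.left):
            if el.is_entity:
                parts.append(f"[{index}] {el.text}")  # illustrative tag format
                index += 1
            else:
                parts.append(el.text)
        rendered.append("\t".join(parts))  # tabs separate items on one visual line
    return "\n".join(rendered)


# Example: two elements on the same visual line, one element below.
print(encode_screen([
    ScreenElement("Joe's Pizza", top=100, left=10, is_entity=True),
    ScreenElement("Open now", top=102, left=200, is_entity=False),
    ScreenElement("415-555-0132", top=140, left=10, is_entity=True),
]))
```

The point of the sketch is that spatial structure is flattened into plain text (tabs within a line, newlines between lines), so a text-only LLM can still exploit the relative positions of on-screen items.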
The authors used three types of datasets for training and evaluation:
- Conversational data, where the referenced entities come from previous dialogue turns.
- Synthetic data, generated from templates to cover additional entity and query types.
- On-screen data, collected from screens containing entities such as phone numbers, email addresses, and physical addresses.
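For illustration, a single training instance in this setup might pair a serialized entity list with the indices of the referenced entities. The prompt wording, field names, and label format below are hypothetical, intended only to show how reference resolution can be posed as a text-to-text task.

```python
# Hypothetical training example: the model receives the user request plus the
# serialized candidate entities, and must output the index of the entity (or
# entities) the request refers to. All field names and wording are illustrative.
example = {
    "input": (
        "Resolve the references in the user request.\n"
        "Entities:\n"
        "0. type: business, value: Joe's Pizza\n"
        "1. type: phone_number, value: 415-555-0132\n"
        "2. type: address, value: 123 Main St\n"
        "Request: call the one at the bottom"
    ),
    "target": "1",  # index of the referenced entity
}
```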
The proposed model, ReALM, is a FLAN-T5 large language model fine-tuned on the reference resolution datasets described above (at several sizes, from roughly 80M to 3B parameters). The authors compare ReALM's performance against two baselines:
- MARRS, an existing non-LLM reference resolution system with similar functionality.
- GPT-3.5 and GPT-4, prompted to perform the same task.
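A minimal fine-tuning sketch in the spirit of this setup is shown below, using the Hugging Face `transformers` and `datasets` libraries with a public FLAN-T5 checkpoint. This is not the authors' training code: the checkpoint choice, the tiny inline dataset (which stands in for examples like the one above), the hyperparameters, and the output path are all placeholders.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Placeholder data: in practice this would be the full set of
# reference-resolution examples (request + serialized entities -> indices).
train_data = Dataset.from_dict({
    "input": ["Entities:\n0. alarm 7:00 AM\nRequest: turn it off"],
    "target": ["0"],
})

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")


def preprocess(batch):
    # Tokenize inputs and targets; targets become the seq2seq labels.
    enc = tokenizer(batch["input"], truncation=True, max_length=1024)
    labels = tokenizer(batch["target"], truncation=True, max_length=32)
    enc["labels"] = labels["input_ids"]
    return enc


train_data = train_data.map(preprocess, batched=True,
                            remove_columns=["input", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="realm-finetune",      # placeholder output path
        per_device_train_batch_size=8,    # placeholder hyperparameters
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```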
The authors report that ReALM outperforms the MARRS baseline across different types of references, including on-screen entities. ReALM also achieves performance comparable to or better than GPT-4, despite being a much lighter model. Key findings include:
- The smallest ReALM model obtains absolute gains of over 5% on on-screen references compared to MARRS.
- The smallest ReALM model performs comparably to GPT-4, while the larger ReALM models substantially outperform it.
The paper demonstrates that large language models can be used effectively for reference resolution, enabling conversational AI and voice assistants to understand the context of user queries. The proposed approach encodes on-screen entities as textual representations, allowing LLMs to resolve references to various entity types. The authors suggest exploring more complex spatial encoding techniques, such as grid-based representations, as future work.