Asana's commitment to leveraging the latest and most powerful AI models requires rigorous testing to ensure its AI features provide reliable insights for enterprise customers. To that end, Asana has built a robust, high-throughput Quality Assurance (QA) process that reflects real-world collaborative work management use cases.
Asana's LLM Foundations team has created an in-house unit testing framework that lets engineers test LLM responses during development, much like traditional unit testing. The framework checks assertions by making calls to LLMs, verifying that responses are accurate. Asana uses this method to test a model's ability to capture key details of a task, such as the launch date, and for its "needle-in-a-haystack" testing.
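As a rough illustration of the pattern, the sketch below uses the Anthropic Python SDK as a judge. Asana's internal framework is not public, so the `llm_assert` helper, the judge prompt, and the sample summary are assumptions for illustration only.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def llm_assert(response_text: str, assertion: str) -> bool:
    """Ask a judge model whether response_text satisfies the assertion."""
    judgment = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Response under test:\n{response_text}\n\n"
                f"Assertion: {assertion}\n"
                "Answer with exactly YES or NO."
            ),
        }],
    )
    return judgment.content[0].text.strip().upper().startswith("YES")

def test_summary_captures_launch_date():
    # Stand-in for real feature output; a live test would call the feature.
    summary = "The Q3 launch task is on track, with a launch date of June 12."
    assert llm_assert(summary, "The summary states a launch date of June 12.")
```

Because the judge is itself an LLM, such assertions are typically phrased as narrow yes/no questions to keep results stable across runs.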
Asana's AI-powered features often involve chaining multiple prompts, including tools selected agentically. Integration testing in Asana's LLM testing framework helps assess these chains before feature release, ensuring that each chain can retrieve the necessary data and generate accurate user-facing responses.
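A hedged sketch of what an integration test over such a chain could look like follows; the two-step chain, the stubbed `search_tasks` tool, and the test data are illustrative assumptions rather than Asana's actual code.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"

def search_tasks(query: str) -> str:
    # Stubbed data source so the test is deterministic and self-contained.
    return "Task 'Launch v2' is due 2024-07-01 and assigned to Dana."

TOOLS = {"search_tasks": search_tasks}

def run_chain(question: str) -> tuple[str, str]:
    # Step 1: the model chooses a tool by name (agentic selection).
    pick = client.messages.create(
        model=MODEL, max_tokens=10,
        messages=[{"role": "user", "content":
            f"Available tools: {list(TOOLS)}. Question: {question}\n"
            "Reply with only the tool name."}],
    ).content[0].text.strip().strip("'\"`")
    # Step 2: answer the question grounded in the selected tool's output.
    answer = client.messages.create(
        model=MODEL, max_tokens=200,
        messages=[{"role": "user", "content":
            f"Context: {TOOLS[pick](question)}\nQuestion: {question}"}],
    ).content[0].text
    return pick, answer

def test_chain_retrieves_and_answers():
    tool, answer = run_chain("When is 'Launch v2' due?")
    assert tool == "search_tasks"   # the right tool was selected
    assert "2024-07-01" in answer   # retrieved data reached the final answer
```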
End-to-end testing at Asana uses realistic data in sandboxed test instances of Asana, mimicking the actual customer experience. While time-consuming, this method provides a deeper understanding of model performance in real-world scenarios and allows for the assessment of nuanced aspects of "intelligence" that are harder to quantify, such as writing style and "connecting the dots."
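Qualities like writing style resist exact assertions; one common pattern (an assumption here, not a description of Asana's harness) is to have a judge model score end-to-end output against a rubric:

```python
import anthropic

client = anthropic.Anthropic()

def rubric_score(answer: str, rubric: str) -> int:
    """Return a 1-5 judge score for the answer against a style rubric."""
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=3,
        messages=[{"role": "user", "content":
            f"Rubric: {rubric}\n\nAnswer:\n{answer}\n\n"
            "Score the answer against the rubric from 1 to 5. "
            "Reply with only the digit."}],
    ).content[0].text.strip()
    return int(reply)

# Example gate: fail an end-to-end run if style drops below a threshold.
# assert rubric_score(e2e_answer, "Concise, plain-English project update") >= 4
```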
Asana's testing process for pre-production models includes additional assessments of performance and capability. These include rapidly generating performance statistics, measuring time-to-first-token (TTFT) and tokens-per-second (TPS), as well as testing the model's agentic capabilities.
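As an illustration, TTFT and TPS can be measured with the Anthropic streaming API roughly as follows; Asana's actual benchmarking harness is not public, and counting streamed text chunks is only an approximation of tokens.

```python
import time
import anthropic

client = anthropic.Anthropic()

def measure(model: str, prompt: str) -> tuple[float, float]:
    """Return (TTFT seconds, streamed chunks per second) for one request."""
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with client.messages.stream(
        model=model, max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for _ in stream.text_stream:
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
            chunks += 1
    gen_time = time.perf_counter() - (first_chunk_at or start)
    ttft = (first_chunk_at or start) - start
    tps = chunks / gen_time if gen_time > 0 else 0.0
    return ttft, tps

ttft, tps = measure("claude-3-5-sonnet-20240620",
                    "Summarize the status of a project in two sentences.")
print(f"TTFT: {ttft:.2f}s  TPS (approx): {tps:.1f}")
```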
Asana partnered with Anthropic to test Anthropic's pre-release models, including Claude 3.5 Sonnet, running a series of tests to evaluate the model's performance, agentic reasoning, and answer quality.
Performance is a key factor for Asana's AI teammates, and Claude 3.5 Sonnet showed significant TTFT improvements over its predecessors, Claude 3 Sonnet and Claude 3 Opus. Its TTFT was competitive with the lowest-latency frontier models in Asana's testing set, translating into much faster response times and a better experience with Asana's AI features.
Claude 3.5 Sonnet also demonstrated a significant improvement in its ability to act as an agent, successfully executing longer and more complex workflows. Asana evaluated the model's agentic reasoning on its multi-agent prototyping platform, where it followed complex instructions and completed multi-step tasks.
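The kind of multi-step, tool-driven workflow evaluated here can be sketched with the Anthropic tool-use API; the single `create_task` tool and its schema below are hypothetical stand-ins, not part of Asana's platform.

```python
import anthropic

client = anthropic.Anthropic()
TOOLS = [{
    "name": "create_task",
    "description": "Create a task with a name and an optional due date.",
    "input_schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "due": {"type": "string"}},
        "required": ["name"],
    },
}]

def create_task(name: str, due: str = "") -> str:
    # Stub: a real harness would hit a sandboxed Asana instance instead.
    return f"Created task '{name}' due {due or 'unscheduled'}."

messages = [{"role": "user", "content":
             "Plan a launch: create a kickoff task and a review task."}]
for _ in range(8):  # cap iterations so a misbehaving run still terminates
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620", max_tokens=500,
        tools=TOOLS, messages=messages,
    )
    if reply.stop_reason != "tool_use":
        print(reply.content[0].text)  # the model considers the workflow done
        break
    messages.append({"role": "assistant", "content": reply.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": create_task(**block.input)}
        for block in reply.content if block.type == "tool_use"
    ]})
```

An evaluation harness can then score how many steps the model completes correctly before stopping, which is what distinguishes longer, more complex workflows from single-shot tasks.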
Asana's "smart answers" feature relies on LLMs to answer questions based on data accessible across the Asana organization. The tests conducted on Claude 3.5 Sonnet focused on evaluating its ability to extract key insights from long contexts and provide accurate and comprehensive answers.
Anthropic's Claude 3.5 Sonnet represents a significant advancement in their model offerings, showcasing improvements in performance, reasoning, and writing quality. Asana has implemented a rigorous testing process to ensure that these improvements translate to enhanced user experience and reliable insights for their customers. The future of work management is likely to be powered by advanced AI models, and Asana's dedication to testing ensures that they stay at the forefront of this evolving landscape.