Summary of How Asana tests frontier LLMs: our analysis of Claude 3.5 Sonnet

  • asana.com

    Asana's Methodology for LLM Testing

    Asana's commitment to leveraging the latest and most powerful AI models requires rigorous testing to ensure their features provide reliable insights for enterprise customers. Asana has built a robust and high-throughput Quality Assurance (QA) process, reflecting real-world collaborative work management use cases.

    • Their testing process involves various methods, including unit testing, integration testing, end-to-end testing, and additional tests for new models.

    Unit Testing

    Asana's LLM Foundations team has created an in-house unit testing framework that enables engineers to test LLM responses during development, similar to traditional unit testing. The framework allows for the testing of assertions through calls to LLMs, ensuring accurate responses. Asana uses this method for testing the model's ability to capture key details of a task, such as the launch date, and for its "needle-in-a-haystack" testing.

    • This approach ensures high test coverage and performance, allowing for rapid iteration and development.
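    Asana's actual framework is internal, but the pattern it describes — asserting properties of an LLM response by making a second, judging LLM call — can be sketched roughly as follows. All names here (`assert_llm`, `stub_judge`) are hypothetical, and a stub stands in for the real model call so the sketch is self-contained:

```python
# Sketch of an LLM unit-test assertion in the spirit of Asana's in-house
# framework (names and prompt wording are invented for illustration).
# An assertion about a model response is itself evaluated by a judge model
# that answers PASS or FAIL.

def assert_llm(response: str, assertion: str, judge) -> bool:
    """Ask a judge model whether `assertion` holds for `response`."""
    prompt = (
        "Given the response below, answer PASS if the assertion is true, "
        "otherwise FAIL.\n"
        f"Response: {response}\n"
        f"Assertion: {assertion}\n"
        "Answer:"
    )
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("PASS")

def stub_judge(prompt: str) -> str:
    # A real judge would be an LLM call; this stub just pattern-matches
    # so the example runs deterministically.
    return "PASS" if "March 3" in prompt else "FAIL"

result = assert_llm(
    response="The task's launch date is March 3.",
    assertion="The response states the launch date.",
    judge=stub_judge,
)
```

    In a real test suite, `stub_judge` would be replaced by a call to a production model, and the boolean result plugs directly into a conventional test runner.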

    Integration Testing

    Asana's AI-powered features often involve chaining multiple prompts, including tools selected agentically. Integration testing in Asana's LLM testing framework helps assess these chains before feature release, ensuring the ability to retrieve necessary data and generate accurate user-facing responses.

    • This type of testing is crucial for features that rely on multiple LLM interactions to provide complete functionality.
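    The shape of such an integration test can be sketched as below. The chain, tool registry, and scripted model are all hypothetical stand-ins: the point is that the test exercises the full wiring — tool selection, data retrieval, and final answer generation — rather than any single prompt in isolation:

```python
# Hypothetical sketch of integration-testing a two-step prompt chain:
# the first LLM call picks a tool agentically, the second drafts the
# user-facing answer from the tool's output.

def pick_tool(question: str, model) -> str:
    return model(f"Which tool answers: {question}? Reply with the tool name.")

def answer(question: str, tool_output: str, model) -> str:
    return model(f"Using {tool_output!r}, answer: {question}")

# Invented tool registry; a real one would query production data stores.
TOOLS = {"task_search": lambda q: "3 open tasks match"}

def run_chain(question: str, model) -> str:
    tool = pick_tool(question, model).strip()
    data = TOOLS[tool](question)  # raises KeyError if the wrong tool is chosen
    return answer(question, data, model)

def stub_model(prompt: str) -> str:
    # Scripted responses so the chain's wiring is testable deterministically.
    if prompt.startswith("Which tool"):
        return "task_search"
    return "There are 3 open tasks."

reply = run_chain("How many open tasks are there?", stub_model)
```

    A failure at either step — a bad tool choice or a malformed final answer — surfaces before the feature ships, which is exactly what chain-level testing is for.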

    End-to-End Testing

    End-to-end testing at Asana utilizes realistic data in sandboxed test instances of Asana, mimicking the actual customer experience. This method, while time-consuming, provides a deeper understanding of model performance in real-world scenarios and allows for the assessment of nuanced aspects of "intelligence" that are harder to quantify, such as writing style and "connecting the dots."

    • While automated unit and integration testing is invaluable for development speed, human evaluation remains crucial for assessing the overall quality of model output.

    Additional Tests for New Models

    Asana's testing process for pre-production models includes additional assessments of performance and capabilities: rapidly generating performance statistics such as TTFT (time-to-first-token) and TPS (tokens-per-second), and testing the model's agentic capabilities.

    • A quantitative benchmark on Asana's custom tool-use extractor is used to measure agentic reasoning, and qualitative testing is conducted using their internal multi-agent prototyping platform.
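    Both metrics fall out of timing a streaming response: TTFT is the delay before the first token arrives, and TPS is the token count divided by total elapsed time. A minimal sketch, with a fake generator standing in for a real model stream:

```python
import time

# Sketch of measuring TTFT (time-to-first-token) and TPS (tokens-per-second)
# from a streaming response. `fake_stream` simulates a model that pauses
# before its first token, then emits tokens at a steady rate.

def fake_stream(tokens, first_delay=0.05, gap=0.01):
    time.sleep(first_delay)  # model "thinking" before the first token
    for tok in tokens:
        yield tok
        time.sleep(gap)

def measure(stream):
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token just arrived
        count += 1
    total = time.monotonic() - start
    tps = count / total if total > 0 else 0.0
    return ttft, tps

ttft, tps = measure(fake_stream(["tok"] * 20))
```

    Swapping `fake_stream` for a real streaming API response turns this into a quick latency probe for any candidate model.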

    Evaluating Claude 3.5 Sonnet

    Asana partnered with Anthropic to test their pre-release models, including Claude 3.5 Sonnet. They performed a series of tests to evaluate the model's performance, agentic reasoning, and answer quality.

    Claude 3.5 Sonnet Performance

    Performance is a key factor for Asana's AI teammates, and Claude 3.5 Sonnet showed significant improvements over its predecessors, Claude 3 Sonnet and Claude 3 Opus, in terms of TTFT. The model's TTFT was found to be competitive with the lowest-latency frontier models in Asana's testing set. This improvement translates to a much faster response time for users, enhancing their experience with Asana's AI features.

    • Claude 3.5 Sonnet exhibited approximately 67% lower TTFT compared to Claude 3 Sonnet.
    • The model's TPS remained approximately equivalent to Claude 3 Sonnet.

    Claude 3.5 Sonnet Agentic Reasoning

    Claude 3.5 Sonnet demonstrated significant improvement in its ability to act as an agent, successfully executing longer and more complex workflows. Asana's multi-agent prototyping platform was used to evaluate the model's agentic reasoning, showcasing its ability to follow complex instructions and complete multi-step tasks.

    • The quantitative tool use benchmark showed a 90% success rate for Claude 3.5 Sonnet, compared to 76% for Claude 3 Sonnet.
    • Qualitative testing revealed that Claude 3.5 Sonnet performed like a true agent, following objectives to completion, unlike Claude 3 Opus, which sometimes short-circuited complex workflows.
    • Asana has switched their default AI workflows agent to Claude 3.5 Sonnet, recognizing its improved agentic reasoning and performance.

    Claude 3.5 Sonnet Answer Quality and Precision

    Asana's "smart answers" feature relies on LLMs to answer questions based on data accessible across the Asana organization. The tests conducted on Claude 3.5 Sonnet focused on evaluating its ability to extract key insights from long contexts and provide accurate and comprehensive answers.

    • Claude 3.5 Sonnet achieved the highest score on Asana's LLM unit testing framework, passing 78% of the tests, matching Claude 3 Opus and surpassing Claude 3 Sonnet's score of 59%.
    • Qualitative assessment, using real-world questions and data from Asana's organization, revealed that Claude 3.5 Sonnet excelled at articulating insights about complex topics, identifying risks, and highlighting key decisions. It provided more accurate and insightful answers compared to previous models.
    • Claude 3.5 Sonnet demonstrated a significant improvement in its ability to handle long contexts and extract relevant information, leading to more accurate and comprehensive answers.
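    The long-context checks described above — and the "needle-in-a-haystack" tests mentioned earlier — share a simple structure: bury one key fact in a large body of filler and verify the answer recovers it. A self-contained sketch, with a string search standing in for the real model call:

```python
# Hedged sketch of a needle-in-a-haystack check: a key fact (the "needle")
# is buried mid-way through a long context, and the test asks whether the
# model's answer recovers it. `ask_model` is a stub standing in for a real
# LLM call; names and data are invented for illustration.

def build_haystack(needle: str, filler: str, copies: int) -> str:
    docs = [filler] * copies
    docs.insert(copies // 2, needle)  # bury the needle mid-context
    return "\n".join(docs)

def ask_model(context: str, question: str) -> str:
    # Stub: a real test would send `context` and `question` to the model.
    for line in context.splitlines():
        if "launch" in line:
            return line
    return "I don't know."

haystack = build_haystack(
    needle="The launch date is March 3.",
    filler="Status update: no blockers this week.",
    copies=200,
)
answer = ask_model(haystack, "When is the launch?")
passed = "March 3" in answer
```

    Scaling `copies` up until retrieval degrades gives a rough picture of how far into a context a model can still find the needle.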

    Takeaways

    Anthropic's Claude 3.5 Sonnet represents a significant advancement in their model offerings, showcasing improvements in performance, reasoning, and writing quality. Asana has implemented a rigorous testing process to ensure that these improvements translate to enhanced user experience and reliable insights for their customers. The future of work management is likely to be powered by advanced AI models, and Asana's dedication to testing ensures that they stay at the forefront of this evolving landscape.

    • Asana's investment in robust QA enables them to rapidly evaluate frontier models and incorporate the best performers into their features.
    • Asana is committed to using the latest and most powerful models to enhance their AI teammates, providing customers with advanced AI workflows and a more efficient work experience.
