Summary of MLflow LLM Evaluate — MLflow 2.10.2 documentation

    Introduction to Evaluating ChatGPT and LLMs with MLflow

    With the emergence of ChatGPT and other large language models (LLMs), evaluating their performance has become crucial. MLflow provides a powerful API, mlflow.evaluate(), to help evaluate LLMs, including ChatGPT models, in a structured and efficient manner.

    • MLflow's LLM evaluation functionality consists of three main components:
      • A model to evaluate (MLflow model, URI, or Python callable)
      • Metrics (LLM metrics or custom metrics)
      • Evaluation data (Pandas DataFrame, NumPy array, or MLflow Dataset)
    • MLflow offers comprehensive notebook guides and examples showcasing the simplicity and power of its LLM evaluation capabilities.

    Quickstart: Evaluating a ChatGPT Question-Answering Model

    This quickstart example demonstrates how to evaluate a simple question-answering model built by wrapping the "openai/gpt-4" model with a custom prompt.

    import mlflow
    import openai
    import pandas as pd
    
    # Set up evaluation data with "inputs" and "ground_truth" columns
    eval_data = pd.DataFrame({...})
    
    # Log the GPT-4 model with MLflow
    with mlflow.start_run():
        logged_model_info = mlflow.openai.log_model(...)
    
    # Evaluate the model using pre-defined question-answering metrics
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    
    # Print evaluation results
    print(f"Aggregated results: \n{results.metrics}")
    print(f"Per-row results: \n{results.tables['eval_results_table']}")
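
    The elided pieces above depend on your data and model setup; a minimal sketch, assuming a two-column evaluation DataFrame and the mlflow.openai flavor with a chat-completion task (openai>=1.0), might look like the following:

    # Hypothetical evaluation data: an "inputs" column and a "ground_truth" column
    eval_data = pd.DataFrame(
        {
            "inputs": ["What is MLflow?", "What is Spark?"],
            "ground_truth": [
                "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
                "Apache Spark is an open-source, distributed computing system for large-scale data processing.",
            ],
        }
    )

    # Log GPT-4 wrapped with a custom prompt; "{question}" is filled from each row's input
    with mlflow.start_run():
        logged_model_info = mlflow.openai.log_model(
            model="gpt-4",
            task=openai.chat.completions,
            artifact_path="model",
            messages=[
                {"role": "system", "content": "Answer the following question in two sentences."},
                {"role": "user", "content": "{question}"},
            ],
        )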
    

    LLM Evaluation Metrics in MLflow

    MLflow offers two types of LLM evaluation metrics:

    • LLM-as-judge metrics, which score model outputs by prompting a judge LLM (Category 1 below).
    • Heuristic-based metrics, which compute per-row scores with deterministic functions such as ROUGE or Flesch-Kincaid grade level (Category 2 below).

    Selecting and Using LLM Evaluation Metrics

    There are two ways to select metrics for evaluating your ChatGPT or LLM model:

    • Use default metrics for pre-defined model types (e.g., model_type="question-answering").
    • Use a custom list of metrics by specifying the extra_metrics argument (a sketch follows this list).
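
    For example, a minimal sketch that keeps the default question-answering metrics and adds two of MLflow's built-in heuristic metrics (the metric choices here are illustrative):

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",  # pulls in the default QA metrics
        extra_metrics=[
            mlflow.metrics.latency(),     # per-row prediction latency
            mlflow.metrics.toxicity(),    # toxicity score from an open-source classifier
        ],
    )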

    Metrics with LLM as the Judge

    MLflow offers pre-canned metrics that use LLMs as judges, such as mlflow.metrics.genai.answer_similarity() and mlflow.metrics.genai.answer_correctness(). These metrics can be used in the extra_metrics argument of mlflow.evaluate().
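
    For instance, reusing the quickstart's logged model and data, a minimal sketch that adds two judge-based metrics:

    from mlflow.metrics.genai import answer_correctness, answer_similarity

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[
            answer_similarity(),   # judge compares each output to the ground truth
            answer_correctness(),  # judge checks factual correctness against the ground truth
        ],
    )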

    Selecting the LLM-as-Judge Model

    By default, LLM-as-judge metrics use openai:/gpt-4 as the judge. You can switch to a different judge by passing its identifier to the metric's model argument, including URIs for local deployments or Databricks endpoints.
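
    A minimal sketch of overriding the judge, assuming a Databricks-served Llama 2 chat endpoint is available:

    from mlflow.metrics.genai import answer_similarity

    # Use a Databricks endpoint as the judge instead of the default openai:/gpt-4
    answer_similarity_llama2_judge = answer_similarity(
        model="endpoints:/databricks-llama-2-70b-chat"
    )

    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[answer_similarity_llama2_judge],
    )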

    Preparing Your ChatGPT Model for Evaluation

    To evaluate your ChatGPT or LLM model with mlflow.evaluate(), it must be one of the following types:

    • An MLflow PyFuncModel instance or a URI pointing to a logged model.
    • A Python function that takes in string inputs and outputs a single string, matching the signature of mlflow.pyfunc.PyFuncModel.predict().

    Evaluating with an MLflow Model

    To evaluate your model as an MLflow model, log it with an MLflow model flavor (as in the quickstart above) and pass either the logged model's URI or a loaded mlflow.pyfunc.PyFuncModel instance to mlflow.evaluate().

    Evaluating with a Custom Function

    You can also evaluate a Python function without logging the model to MLflow by passing the function to mlflow.evaluate().
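
    A minimal sketch of this approach, assuming openai>=1.0 and an OPENAI_API_KEY in the environment (the prompt and model choice are illustrative):

    import mlflow
    import openai
    import pandas as pd

    client = openai.OpenAI()


    def openai_qa(inputs: pd.DataFrame) -> list[str]:
        # Matches the signature of mlflow.pyfunc.PyFuncModel.predict():
        # takes a DataFrame of inputs and returns one answer string per row.
        answers = []
        for _, row in inputs.iterrows():
            completion = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": row["inputs"]}],
            )
            answers.append(completion.choices[0].message.content)
        return answers


    with mlflow.start_run():
        results = mlflow.evaluate(
            openai_qa,
            eval_data,
            targets="ground_truth",
            model_type="question-answering",
        )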

    Evaluating with a Static Dataset

    For MLflow >= 2.8.0, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you have saved the model outputs in a Pandas DataFrame or MLflow PandasDataset and want to evaluate the static dataset without re-running the model.
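
    A minimal sketch, assuming the model outputs were previously saved in a "predictions" column:

    import mlflow
    import pandas as pd

    # Static dataset: questions, ground truth, and previously generated model outputs
    static_eval_data = pd.DataFrame(
        {
            "inputs": ["What is MLflow?"],
            "ground_truth": [
                "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle."
            ],
            "predictions": ["MLflow is an open-source MLOps platform."],
        }
    )

    with mlflow.start_run():
        results = mlflow.evaluate(
            data=static_eval_data,
            targets="ground_truth",
            predictions="predictions",  # column that holds the saved outputs; no model is passed
            model_type="question-answering",
        )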

    Viewing Evaluation Results

    View Evaluation Results via Code

    mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult instance. You can access the following:

    • metrics: Stores the aggregated results (e.g., average, variance) across the evaluation dataset.
    • tables["eval_results_table"]: Stores the per-row evaluation results.

    View Evaluation Results via the MLflow UI

    Your evaluation results are automatically logged to the MLflow tracking server, so you can also inspect the aggregated metrics and the per-row evaluation results table for each run directly in the MLflow UI.

    Creating Custom ChatGPT Evaluation Metrics

    Create LLM-as-Judge Evaluation Metrics (Category 1)

    You can create your own LLM-as-judge evaluation metrics (using a SaaS LLM such as OpenAI as the judge) with mlflow.metrics.genai.make_genai_metric(). This API requires the following information (a sketch follows the list below):

    • Name of the custom metric
    • Definition of the metric
    • Grading prompt describing the scoring criteria
    • Examples with scores as references for the LLM judge
    • Identifier of the LLM judge (e.g., "openai:/gpt-4" or "endpoints:/databricks-llama-2-70b-chat")
    • Additional parameters for the LLM judge (e.g., temperature)
    • Aggregation options for per-row scores
    • Indicator if a higher score is better
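
    Putting those pieces together, a sketch of a hypothetical "professionalism" metric (the definition, grading prompt, and graded example are illustrative only):

    from mlflow.metrics.genai import EvaluationExample, make_genai_metric

    # An illustrative graded example the judge can calibrate against
    professionalism_example = EvaluationExample(
        input="What is MLflow?",
        output="MLflow is like a friendly toolkit for wrangling your ML experiments, no big deal!",
        score=2,
        justification="The response uses casual language and filler phrases, so it scores low.",
    )

    professionalism = make_genai_metric(
        name="professionalism",
        definition="Professionalism measures whether the answer uses a formal, respectful tone.",
        grading_prompt=(
            "Score 1: the answer is written in very casual or unprofessional language. "
            "Score 5: the answer is consistently formal and professional."
        ),
        examples=[professionalism_example],
        model="openai:/gpt-4",              # identifier of the judge LLM
        parameters={"temperature": 0.0},    # additional parameters for the judge
        aggregations=["mean", "variance"],  # aggregation options for per-row scores
        greater_is_better=True,             # higher scores are better
    )

    The resulting metric can then be passed to mlflow.evaluate() through extra_metrics like any built-in metric.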

    Create Heuristic-based LLM Evaluation Metrics (Category 2)

    You can also create custom traditional metrics by implementing an eval_fn that defines your scoring logic and returns an mlflow.metrics.MetricValue instance. Then, pass the eval_fn and other arguments to mlflow.metrics.make_metric to create the metric.
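
    A minimal sketch of a hypothetical heuristic metric that scores each prediction by its word count:

    import numpy as np
    from mlflow.metrics import MetricValue, make_metric


    def word_count_eval_fn(predictions, targets, metrics):
        # Per-row scores plus aggregate results, wrapped in a MetricValue
        scores = [len(str(prediction).split()) for prediction in predictions]
        return MetricValue(
            scores=scores,
            aggregate_results={
                "mean": float(np.mean(scores)),
                "p90": float(np.percentile(scores, 90)),
            },
        )


    word_count = make_metric(
        eval_fn=word_count_eval_fn,
        greater_is_better=False,
        name="word_count",
    )

    Like the judge-based metrics, the resulting metric object is passed to mlflow.evaluate() via extra_metrics.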

    Additional Resources and Examples

    For more comprehensive guides, examples, and best practices on evaluating ChatGPT and LLMs with MLflow, refer to the notebook guides and examples in the MLflow LLM Evaluation documentation.
