With the emergence of ChatGPT and other large language models (LLMs), evaluating their performance has become crucial. MLflow provides a powerful API, mlflow.evaluate(), to help evaluate LLMs, including ChatGPT models, in a structured and efficient manner.

This quickstart example demonstrates how to evaluate a simple question-answering model built by wrapping the "openai/gpt-4" model with a custom prompt.
import mlflow
import openai
import pandas as pd

# Set up evaluation data: a DataFrame with an "inputs" column (questions)
# and a "ground_truth" column (reference answers).
eval_data = pd.DataFrame({...})

with mlflow.start_run():
    # Log the GPT-4 model with MLflow
    logged_model_info = mlflow.openai.log_model(...)

    # Evaluate the model using pre-defined question-answering metrics
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

    # Print evaluation results
    print(f"Aggregated results: \n{results.metrics}")
    print(f"Per-row results: \n{results.tables['eval_results_table']}")
MLflow offers two types of LLM evaluation metrics: heuristic-based metrics computed with traditional scoring functions (for example, exact match or ROUGE), and LLM-as-a-judge metrics that use another LLM to score the model's output.

There are two ways to select metrics for evaluating your ChatGPT or LLM model:

- Use the default metrics for a pre-defined task by setting the model_type argument (for example, model_type="question-answering").
- Pass a custom list of metrics through the extra_metrics argument.

MLflow offers pre-canned metrics that use LLMs as judges, such as mlflow.metrics.genai.answer_similarity() and mlflow.metrics.genai.answer_correctness(). These metrics can be used in the extra_metrics argument of mlflow.evaluate().
By default, LLM-as-judge metrics use openai:/gpt-4 as the judge. You can change the default judge model by passing an override to the model argument within the metric definition, including local deployments or Databricks endpoints.
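As a sketch, passing pre-canned LLM-as-judge metrics through extra_metrics with an overridden judge might look like the following (it reuses logged_model_info and eval_data from the quickstart above; the judge URI is illustrative):

import mlflow
from mlflow.metrics.genai import answer_correctness, answer_similarity

# The judge override below is illustrative; use whatever judge endpoint your setup supports.
similarity = answer_similarity(model="openai:/gpt-4o-mini")
correctness = answer_correctness(model="openai:/gpt-4o-mini")

with mlflow.start_run():
    results = mlflow.evaluate(
        logged_model_info.model_uri,   # model logged in the quickstart above
        eval_data,                     # DataFrame with "inputs" and "ground_truth"
        targets="ground_truth",
        model_type="question-answering",
        extra_metrics=[similarity, correctness],
    )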
To evaluate your ChatGPT or LLM model with mlflow.evaluate(), it must be one of the following types:

- An mlflow.pyfunc.PyFuncModel instance or a URI pointing to a logged model.
- A Python function whose signature matches mlflow.pyfunc.PyFuncModel.predict().

To evaluate your model as an MLflow model, log it (for example with mlflow.openai.log_model()) and pass the resulting model URI to mlflow.evaluate(), as shown in the quickstart above.
You can also evaluate a Python function without logging the model to MLflow by passing the function to mlflow.evaluate().
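A minimal sketch of this, with a hypothetical qa_fn standing in for a real LLM call:

import mlflow
import pandas as pd

def qa_fn(inputs: pd.DataFrame) -> list:
    # Hypothetical stand-in for an LLM call; returns one answer per input row.
    return ["MLflow is an open-source platform for the ML lifecycle."] * len(inputs)

eval_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": ["MLflow is an open-source platform for the ML lifecycle."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        qa_fn,                          # the function is passed directly, no logging needed
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )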
For MLflow >= 2.8.0, mlflow.evaluate() supports evaluating a static dataset without specifying a model. This is useful when you have saved the model outputs in a Pandas DataFrame or MLflow PandasDataset and want to evaluate the static dataset without re-running the model.
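A minimal sketch of static-dataset evaluation, assuming the model outputs were saved earlier in a "predictions" column:

import mlflow
import pandas as pd

# Outputs were generated earlier and saved; no model is run here.
static_data = pd.DataFrame(
    {
        "inputs": ["What is MLflow?"],
        "ground_truth": ["MLflow is an open-source platform for the ML lifecycle."],
        "predictions": ["MLflow is an open-source MLOps platform."],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        data=static_data,
        targets="ground_truth",
        predictions="predictions",      # column holding the saved model outputs
        model_type="question-answering",
    )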
mlflow.evaluate() returns the evaluation results as an mlflow.models.EvaluationResult instance. You can access the following:

- metrics: stores the aggregated results (e.g., average, variance) across the evaluation dataset.
- tables["eval_results_table"]: stores the per-row evaluation results.

Your evaluation results are automatically logged to the MLflow server, and you can view them directly in the MLflow UI.
You can create your own LLM-as-judge evaluation metrics, using a SaaS LLM (such as OpenAI) as the judge, with mlflow.metrics.genai.make_genai_metric(). This API requires information such as the metric's name, a definition of what it measures, a grading prompt for the judge, and optional scored examples.
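As a sketch, a hypothetical "conciseness" metric created with make_genai_metric() could look like this (the name, definition, grading prompt, and example below are illustrative, not MLflow built-ins):

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing the ML lifecycle.",
    score=5,
    justification="The answer is complete and contains no unnecessary sentences.",
)

conciseness = make_genai_metric(
    name="conciseness",
    definition="Measures whether the answer is brief while remaining complete.",
    grading_prompt=(
        "Score 1 if the answer is padded or rambling, up to 5 if every "
        "sentence is necessary and the answer is still complete."
    ),
    examples=[example],
    model="openai:/gpt-4",              # judge model
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance"],
    greater_is_better=True,
)

Like the pre-canned judge metrics, the resulting metric is passed to mlflow.evaluate() through extra_metrics.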
You can also create custom traditional metrics by implementing an eval_fn that defines your scoring logic and returns an mlflow.metrics.MetricValue instance. Then, pass the eval_fn and other arguments to mlflow.metrics.make_metric to create the metric.
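For example, a hypothetical word-overlap metric built this way might look like the following sketch:

from mlflow.metrics import MetricValue, make_metric

def word_overlap_eval_fn(predictions, targets, metrics):
    # Score each row by the fraction of ground-truth words that appear in the prediction.
    scores = []
    for pred, target in zip(predictions, targets):
        target_words = set(str(target).lower().split())
        pred_words = set(str(pred).lower().split())
        scores.append(len(target_words & pred_words) / max(len(target_words), 1))
    return MetricValue(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )

word_overlap = make_metric(
    eval_fn=word_overlap_eval_fn,
    greater_is_better=True,
    name="word_overlap",
)

As with any other metric, word_overlap can then be passed to mlflow.evaluate() via extra_metrics.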
For more comprehensive guides, examples, and best practices on evaluating ChatGPT and LLMs using MLflow, refer to the MLflow LLM Evaluation documentation.