
Evaluating large language models in business


The overview box at the top offers a quick summary: it displays the inputs (the competing models and the evaluation dataset) and the overall result, known as the win rate. The win rate, expressed as a percentage, indicates how often the Health Snacks LLM provided better summaries than Model B. To help you understand the win rate in more depth, we also show the row-based results, accompanied by the autorater’s explanations and confidence scores. By comparing the results and explanations with your expectations, you can gauge the autorater’s accuracy for your use case. For each input prompt, the responses from the Health Snacks LLM and Model B are listed. The response marked green with a winner icon in the top-left corner is the better summary according to the autorater. The explanation provides the rationale behind the selection, and the confidence score indicates the autorater’s certainty in its choice, based on self-consistency.
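To make the relationship between the row-based results and the win rate concrete, here is a minimal sketch (not the service implementation) of how a win rate can be derived from per-row autorater judgments. The class, field, and model names are illustrative assumptions.

```python
# A minimal sketch, assuming a simple in-memory record of autorater judgments.
# Names and fields are illustrative, not the service's actual data model.
from dataclasses import dataclass


@dataclass
class Judgment:
    prompt: str
    winner: str        # "model_a" or "model_b", as chosen by the autorater
    explanation: str   # the autorater's rationale for the choice
    confidence: float  # self-consistency-based certainty in [0, 1]


def win_rate(judgments: list[Judgment], candidate: str = "model_a") -> float:
    """Fraction of rows where the candidate model's response was preferred."""
    wins = sum(1 for j in judgments if j.winner == candidate)
    return wins / len(judgments)


judgments = [
    Judgment("Summarize doc 1", "model_a", "More faithful to the source.", 0.92),
    Judgment("Summarize doc 2", "model_b", "Covers the key deadline.", 0.71),
    Judgment("Summarize doc 3", "model_a", "More concise, no omissions.", 0.88),
]
print(f"Win rate for model_a: {win_rate(judgments):.0%}")  # 67%
```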

In addition to the explanations and confidence scores, you can choose to calibrate the autorater to gain greater confidence in the accuracy and reliability of the overall evaluation process. To do this, you provide human-preference data directly to AutoSxS, which then outputs alignment aggregate statistics. These statistics measure the agreement between the autorater and the human-preference data while accounting for chance agreement (Cohen’s Kappa).
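As a worked illustration of the alignment check described above, the sketch below computes Cohen’s Kappa between autorater choices and human-preference labels, correcting raw agreement for the agreement expected by chance. The labels and data are illustrative only.

```python
# A minimal sketch of Cohen's Kappa: (observed - expected) / (1 - expected),
# where "expected" is the chance agreement implied by each rater's label mix.
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


autorater = ["model_a", "model_a", "model_b", "model_a", "model_b"]
humans    = ["model_a", "model_b", "model_b", "model_a", "model_b"]
print(f"Cohen's Kappa: {cohens_kappa(autorater, humans):.2f}")  # 0.62
```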

While incorporating human evaluation is crucial, collecting human evaluations remains one of the most time-consuming tasks in the entire evaluation process. That is why we decided to integrate human review as the third family of methods in the Gen AI Evaluation framework, and why we partner with third-party data-centric providers such as LabelBox to help you easily access human evaluations across a wide range of tasks and criteria.

In conclusion, the Gen AI Evaluation Service provides a rich set of methods, accessible through diverse channels (online and offline), that you can use with any LLM to build a customized evaluation framework for efficiently assessing your GenAI application.
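To make the online flow concrete, here is a minimal sketch of an evaluation run with the Vertex AI SDK. It assumes the preview evaluation module (`vertexai.preview.evaluation.EvalTask`) and uses illustrative metric names, dataset columns, and project IDs; the exact names depend on your SDK version, so treat them as assumptions and refer to the documentation and notebooks linked below.

```python
# A minimal sketch, not a definitive recipe: the module path, metric names, and
# expected dataset columns are assumptions and may differ by SDK version.
import pandas as pd
import vertexai
from vertexai.preview.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # hypothetical project

# Bring your own responses: evaluate pre-generated outputs from any LLM.
eval_dataset = pd.DataFrame(
    {
        "prompt": [
            "Summarize the premium statement for policy 123.",
            "Summarize the renewal deadlines in this document.",
        ],
        "response": [
            "Pre-generated summary from the model under test ...",
            "Another pre-generated summary ...",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["fluency", "coherence"],  # model-based pointwise metrics (names assumed)
    experiment="genai-eval-demo",      # hypothetical experiment name
)

result = eval_task.evaluate()   # or pass model=... to generate responses in-line
print(result.summary_metrics)   # aggregate scores
print(result.metrics_table)     # row-level scores and explanations
```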

How Generali Italia used the Gen AI Evaluation Service to productionize a RAG-based LLM application

Generali Italia, a leading insurance provider in Italy, was one of the first users of our Gen AI Evaluation Service. As Stefano Frigerio, Head of Technical Leads, Generali Italia, says: 

“The model eval was the key to success in order to put a LLM in production. We couldn’t afford a manual check and refinement in a non-static ecosystem.”

Similar to other insurance companies, Generali Italia produces diverse documents, including policy statements, premium statements with detailed explanations and deadlines, and more. Generali Italia created a GenAI application leveraging retrieval-augmented generation (RAG) technology. This innovation empowers employees to interact with documents conversationally, expediting information retrieval. However, the application’s successful deployment would not have been possible without a robust framework to assess both its retrieval and generative functions. The team started by defining the dimensions of performance that mattered to them and used the Gen AI Evaluation Service to measure performance against their baseline.

According to Ivan Vigorito, Tech Lead Data Scientist at Generali Italia, the company decided to use the Vertex Gen AI Evaluation Service for several reasons. The AutoSxS evaluation method gave Generali Italia access to autoraters that emulate human ratings in assessing the quality of LLM responses, reducing the need for manual evaluation and saving both time and resources. Furthermore, the Gen AI Evaluation Service allowed the team to perform evaluations based on predefined criteria, making the evaluation process more objective. Thanks to explanations and confidence scores, the service made model performance understandable and pointed the team toward avenues for improving their application. Finally, the Vertex Gen AI Evaluation Service helped the Generali team evaluate any model by using pre-generated or external predictions, a feature that has been particularly useful for comparing outputs from models not hosted on Vertex AI.
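To illustrate that last point, here is a hedged sketch of how pre-generated predictions from two models, one of them not hosted on Vertex AI, might be packaged as a JSONL evaluation dataset for a pairwise comparison. The field names below are hypothetical placeholders; consult the AutoSxS documentation for the exact schema and column-mapping parameters.

```python
# A hedged sketch of packaging pre-generated predictions for a pairwise
# comparison. Field names are hypothetical, not the actual AutoSxS schema.
import json

rows = [
    {
        "prompt": "Summarize the premium statement for policy 123.",
        "response_a": "Output from a Vertex AI-hosted model ...",
        "response_b": "Output from an externally hosted model ...",
    },
]

with open("pairwise_eval_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```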

According to Dominico Vitarella, Tech Lead Machine Learning Engineer at Generali Italia, the Gen AI Evaluation Service not only saves the team time and effort on evaluation; because it is seamlessly embedded within Vertex AI, it also gives the team access to a comprehensive platform for training, deploying, and evaluating both generative and predictive applications.

Conclusion

Are you struggling to evaluate your GenAI application?

The Gen AI Evaluation Service empowers you to build an evaluation system that works for your specific use case, with a set of quality-controlled and explainable methods available in both online and offline modes. It helps you find the right model for your task, iterate faster, and deploy with confidence.

What’s next 

Do you want to know more about the Gen AI Evaluation Service on Vertex AI? Check out the following resources:

Documentation

Notebooks

Videos and Blog posts

