Amazon Aims to Improve AI Model Evaluation and Promote Human Involvement
At the AWS re:Invent conference, Swami Sivasubramanian, the AWS vice president in charge of database, analytics, and machine learning, unveiled Model Evaluation on Bedrock. The new feature, currently in preview, applies to the models available through Amazon Bedrock, AWS’s managed foundation model service. Without a transparent way to test models, developers risk choosing one that is not accurate enough for a given project, such as a question-and-answer system, or one that is far larger than the use case requires.
Sivasubramanian stressed the iterative nature of model selection and evaluation, stating that “Model evaluation is an ongoing process rather than a one-time event.” Acknowledging the importance of human involvement in this process, AWS is offering a streamlined approach to handle human evaluation workflows and monitor model performance metrics.
In a conversation with The Verge, Sivasubramanian noted that developers often opt for larger, more powerful models on the assumption that they will meet their needs, only to realize later that a smaller model would have sufficed.
The Model Evaluation feature consists of two main components: automated evaluation and human evaluation. With the automated option, developers open the Bedrock console, choose a model to assess, and evaluate its performance on metrics such as robustness, accuracy, or toxicity for tasks like summarization, text classification, question answering, and text generation. Notably, Bedrock offers well-known third-party AI models such as Meta’s Llama 2, Anthropic’s Claude 2, and Stability AI’s Stable Diffusion.
While AWS provides standardized test datasets, customers can also bring their own proprietary data into the benchmarking platform to gain deeper insight into how a model performs. Once the evaluation completes, a detailed report is generated.
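For teams that prefer scripting to clicking through the console, the same automated workflow can be driven programmatically. Below is a minimal sketch using the boto3 Bedrock control-plane client’s create_evaluation_job call; the job name, IAM role, S3 locations, model choice, and the exact shape of the nested configuration are illustrative assumptions based on the preview-era API and may differ in practice.

```python
import boto3

# Bedrock control-plane client (not bedrock-runtime, which handles inference).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# NOTE: the job name, role ARN, S3 URIs, metric names, and nested field names
# below are illustrative assumptions, not confirmed values from the article.
response = bedrock.create_evaluation_job(
    jobName="qa-model-eval-demo",  # hypothetical job name
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    # A proprietary dataset uploaded to S3; Bedrock also
                    # offers built-in test datasets per task type.
                    "dataset": {
                        "name": "my-qa-dataset",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/eval/qa.jsonl"},
                    },
                    # Metrics called out in the announcement.
                    "metricNames": [
                        "Builtin.Accuracy",
                        "Builtin.Robustness",
                        "Builtin.Toxicity",
                    ],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)

print(response["jobArn"])  # track the job; the report lands in the S3 output location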
When human evaluators are involved, customers can work with either the AWS human evaluation team or their own in-house team. They need to specify the task type (e.g., summarization or text generation), the evaluation metrics, and the dataset to be used. AWS provides customized pricing and timelines for customers who engage its evaluation team.
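For the in-house option, the same job-creation call can be sketched with a human configuration instead of an automated one. Everything below, including the work-team flow definition ARN, reviewer instructions, custom metric names, rating method, and dataset paths, is an illustrative assumption about how such a job might be described, not a confirmed specification.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Assumed structure for a human evaluation job pointing at an in-house review team.
# All ARNs, names, and nested field names here are illustrative placeholders.
response = bedrock.create_evaluation_job(
    jobName="summarization-human-eval-demo",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "human": {
            # Points the job at a private workforce / review workflow.
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-east-1:123456789012:flow-definition/my-review-team",
                "instructions": "Rate each summary for faithfulness and tone.",
            },
            # Subjective qualities that automated checks miss.
            "customMetrics": [
                {
                    "name": "Empathy",
                    "description": "Does the response acknowledge the user's situation?",
                    "ratingMethod": "IndividualLikertScale",
                },
                {
                    "name": "Friendliness",
                    "description": "Is the tone warm and helpful?",
                    "ratingMethod": "IndividualLikertScale",
                },
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "support-tickets",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/eval/tickets.jsonl"},
                    },
                    "metricNames": ["Empathy", "Friendliness"],
                }
            ],
        }
    },
    inferenceConfig={
        "models": [
            {"bedrockModel": {"modelIdentifier": "meta.llama2-70b-chat-v1"}},
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}},
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/human-results/"},
)
```

In this sketch, two candidate models are evaluated side by side so reviewers can compare their outputs on the same dataset; customers engaging the AWS evaluation team instead would arrange the workflow, pricing, and timeline directly with AWS.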
Vasi Philomin, the AWS vice president overseeing generative AI, emphasized the importance of understanding model performance to make more informed development choices. This approach not only helps companies evaluate whether models adhere to responsible AI standards, such as limits on toxicity, but also aids in selecting the most appropriate model for their requirements.
Additionally, Sivasubramanian highlighted that human evaluators can judge qualities that automated systems cannot capture, such as empathy and friendliness.
AWS clarified that while the benchmarking service is in its preview phase, customers will only be charged for the model inference utilized during the evaluation process.
Even though there is no universal benchmarking standard for AI models, certain metrics are widely accepted across different industries. Philomin emphasized that the main goal of benchmarking on Bedrock is not to conduct extensive model evaluations but rather to provide companies with a way to assess a model’s impact on their specific projects.