Main Concept
With Amazon Bedrock you can also use human workers to provide input to the model evaluation process. They can be an AWS-managed work team, or you can bring your own work team, meaning employees of your company or a group of Subject-Matter Experts (SMEs) from your industry. Instead of evaluating automatically or using a judge model, the workers review the benchmark answers and the generated answers and decide whether they are correct.
Context
Models in certain industries should be evaluated by real humans, especially in critical domains (healthcare, legal, etc.) where technical metrics are not enough. Humans can catch details and context that automated processes miss.
Key Aspects
- Evaluation Methods
- Thumbs up/down (binary)
- Ranking (comparing multiple responses)
- Detailed scoring/rubrics (more granular evaluation)
- Task types:
- You can choose from built-in task types (similar to automatic evaluation)
- You can also define your own custom task (since, in the end, a human is evaluating it)
- Advantages:
- Humans detect nuance and context
- They can detect bias and fairness issues
- They can validate responses from a business perspective
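The evaluation methods and task types above map onto the configuration of a Bedrock evaluation job. Below is a minimal sketch of how such a configuration could be built with Python for use with boto3's `create_evaluation_job`. The field names (`humanWorkflowConfig`, `customMetrics`, rating methods like `ThumbsUpDown`) reflect my understanding of the CreateEvaluationJob API, and all ARNs, bucket names, and metric names are placeholders; verify the exact request shape against the current AWS documentation.

```python
# Sketch: building the evaluationConfig for a human-based Bedrock
# model evaluation job. All ARNs/URIs below are placeholders.

def build_human_eval_config(flow_definition_arn: str, dataset_s3_uri: str) -> dict:
    """Assemble the 'human' evaluation config for create_evaluation_job."""
    return {
        "human": {
            # Points at a SageMaker Ground Truth flow definition that
            # describes the work team (AWS-managed or your own) and worker UI.
            "humanWorkflowConfig": {
                "flowDefinitionArn": flow_definition_arn,
                "instructions": "Compare each response with the reference answer.",
            },
            # Rating methods correspond to the evaluation methods listed
            # above: thumbs up/down, ranking, or Likert-style scoring.
            "customMetrics": [
                {
                    "name": "Correctness",
                    "description": "Is the answer factually correct?",
                    "ratingMethod": "ThumbsUpDown",
                },
                {
                    "name": "Helpfulness",
                    "description": "Rate how helpful the answer is.",
                    "ratingMethod": "IndividualLikertScale",
                },
            ],
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",  # a built-in task type
                    "dataset": {
                        "name": "my-prompts",
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    "metricNames": ["Correctness", "Helpfulness"],
                }
            ],
        }
    }

config = build_human_eval_config(
    "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/my-team",
    "s3://my-bucket/prompts.jsonl",
)
# This dict would then be passed as evaluationConfig to
# bedrock.create_evaluation_job(jobName=..., roleArn=..., evaluationConfig=config,
#                               inferenceConfig=..., outputDataConfig=...)
```

Building the dict separately from the API call makes the rating methods and task types easy to inspect and reuse before submitting the job.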
Reflection Questions
- Does this require more time than automatic evaluation?
- Is this more expensive because you need human labor costs?
Conclusions
- Although automated evaluation helps us conduct a faster preliminary assessment of a model, for certain kinds of cases it cannot substitute for human judgment. On the other hand, adding a human to the loop makes evaluation slower and introduces subjectivity. It's a matter of trade-offs and depends on the use case and the results we need.