Patronus AI’s Judge-Image wants to keep AI honest — and Etsy is already using it

Patronus AI has unveiled a multimodal large language model-as-a-judge (MLLM-as-a-Judge), a tool designed to evaluate AI systems that interpret images and generate text. The evaluation technology aims to detect and address hallucinations and reliability issues in multimodal AI applications.
One of the early adopters of this technology is the popular e-commerce platform Etsy. Etsy has integrated the MLLM-as-a-Judge tool to verify the accuracy of image captions for products listed on its marketplace, which features a wide array of handmade and vintage goods.
Anand Kannappan, the cofounder of Patronus AI, expressed his excitement about Etsy being one of the company’s initial customers. He emphasized the importance of ensuring that auto-generated image captions are correct for Etsy’s global user base, given the millions of products listed on its platform.
Patronus chose to build its first MLLM-as-a-Judge, named Judge-Image, on Google’s Gemini model after comparing it with alternatives such as OpenAI’s GPT-4V. The company found that Gemini exhibited less bias and a more equitable approach to judging different input-output pairs, making it the preferred choice for its evaluation tool.
The Judge-Image tool offers evaluators that assess image captions based on various criteria, including hallucination detection, object recognition, object location accuracy, and text analysis. This tool is not limited to e-commerce applications and can benefit marketing teams, law firms, and other enterprises that deal with document processing.
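To make the idea of per-criterion evaluation concrete, here is a minimal Python sketch of how verdicts on a single caption might be recorded and summarized. It is an illustration only, not Patronus's API: the names `Criterion`, `Verdict`, and `summarize` are hypothetical, the verdicts are hand-written rather than produced by a judge model, and the rule that a caption passes only when every criterion passes is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Criterion(Enum):
    # The four criteria named in the article; identifiers are illustrative.
    HALLUCINATION = "hallucination detection"
    OBJECT_RECOGNITION = "object recognition"
    OBJECT_LOCATION = "object location accuracy"
    TEXT_ANALYSIS = "text analysis"


@dataclass(frozen=True)
class Verdict:
    # One judge decision for one criterion (hypothetical structure).
    criterion: Criterion
    passed: bool
    explanation: str = ""


def summarize(verdicts):
    """Return (overall_pass, failed_criteria) for one caption.

    Assumed rule: the caption passes only if every criterion passes.
    """
    failed = [v.criterion for v in verdicts if not v.passed]
    return (len(failed) == 0, failed)


# Hand-written example verdicts; a real system would get these from a judge model.
verdicts = [
    Verdict(Criterion.HALLUCINATION, False,
            "caption mentions a handle not visible in the image"),
    Verdict(Criterion.OBJECT_RECOGNITION, True),
    Verdict(Criterion.OBJECT_LOCATION, True),
    Verdict(Criterion.TEXT_ANALYSIS, True),
]
overall, failed = summarize(verdicts)
print(overall, [c.value for c in failed])
```

Keeping each criterion as a separate verdict, rather than a single score, lets a reviewer see which kind of error caused a caption to be rejected.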
As AI continues to play a crucial role in businesses, many companies face the decision of whether to build or buy evaluation tools. Kannappan highlighted the strategic and economic advantages of outsourcing AI evaluation, particularly for complex multimodal systems where failures can occur at multiple stages.
Patronus offers several pricing tiers, starting with a free option for users to experiment with the platform. Customers can pay for evaluator usage based on their needs or opt for enterprise arrangements with custom features and pricing. Despite using Google’s Gemini model, Patronus positions itself as complementary rather than competitive with foundation model providers like Google and OpenAI.
Looking ahead, Patronus plans to expand its evaluation capabilities beyond images into audio assessment. This aligns with the company’s vision of scalable oversight for multimodal AI systems, aiming to develop tools that can keep pace with the increasing sophistication of AI technologies.
In a rapidly evolving landscape of AI deployment, the need for specialized tools that can impartially assess complex multimodal AI systems is becoming increasingly apparent. Patronus is at the forefront of providing these digital judges, which may prove invaluable in ensuring the accuracy and reliability of AI systems in various industries.