Tencent improves testing creative AI models with new benchmark
Getting it right, the way a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM) to act as a judge.
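The three-step flow described above can be sketched in code. This is purely illustrative: ArtifactsBench's real interfaces are not described in the article, so the function names (`run_in_sandbox`, `capture_screenshots`) and the `Evidence` bundle are hypothetical stand-ins, stubbed here so the sketch runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """The bundle handed to the MLLM judge."""
    task_prompt: str              # the original creative task
    generated_code: str           # the AI's answer
    screenshots: list = field(default_factory=list)  # frames captured over time

def run_in_sandbox(code: str) -> dict:
    # Stub: stands in for building and running the code in isolation.
    return {"code": code}

def capture_screenshots(sandbox: dict, count: int) -> list:
    # Stub: stands in for periodic screen capture of the running artifact.
    return [f"frame-{i}" for i in range(count)]

def evaluate(task_prompt: str, generated_code: str) -> Evidence:
    """Build, run, and observe the artifact, then bundle the evidence."""
    sandbox = run_in_sandbox(generated_code)
    frames = capture_screenshots(sandbox, count=3)
    return Evidence(task_prompt, generated_code, frames)
```

The key design point the article highlights is the middle step: judging from screenshots taken over time, rather than from the code alone, is what lets the framework see dynamic behaviour.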
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
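A minimal sketch of that checklist aggregation, under stated assumptions: the article names only functionality, user experience, and aesthetics among the ten metrics, so the remaining metric names and the 0–10 scale below are invented for illustration.

```python
# Assumed metric names; only the first three are mentioned in the article.
METRICS = [
    "functionality", "user_experience", "aesthetics", "robustness",
    "interactivity", "responsiveness", "code_quality", "completeness",
    "accessibility", "performance",
]

def overall_score(checklist: dict) -> float:
    """Average the ten per-metric scores (assumed 0-10) into one task score."""
    missing = [m for m in METRICS if m not in checklist]
    if missing:
        # Enforcing the full checklist is what keeps scoring consistent
        # across tasks, per the article's claim.
        raise ValueError(f"judge must score every metric, missing: {missing}")
    return sum(checklist[m] for m in METRICS) / len(METRICS)
```

Requiring every metric to be filled in, rather than letting the judge free-associate, is one plausible way a per-task checklist keeps automated scores comparable.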
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a huge jump from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
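One simple way to compute the kind of ranking consistency quoted above is pairwise agreement: the share of model pairs that two leaderboards order the same way. This is a hedged sketch of that idea; the article does not say which consistency metric ArtifactsBench actually uses.

```python
from itertools import combinations

def pairwise_agreement(ranking_a: list, ranking_b: list) -> float:
    """Fraction of model pairs ordered identically by both rankings.

    Both inputs are lists of the same model names, best first.
    """
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agree / len(pairs)
```

Identical rankings score 1.0; a single swapped adjacent pair among three models drops agreement to 2/3, which shows how a figure like 94.4% reflects near-identical orderings.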
Source: <a href="https://www.artificialintelligence-news.com/">https://www.artificialintelligence-news.com/</a>
28 July 2025, 11:40