Yesterday, ARC Prize launched the third version of its benchmark. The tasks look like visual puzzles on colored grids: you need to spot the pattern and apply it. Humans handle them without trouble, while the best LLMs and agents still can't score even 1%.
On the one hand, we constantly hear that AGI is already here. On the other hand, we see tests where the gap between humans and machines remains huge. I think benchmarks like this matter: they show the real limits of current models and tell their creators where to improve. The only downside is that developers often start optimizing LLMs to rank higher on the leaderboard rather than to benefit people.
There is also a $2 million prize pool for anyone who builds an agent that can pass the test.
By the way, you can take the test yourself on the website. It all looks like a '90s computer game. Seven steps in, and you'll understand why it's easy for humans.