<p>Researchers from METR have shown that the longer a task takes, the more likely an AI agent is to fail at it. The probability of success decreases exponentially with task length. Each model has its own "half-life": the task length at which its probability of success drops to 50%. For Claude 3.7 Sonnet, this is 59 minutes; the task length at which it still succeeds 80% of the time is 15 minutes.</p>
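<p>To make the "half-life" idea concrete, here is a minimal sketch that assumes a constant failure rate per minute of task time, so success halves with every additional half-life of work. The function names <code>success_probability</code> and <code>horizon_for</code> are illustrative and not taken from the paper; only the 59-minute figure comes from the text above.</p>

```python
import math

def success_probability(task_minutes, half_life_minutes):
    """Predicted chance of finishing a task under a constant-failure-rate model."""
    # Success halves each time the task grows by one half-life.
    return 0.5 ** (task_minutes / half_life_minutes)

def horizon_for(target_success, half_life_minutes):
    """Longest task (in minutes) the model predicts at the given success rate."""
    # Solve 0.5 ** (t / half_life) = target_success for t.
    return half_life_minutes * math.log(target_success) / math.log(0.5)

HALF_LIFE = 59  # minutes, the Claude 3.7 Sonnet figure quoted above

for minutes in (15, 59, 120, 480):
    p = success_probability(minutes, HALF_LIFE)
    print(f"{minutes:4d} min task: {p:.0%} predicted success")

print(f"80% horizon under this model: {horizon_for(0.8, HALF_LIFE):.0f} min")
```

<p>Under this simplified model the 80% horizon is roughly a third of the half-life, which is the same order of magnitude as the 15 minutes quoted above. The quoted figure is an empirical estimate, so an exact match is not expected.</p>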
<p>Importantly, this "half-life" makes it possible to estimate when agents will reach a desired level of reliability. For example, completing a 1-hour task with 99% success would only become feasible in about 4 years, assuming the current pace of progress (capabilities doubling every 7 months) continues.</p>
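<p>The arithmetic behind such a forecast can be sketched as follows. This assumes the exponential-decay model holds and that "doubling capabilities" means the half-life itself doubles every 7 months; the helper name <code>years_until</code> is hypothetical, and the inputs are the figures quoted above.</p>

```python
import math

def years_until(task_minutes, target_success, current_half_life_minutes, doubling_months):
    """Years until a task of the given length is predicted to reach the target success rate."""
    # Half-life needed so that 0.5 ** (task / half_life) equals the target success rate.
    required_half_life = task_minutes * math.log(0.5) / math.log(target_success)
    # Number of doublings of the current half-life needed to get there.
    doublings = math.log2(required_half_life / current_half_life_minutes)
    return doublings * doubling_months / 12

# 1-hour task, 99% success, 59-minute current half-life, doubling every 7 months
estimate = years_until(60, 0.99, 59, 7)
print(f"{estimate:.1f} years")
```

<p>With these inputs the half-life has to grow by a factor of about 70, or roughly six doublings, which puts the estimate between three and four years, in the same ballpark as the figure cited above.</p>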
<p>Interestingly, humans do not fit this model in the same way: they are less prone to "exponential decay" on long tasks. This suggests a key difference in strategy, namely that humans have self-correction mechanisms that current agents lack.</p>
<p>A key takeaway: until agents learn to "correct themselves on the fly," their ability to scale to long tasks is limited.</p>
📝 Paper: https://arxiv.org/abs/2503.14499
