Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post; there's no quota for posting and the bar really isn't that high.
The post-Xitter web has spawned so many "esoteric" right-wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality-challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this.)
Innocuous-looking paper, vaguely snake-oil scented: Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
Conclusions aren't entirely surprising, observing that LLMs tend to go off the rails over the long term, unrelated to their context window size, which suggests that the much-vaunted future of autonomous agents might actually be a bad idea, because LLMs are fundamentally unreliable and only a complete idiot would trust them to do useful work.
What's slightly more entertaining are the transcripts.
You tell 'em, Claude. I'm happy for you to send these sorts of messages backed by my credit card. The future looks awesome!
I got around to reading the paper in more detail and the transcripts are absurd and hilarious:
And this is from Claude 3.5 Sonnet, which performed best on average out of all the LLMs tested. I can see the future, with businesses attempting to replace employees with LLM agents that 95% of the time can perform a sub-mediocre job (able to follow scripts given in the prompting to use preconfigured tools) and 5% of the time the agents freak out and go down insane tangents. Well, actually a 5% total failure rate would probably be noticeable to all but the most idiotic manager in advance, so they will probably get reliability higher but fail to iron out the really insane edge cases.
Yeah, a lot of the word choices and tone make me think snake oil (just from the introduction: "They are now on the level of PhDs in many academic domains"... no, actually LLMs are only PhD level at artificial benchmarks that play to their strengths and cover up their weaknesses).
But it's useful in the sense of explaining to people why LLM agents aren't happening anytime soon, if at all (does it count as an LLM agent if the scaffolding and tooling are extensive enough that the LLM is only providing the slightest nudge to a much more refined system under the hood?). OTOH, if this "benchmark" does become popular, the promptfarmers will probably get their LLMs to pass it with methods that don't actually generalize, like loads of synthetic data designed around the benchmark and fine-tuning on the benchmark.
I came across this paper in a post on the Claude Plays Pokemon subreddit. I don't know how anyone can watch Claude Plays Pokemon and think AGI or even LLM agents are just around the corner. Even with extensive scaffolding and some tools to handle the trickiest bits (pre-labeling the screenshots so the vision portion of the model has a chance, directly reading the current state of the team and location from RAM), it still plays far, far worse than a 7-year-old, provided the 7-year-old can read at all (and numerous Pokemon guides and discussions are in the pretraining, so it has yet another advantage over the 7-year-old).