Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful youāll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cutānāpaste it into its own post ā thereās no quota for posting and the bar really isnāt that high.
The post Xitter web has spawned soo many āesotericā right wing freaks, but thereās no appropriate sneer-space for them. Iām talking redscare-ish, reality challenged āculture criticsā who write about everything but understand nothing. Iām talking about reply-guys who make the same 6 tweets about the same 3 subjects. Theyāre inescapable at this point, yet I donāt see them mocked (as much as they should be)
Like, there was one dude a while back who insisted that women couldnāt be surgeons because they didnāt believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I canāt escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this.)
Yeah a lot of word choices and tone makes me think snake oil (just from the introduction: "They are now on the level of PhDs in many academic domains "⦠no actually LLMs are only PhD level at artificial benchmarks that play to their strengths and cover up their weaknesses).
But itās useful in the sense of explaining to people why LLM agents arenāt happening anytime soon, if at all (does it count as an LLM agent if the scaffolding and tooling are extensive enough that the LLM is only providing the slightest nudge to a much more refined system under the hood). OTOH, if this ābenchmarkā does become popular, the promptfarmers will probably get their LLMs to pass this benchmark with methods that donāt actually generalize like loads of synthetic data designed around the benchmark and fine tuning on the benchmark.
I came across this paper in a post on the Claude Plays Pokemon subreddit. I donāt know how anyone can watch Claude Plays Pokemon and think AGI or even LLM agents are just around the corner, even with extensive scaffolding and some tools to handle the trickiest bits (pre-labeling the screenshots so the vision portion of the models have a chance, directly reading the current state of the team and location from RAM) it still plays far far worse than a 7 year old provided the 7 year old can read at all (and numerous Pokemon guides and discussion are in the pretraining so it has yet another advantage over the 7 year old).