4 Comments
Tony Rifkin

Yes too, with Zhu, on “we've come a very, very long way”. Takes one’s breath away, eh?... climbing the mountain, and still climbing/scaling up, to an ever higher benchmark. But how? And what will be reached? The top of the mountain, via steps along the path of the token patterns of what is known ... and in that, the ways in which it is known! So with Zhu too, we will have to look at those token steps to see how and what it is that is known. Language is amazing.

Zhu

Yes, "we've come a very, very long way" on the evaluation. But shouldn't it evaluate the linguistic abilities in a really fine-grained and well-designed way, rather than just discarding or ignoring it? We know that for tasks of fundemantal NLP, like sentence parsing, WSD, etc, it can't recieve 100% performance yet.

David Kunin

So much great detail! Thanks for this work.

Steeven

The compute parade reminds me of the Mechanicus.

The condensed matter physics dataset is funny. I can barely read the question, let alone answer it.

The Muon optimizer appears to be about as good as Adam at 1.2B parameters, and the graph showing performance doesn't test the other settings, so it's not clear what the takeaway is, especially since 1.2B is tiny relative to what's used in practice, like you said.
