Yes too, with Zhu, on “we've come a very, very long way”. Takes one’s breath away, eh? ... climbing the mountain, and still climbing, still scaling up, to an ever higher benchmark. But how? And what will be reached? The top of the mountain, via steps along the path of the token patterns of what is known ... and, in that, the ways in which it is known! So with Zhu too, we will have to look at those token steps to see how, and what, it is that is known. Language is amazing.
Yes, "we've come a very, very long way" on the evaluation. But shouldn't it evaluate the linguistic abilities in a really fine-grained and well-designed way, rather than just discarding or ignoring it? We know that for tasks of fundemantal NLP, like sentence parsing, WSD, etc, it can't recieve 100% performance yet.
So much great detail! Thanks for this work.
The compute parade reminds me of the Mechanicus.
The condensed matter physics dataset is funny. I can barely read the question, let alone answer it.
The Muon optimizer appears to be only about as good as Adam at 1.2B parameters, and other parameter counts weren't tested in the graph showing performance, so it's not clear what the takeaway is, especially since 1.2B is tiny relative to the model sizes actually used, like you said.
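For anyone wondering what that graph is actually comparing: Muon keeps a plain momentum buffer and then approximately orthogonalizes that 2-D momentum matrix with a few Newton-Schulz iterations before applying it, whereas Adam rescales each gradient coordinate independently. Below is a minimal NumPy sketch of that core idea. The function names, learning rate, and the omission of Muon's shape-dependent scale factor are my assumptions for illustration; the Newton-Schulz coefficients follow the commonly shared reference code. Treat it as a sketch, not the exact setup behind the 1.2B run.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map a 2-D matrix G to the nearest semi-orthogonal matrix
    (the U V^T factor of its SVD) via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315      # coefficients from the reference implementation
    X = G / (np.linalg.norm(G) + eps)      # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:                         # iterate in the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    """One sketched Muon update for a single 2-D weight matrix W.
    (Adam, by contrast, rescales the gradient coordinate-wise using running
    first/second moment estimates; there is no orthogonalization step.)"""
    buf = beta * buf + grad                       # plain momentum buffer
    direction = newton_schulz_orthogonalize(buf)  # orthogonalized update direction
    # NOTE: real implementations also apply a shape-dependent scale factor here,
    # which varies between versions, so it is omitted from this sketch.
    return W - lr * direction, buf

# Tiny usage example on a random 256x128 "weight matrix".
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)) * 0.02
buf = np.zeros_like(W)
grad = rng.normal(size=W.shape)
W, buf = muon_step(W, grad, buf)
```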