Discussion about this post

Nicholas Wagner:

Is it possible that there is truth to the rumors that DeepSeek used chains of thought from other companies for post-training while Mistral did not? Maybe that explains the gap with R1.

Steeven:

Waymo’s scaling law seems to be a bit worse than LLM scaling: doubling the compute for only a ~8% improvement in cross-entropy is really slow scaling. I guess I don’t have an intuitive sense of how long it takes to collect that data. I don’t log all my driving motions each time I get in a car. What a shame; if we did, we’d probably have Waymos everywhere.
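A quick back-of-the-envelope check of what that rate implies, assuming the usual power-law form loss ∝ C^(−α) for compute scaling laws (an assumption; the exact fit isn't quoted in the comment):

```python
import math

# Assumed power law: loss = k * C**(-alpha).
# If doubling compute C cuts cross-entropy by ~8% (per the comment),
# the loss ratio per doubling is ~0.92, so 2**(-alpha) = 0.92
# and alpha = -log2(0.92).
loss_ratio = 0.92
alpha = -math.log2(loss_ratio)
print(f"implied scaling exponent alpha ≈ {alpha:.3f}")  # ≈ 0.120

# At that exponent, halving the loss would take 2**(1/alpha) times
# more compute, i.e. roughly 300x.
factor = 2 ** (1 / alpha)
print(f"compute multiplier to halve loss ≈ {factor:.0f}x")
```

An exponent around 0.12 is what makes this feel slow: each halving of loss costs orders of magnitude more compute (or, here, driving data).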
