"By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrate substantial performance drops (up to 65%) across all state-of-the-art models"
On the one hand, I feel embarrassed for not having thought of this myself.
On the other hand, I feel happy at the realization that we still have a fast-growing tree of technology with 'easily' exploitable potential just waiting to be discovered.
Wow, yeah, GSM-NoOp is a lot worse than GSM8K. I suppose models aren't trained on irrelevant data like this during SFT or RLHF.
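For anyone curious what this kind of perturbation looks like mechanically, here's a minimal sketch (my own illustration, not the paper's actual pipeline); the add_noop_clause helper and the kiwi-style distractor are assumptions loosely modeled on the paper's examples:

    # Illustrative sketch of a GSM-NoOp-style perturbation: append a clause
    # whose numbers sound relevant but change nothing about the answer.

    def add_noop_clause(problem: str, distractor: str) -> str:
        """Insert an irrelevant clause just before the final question."""
        # Split off the last sentence so the distractor lands mid-problem.
        body, _, question = problem.rpartition(". ")
        return f"{body}. {distractor} {question}"

    original = (
        "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?"
    )
    # Mentions a number, but the count of kiwis is unchanged.
    distractor = "Five of the kiwis he picked were a bit smaller than average."
    print(add_noop_clause(original, distractor))

The correct answer is still 44 + 58 = 102, but models reportedly subtract the 5, which is what drives the drops the paper measures.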
"By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrate substantial performance drops (up to 65%) across all state-of-the-art models"
On the one hand, I feel embarassed for not having thought of this myself.
On the other hand, i feel happy at the realization that we still have a fast growing tree of technology with 'easily' exploitable potentials that just need to be discovered.
Typo: in "... the largest decentralized training runs were of the order of 1bn, so 500bn is a big difference ...", the "500bn" should be "10bn".