Simply add images and video, and these estimates start to sound like the "640 KB...

netcan · 2025-07-02T13:03:43 1751461423

Like, both will be done. Idk what the ROI is on adding video data to the text models, but it's presumably lower than for text.

There are just a lot of avenues to try at this point.

llSourcell · 2025-07-02T17:46:42 1751478402

No, it's not lower than text. It's higher ROI than text for understanding the physics of the world, which is exactly what video is better at than text as training data.

AstroBen · 2025-07-02T19:58:03 1751486283

Does that transfer, though? I'm not sure we can expect that its ability to approximate physics in video form would transfer to any other mode (text, code, problem solving, etc.).

ricopags · 2025-07-02T20:41:30 1751488890

Depends on the hyperparams, but one of the biggest benefits of a latent space is transfer between modalities.