The Internet contains a huge amount of publicly available videos from which we can learn. You can watch a person create a beautiful presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened, not exactly how it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed. If we want to build large-scale foundation models in these domains as we did in language with GPT, this lack of action labels poses a new challenge not present in the domain of language, where "action labels" are simply the next words in a sentence.
To exploit the wealth of unlabeled video data available on the Internet, we present a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by collecting a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data, we train an Inverse Dynamics Model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use both past and future information to guess the action at each step. This task is much easier, and therefore requires far less data, than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
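To make the three-step pipeline concrete, here is a minimal sketch in PyTorch. It is not the actual VPT implementation: the frame size, action vocabulary, placeholder MLP networks, and random stand-in datasets are all assumptions chosen only to illustrate how an IDM trained on a small labeled dataset can pseudo-label a larger unlabeled one for behavioral cloning.

```python
# Illustrative sketch of the VPT pipeline (not the authors' code).
# Assumptions: 64x64 RGB frames, a discrete action vocabulary, and
# simple placeholder networks; real data tensors are stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

NUM_ACTIONS = 16   # hypothetical size of the discrete action space
CONTEXT = 8        # number of frames given to each model


class InverseDynamicsModel(nn.Module):
    """Predicts the action at a step using both past and future frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(CONTEXT * 3 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, NUM_ACTIONS),
        )

    def forward(self, frames):          # frames: (B, CONTEXT, 3, 64, 64)
        return self.net(frames)


class BehavioralCloningPolicy(nn.Module):
    """Predicts the next action from past frames only (causal)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(CONTEXT * 3 * 64 * 64, 512), nn.ReLU(),
            nn.Linear(512, NUM_ACTIONS),
        )

    def forward(self, past_frames):     # past_frames: (B, CONTEXT, 3, 64, 64)
        return self.net(past_frames)


def train(model, loader, epochs=1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, actions in loader:
            loss = loss_fn(model(frames), actions)
            opt.zero_grad()
            loss.backward()
            opt.step()


# 1) Train the IDM on the small labeled dataset (video + recorded actions).
contractor_frames = torch.randn(32, CONTEXT, 3, 64, 64)    # stand-in data
contractor_actions = torch.randint(0, NUM_ACTIONS, (32,))
idm = InverseDynamicsModel()
train(idm, DataLoader(TensorDataset(contractor_frames, contractor_actions), batch_size=8))

# 2) Use the trained IDM to pseudo-label a much larger unlabeled video dataset.
web_frames = torch.randn(128, CONTEXT, 3, 64, 64)          # stand-in data
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)

# 3) Train the policy by behavioral cloning on the pseudo-labeled videos.
policy = BehavioralCloningPolicy()
train(policy, DataLoader(TensorDataset(web_frames, pseudo_actions), batch_size=8))
```

The key asymmetry the sketch preserves is that the IDM sees the whole frame window (past and future), while the behavioral cloning policy would, in a real system, only be allowed to condition on frames up to the current step.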