Mon. Dec 23rd, 2024
Learning to Play Minecraft with Video PreTraining

The internet contains a huge amount of publicly available video that we could learn from: people giving gorgeous presentations, digital artists painting beautiful sunsets, and Minecraft players building intricate houses. However, these videos only record what happened, not exactly how it was achieved; that is, we do not know the precise sequence of mouse movements and key presses. If we want to build large foundation models in these domains, as was done for language with GPT, this lack of action labels poses a new challenge that does not exist in the language domain, where the "action label" is simply the next word in a sentence.

To take advantage of the wealth of unlabeled video data available on the internet, we introduce Video PreTraining (VPT), a novel yet simple semi-supervised imitation learning method. First, we collect a small dataset from contractors, recording not only their video but also the actions they took (in our case, key presses and mouse movements). With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in a video. Importantly, the IDM can use both past and future information to infer the action at each step. This task is much easier, and therefore requires far less data, than the behavioral cloning task of predicting actions from past video frames alone, which requires inferring what the person intends to do and how to achieve it. The trained IDM can then be used to label a much larger dataset of online videos, from which an agent learns to act via behavioral cloning.
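The pipeline therefore has three stages: train an IDM on the small labeled contractor dataset, pseudo-label the large unlabeled video corpus with that IDM, and train a behavioral cloning policy on the pseudo-labels. Below is a minimal PyTorch sketch of that flow; the network sizes, the discretized action space, the frame-feature dimension, and the synthetic placeholder tensors are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the VPT-style pipeline (assumptions, not the original code):
# 1) train an inverse dynamics model (IDM) on a small labeled dataset,
# 2) pseudo-label a large unlabeled video dataset with the IDM,
# 3) train a behavioral cloning policy on those pseudo-labels.
import torch
import torch.nn as nn

N_ACTIONS = 16   # assumed size of a discretized action space
FRAME_DIM = 128  # assumed per-frame feature dimension
WINDOW = 8       # the IDM sees a window of past AND future frames

class IDM(nn.Module):
    """Predicts the action taken within a window of frames (non-causal)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),  # (B, WINDOW, FRAME_DIM) -> (B, WINDOW * FRAME_DIM)
            nn.Linear(WINDOW * FRAME_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, N_ACTIONS),
        )
    def forward(self, frames):
        return self.net(frames)

class Policy(nn.Module):
    """Behavioral cloning policy: predicts the action from the current frame only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, N_ACTIONS),
        )
    def forward(self, frame):
        return self.net(frame)

def train(model, inputs, targets, epochs=3):
    """Simple cross-entropy training loop shared by the IDM and the policy."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()

# 1) Small contractor dataset: frame windows with ground-truth actions.
labeled_windows = torch.randn(64, WINDOW, FRAME_DIM)
labeled_actions = torch.randint(0, N_ACTIONS, (64,))
idm = IDM()
train(idm, labeled_windows, labeled_actions)

# 2) Pseudo-label a much larger unlabeled video dataset with the IDM.
unlabeled_windows = torch.randn(1024, WINDOW, FRAME_DIM)
with torch.no_grad():
    pseudo_actions = idm(unlabeled_windows).argmax(dim=-1)

# 3) Behavioral cloning on the pseudo-labeled data: the policy only sees the
#    current frame, since at deployment time it cannot look into the future.
policy = Policy()
train(policy, unlabeled_windows[:, WINDOW // 2, :], pseudo_actions)
```

The key design point this sketch illustrates is the asymmetry between the two models: the IDM is non-causal and may look at future frames, which makes its prediction problem much easier, while the policy trained by behavioral cloning must remain causal because it has to act online.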