This technical report focuses on (1) how to transform all kinds of visual data into a unified representation that enables large-scale training of generative models, and (2) a qualitative assessment of Sora’s capabilities and limitations. Model and implementation details are not included in this report.
Previous studies have investigated generative modeling of video data using a variety of methods, including recurrent networks,[^1][^2][^3] generative adversarial networks,[^4][^5][^6][^7] autoregressive transformers,[^8][^9] and diffusion models.[^10][^11][^12] These works often focus on narrow categories of visual data, short videos, or fixed-size videos. Sora is a general-purpose model for visual data that can generate videos and images across a variety of lengths, aspect ratios, and resolutions, up to one minute of high-definition video.