This technical report focuses on (1) our method for converting visual data of all types into a unified representation that enables the training of large-scale generative models and (2) a qualitative assessment of Sora's capabilities and limitations. Model and implementation details are not included in this report.
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,(^1)(^2)(^3) generative adversarial networks,(^4)(^5)(^6)(^7) autoregressive transformers,(^8)(^9) and diffusion models.(^10)(^11)(^12) These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a general model of visual data: it can generate videos and images of varying durations, aspect ratios, and resolutions, up to a full minute of high-definition video.