https://static1.squarespace.com/static/6213c340453c3f502425776e/t/655ce779b9d47d342a93c890/1700587395994/stable_video_diffusion.pdf

A cut is an abrupt transition from one scene or shot to another within a video. Cuts are commonly used in video production to change scenes or viewpoints, or to move the narrative along, without a smooth transition such as a crossfade or dissolve. In raw or unedited footage, cuts may also be unintentional, resulting from recording interruptions, changes in camera angle, or other disruptions in the recording environment.
Impact on Synthesized Videos: Cuts can introduce abrupt changes in the visual content, which may be undesirable when training models for video synthesis. Models trained on data with many cuts might learn these discontinuities, leading to synthesized videos that also contain abrupt or nonsensical transitions, reducing the quality and coherence of the generated content.
Definition: A clip is a short segment of a video, usually continuous in time and content. Clips are extracted from longer videos to capture specific scenes or sequences without abrupt changes in content or style.
Utility in Video Pretraining: Using clips instead of entire unprocessed videos allows for more controlled and consistent input to video synthesis models. By focusing on continuous, cut-free segments, the model can learn from coherent visual narratives, improving its ability to generate similar continuous content.
FPS: frames per second is the number of images (frames) captured or displayed per second. Its primary purpose is to define the temporal resolution of the video: it dictates how motion is captured and displayed, with direct implications for the video's visual fluidity and the viewer's experience.
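To make the idea of temporal resolution concrete, here is a minimal sketch of reducing a video's FPS by keeping only some of its frames. The function name and its parameters are illustrative assumptions, not taken from the paper.

```python
from typing import List


def subsample_indices(num_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Indices of the frames to keep when reducing source_fps to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("fps values must be positive")
    if target_fps > source_fps:
        raise ValueError("target_fps cannot exceed source_fps")

    # Number of source frames that elapse per kept frame.
    step = source_fps / target_fps

    indices: List[int] = []
    i = 0
    while True:
        idx = int(i * step)
        if idx >= num_frames:
            break
        indices.append(idx)
        i += 1
    return indices
```

For example, two seconds of 30 FPS footage (60 frames) reduced to 2 FPS keeps only four frames, so motion between the kept frames looks larger and less fluid.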
Optical flow is a concept from computer vision and video processing that describes the apparent motion of objects, surfaces, and edges in a visual scene, caused by the relative motion between an observer (camera or eye) and the scene. It is represented as a vector field where each vector shows the displacement of points from one frame to the next.
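The vector-field representation can be sketched as follows. This assumes the point correspondences between the two frames are already known; a real optical flow estimator has to find them, which is the hard part and is not shown here. All names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]   # (x, y) position in a frame
Vector = Tuple[float, float]  # (dx, dy) displacement between frames


def displacement_field(points_prev: Sequence[Point],
                       points_next: Sequence[Point]) -> List[Vector]:
    """One flow vector per tracked point: its position change between two frames."""
    if len(points_prev) != len(points_next):
        raise ValueError("both frames must contain the same tracked points")
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points_prev, points_next)
    ]
```

A point that moves from (0, 0) to (2, 0) gets the vector (2, 0); a point that stays put gets (0, 0).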

Curating Data for HQ Video Synthesis

3.1. Data Processing and Annotation

We first collect an initial dataset of long videos, which forms the base data for video pretraining.

Cut detection

Detect cuts and split each video at them, which turns the videos into a larger number of clips.
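These notes do not describe the detector itself, so the following is only a naive sketch of the idea: flag a cut wherever two consecutive frames differ by more than a threshold, then split the video there. The frame representation (flattened grayscale values), the difference measure, and the threshold are all assumptions for illustration, not the paper's pipeline.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # flattened grayscale pixel intensities (assumed format)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def split_at_cuts(frames: Sequence[Frame], threshold: float) -> List[Tuple[int, int]]:
    """Return clips as (start, end) frame-index pairs, with end exclusive."""
    if not frames:
        return []
    clips: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frames)):
        # A large jump between neighbouring frames is treated as a cut.
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            clips.append((start, i))
            start = i
    clips.append((start, len(frames)))
    return clips
```

A video with two cuts yields three clips, which is why the clip count ends up higher than the video count. A simple threshold like this will miss gradual transitions such as fades.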

Annotate each clip with a caption

A clip is an ordered sequence of frames (images).

Since it is an ordered sequence, we just pick the middle frame: the image captioner CoCa [103] is used to annotate the mid-frame of each clip.
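The mid-frame selection can be sketched as below. The `captioner` argument is a hypothetical stand-in for an image captioning model such as CoCa; it is not CoCa's real interface, only a callable that maps one frame to a string.

```python
from typing import Callable, Sequence, TypeVar

FrameT = TypeVar("FrameT")


def mid_frame_index(num_frames: int) -> int:
    """Index of the middle frame of a clip with num_frames frames."""
    if num_frames <= 0:
        raise ValueError("a clip must contain at least one frame")
    return num_frames // 2


def caption_clip(frames: Sequence[FrameT],
                 captioner: Callable[[FrameT], str]) -> str:
    """Caption a clip by captioning only its middle frame."""
    return captioner(frames[mid_frame_index(len(frames))])
```

For a clip with an even number of frames there is no single middle frame; `num_frames // 2` picks the later of the two central frames, which is an arbitrary convention chosen here.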

Smoothness

Now each clip has a caption.

Flow score: how static or fluid the transition is from one frame to the next within the same clip.
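These notes do not give a formula for the flow score. A minimal sketch, assuming the score is the average magnitude of the optical-flow vectors over a clip, and assuming the flow fields have already been computed by some estimator (not shown). The threshold used to call a clip static is a hypothetical parameter.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, float]  # (dx, dy) displacement of one point


def mean_flow_magnitude(flow: Sequence[Vector]) -> float:
    """Average length of the flow vectors between two consecutive frames."""
    if not flow:
        raise ValueError("flow field must contain at least one vector")
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)


def clip_flow_score(flows: Sequence[Sequence[Vector]]) -> float:
    """Average flow magnitude over all consecutive frame pairs in a clip."""
    if not flows:
        raise ValueError("clip must contain at least one flow field")
    return sum(mean_flow_magnitude(f) for f in flows) / len(flows)


def is_static(flows: Sequence[Sequence[Vector]], threshold: float) -> bool:
    """True if the clip moves less, on average, than the given threshold."""
    return clip_flow_score(flows) < threshold
```

A score near zero means the frames barely change (a static clip), while a larger score means more motion between frames.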