https://static1.squarespace.com/static/6213c340453c3f502425776e/t/655ce779b9d47d342a93c890/1700587395994/stable_video_diffusion.pdf

Curating Data for HQ Video Synthesis

3.1. Data Processing and Annotation

We first collect an initial dataset of long videos, which serves as the base data for video pretraining.

Cut detection

Detect the cuts in each video and split the video at those points, which yields a larger number of shorter clips.
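To make the splitting step concrete, here is a minimal sketch. It assumes a very simple cut detector: a cut is declared wherever the mean absolute pixel difference between two consecutive frames exceeds a threshold. The function names, the flat-list frame representation, and the threshold value are all hypothetical; the paper's actual cut detection pipeline is not reproduced here.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def split_at_cuts(frames, threshold=50.0):
    """Split a video into clips wherever consecutive frames differ sharply.

    frames: list of frames, each a flat list of pixel intensities (assumed).
    threshold: hypothetical cut threshold, not a value from the paper.
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    if not frames:
        return []
    clips = []
    start = 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            clips.append((start, i))  # close the current clip at the cut
            start = i
    clips.append((start, len(frames)))  # the last clip runs to the end
    return clips


# Toy video: three dark frames followed by two bright frames (one hard cut).
video = [[10, 10, 10, 10]] * 3 + [[200, 200, 200, 200]] * 2
clips = split_at_cuts(video)
print(clips)
```

One video with one cut becomes two clips, which is why the clip count grows after this step.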

Annotate each clip with a caption

A clip is an ordered sequence of images (frames).

Since it is an ordered sequence, we can simply pick the middle image. Use the image captioner CoCa [103] to annotate the mid-frame of each clip.
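The mid-frame selection can be sketched as follows. The captioner here is a hypothetical stand-in function, not CoCa; any callable that maps a frame to a string could be plugged in. Clips are assumed to be (start, end) index pairs with end exclusive.

```python
def middle_frame_index(start, end):
    """Index of the middle frame of a clip spanning [start, end)."""
    return start + (end - start) // 2


def caption_clips(clips, frames, captioner):
    """Caption each clip by running the captioner on its middle frame only."""
    return {
        (start, end): captioner(frames[middle_frame_index(start, end)])
        for start, end in clips
    }


def dummy_captioner(frame):
    """Hypothetical stand-in for an image captioner such as CoCa."""
    return "bright scene" if sum(frame) / len(frame) > 128 else "dark scene"


# Toy data: a dark clip (frames 0-2) and a bright clip (frames 3-4).
frames = [[10, 10]] * 3 + [[200, 200]] * 2
captions = caption_clips([(0, 3), (3, 5)], frames, dummy_captioner)
print(captions)
```

Captioning a single frame per clip keeps the cost at one captioner call per clip, at the price of ignoring anything that only happens before or after the middle frame.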

Smoothness

Now each clip has a caption.

Flow score: measures how static or fluid the transition is from one frame to the next within the same clip.
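As a rough illustration of scoring a clip for motion, the sketch below does not compute optical flow. It substitutes a much cruder proxy, the average absolute pixel difference between consecutive frames, only to show how a per-clip score separates static clips from moving ones. The function names and the `MIN_MOTION` threshold are hypothetical.

```python
def motion_score(frames):
    """Average mean-absolute difference between consecutive frames.

    A crude proxy for motion, NOT an optical flow computation.
    """
    if len(frames) < 2:
        return 0.0
    diffs = [
        sum(abs(x - y) for x, y in zip(a, b)) / len(a)
        for a, b in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)


MIN_MOTION = 1.0  # hypothetical threshold, not a value from the paper


def keep_moving_clips(clips):
    """Drop clips whose motion score falls below the threshold."""
    return [clip for clip in clips if motion_score(clip) >= MIN_MOTION]


static_clip = [[50, 50, 50, 50]] * 4
moving_clip = [[0, 0, 0, 0], [10, 20, 10, 20], [20, 40, 20, 40]]
kept = keep_moving_clips([static_clip, moving_clip])
print(motion_score(static_clip), motion_score(moving_clip), len(kept))
```

A real pipeline would replace `motion_score` with an actual flow estimate, but the filtering logic around it would look much the same.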