FIFO-Diffusion: Generating Infinite Videos from Text
without Training

Jihwan Kim*1 Junoh Kang*1 Jinyoung Choi1 Bohyung Han1, 2

1ECE & 2IPAI, Seoul National University
(* Equal Contribution)
{kjh26720, junoh.kang, jin0.choi, bhhan}@snu.ac.kr



1K-frame Long Videos (512 x 320 resolution, VideoCrafter2)

Abstract

We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without training. This is achieved by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword: the frames near the tail can take advantage of cleaner ones by forward reference, but this strategy induces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We demonstrate promising results and the effectiveness of the proposed methods on strong text-to-video generation baselines.
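The queue mechanics described above can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_denoise_step`, `fifo_diffusion`, the scalar "frames", and the way the initial queue is filled are all assumptions made for illustration. A real video diffusion model would denoise the queued frames jointly in one network call, whereas this sketch updates each scalar independently and only tracks how noise levels move through the queue.

```python
import random
from collections import deque


def toy_denoise_step(value, level):
    """Hypothetical stand-in for one denoising step: level -> level - 1."""
    # Shrinks the remaining "noise" so that a frame at level 1 becomes exactly 0.
    return value * (level - 1) / level


def fifo_diffusion(num_frames, queue_len, seed=0):
    """Toy FIFO loop: returns the dequeued frames and the final queue."""
    rng = random.Random(seed)
    # Assumed initialization: frame i starts at noise level i + 1, so the head
    # is almost clean and the tail is pure noise.
    queue = deque(
        {"id": i, "level": i + 1, "value": rng.gauss(0, 1) * (i + 1) / queue_len}
        for i in range(queue_len)
    )
    next_id = queue_len
    outputs = []
    while len(outputs) < num_frames:
        # Diagonal denoising: one step for every queued frame, each at its own level.
        for frame in queue:
            frame["value"] = toy_denoise_step(frame["value"], frame["level"])
            frame["level"] -= 1
        # The head has reached level 0 (fully denoised): dequeue it.
        outputs.append(queue.popleft())
        # Enqueue a fresh random-noise frame at the tail, at the highest level.
        queue.append({"id": next_id, "level": queue_len, "value": rng.gauss(0, 1)})
        next_id += 1
    return outputs, queue


if __name__ == "__main__":
    frames, remaining = fifo_diffusion(num_frames=10, queue_len=4)
    print([f["id"] for f in frames])
    print([f["level"] for f in remaining])
```

Because every iteration emits exactly one frame and refills the tail, the queue length stays constant however many frames are requested, which is the property that makes arbitrarily long generation possible in this scheme.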

VideoCrafter2 + FIFO-Diffusion (512 frames, 512 x 320 resolution)


An astronaut walking on the moon's surface, high-quality, 4K resolution.


A vibrant underwater scene of a scuba diver exploring a shipwreck, 2K, photorealistic.


A dark knight riding on a black horse on the grassland, photorealistic, 4k, high definition.


A colorful macaw flying in the rainforest, ultra HD.


A majestic lion roaming the savannah, 4K, ultra HD.


A paraglider soaring over the Alps, photorealistic, 4K, high definition.


A panoramic view of a peaceful Zen garden, high-quality, 4K resolution.


A high-speed motorcycle race on a track, ultra HD, 4K resolution.

Open-Sora-Plan + FIFO-Diffusion (385 frames, 512 x 512 resolution)


The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.


Slow pan upward of blazing oak fire in an indoor fireplace.


A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues.


a serene winter scene in a forest. The forest is blanketed in a thick layer of snow, which has settled on the branches of the trees, creating a canopy of white. The trees, a mix of evergreens and deciduous, stand tall and silent, their forms partially obscured by the snow. The ground is a uniform white, with no visible tracks or signs of human activity. The sun is low in the sky, casting a warm glow that contrasts with the cool tones of the snow. The light filters through the trees, creating a soft, diffused illumination that highlights the texture of the snow and the contours of the trees. The overall style of the scene is naturalistic, with a focus on the tranquility and beauty of the winter landscape.


A snowy forest landscape with a dirt road running through it. The road is flanked by trees covered in snow, and the ground is also covered in snow. The sun is shining, creating a bright and serene atmosphere. The road appears to be empty, and there are no people or animals visible in the video. The style of the video is a natural landscape shot, with a focus on the beauty of the snowy forest and the peacefulness of the road.


Sunset over the sea.


The dynamic movement of tall, wispy grasses swaying in the wind. The sky above is filled with clouds, creating a dramatic backdrop. The sunlight pierces through the clouds, casting a warm glow on the scene. The grasses are a mix of green and brown, indicating a change in seasons. The overall style of the video is naturalistic, capturing the beauty of the landscape in a realistic manner. The focus is on the grasses and their movement, with the sky serving as a secondary element. The video does not contain any human or animal elements.


The majestic beauty of a waterfall cascading down a cliff into a serene lake.

VideoCrafter1 + FIFO-Diffusion (100 frames, 512 x 320 resolution)


A school of colorful fish swimming in a coral reef, ultra high quality, 2K.


An exciting mountain bike trail ride through a forest, 2K, ultra HD.


A paraglider soaring over the Alps, photorealistic, 4K, high definition.


A panoramic view of the Himalayas from a drone, high definition, 4K.


A spectacular fireworks display over Sydney Harbour, 4K, high resolution.

zeroscope + FIFO-Diffusion (100 frames, 576 x 320 resolution)


A beautiful cherry blossom festival, time-lapse, high quality.


A close-up of a tarantula walking, high definition.


A detailed macro shot of a blooming rose, 4K.


A panoramic view of the Milky Way, ultra HD.


A thrilling white water rafting adventure, high definition.

FIFO-Diffusion vs Others (512 x 320 resolution)

We compare FIFO-Diffusion with one training-based autoregressive method, LaVie (T2V) + SEINE (I2V), and two training-free techniques, FreeNoise and Gen-L-Video, applied to VideoCrafter2. The training-based autoregressive method (LaVie + SEINE) exhibits periodic discontinuities and quickly diverges from the input text, while the training-free methods (FreeNoise and Gen-L-Video) show lower temporal consistency and visual quality, along with a lack of motion.


A vibrant underwater scene of a scuba diver exploring a shipwreck, 2K, photorealistic.


An astronaut floating in space, high quality, 4K resolution.


A high-speed motorcycle race on a track, ultra HD, 4K resolution.


A panoramic view of a peaceful Zen garden, high-quality, 4K resolution.


A pair of tango dancers performing in Buenos Aires, 4K, high resolution.


A spooky haunted house, foggy night, high definition.
