FIFO-Diffusion: Generating Infinite Videos from Text without Training
NeurIPS 2024
An astronaut floating in space, high quality, 4K resolution.
A spectacular fireworks display over Sydney Harbour, 4K, high resolution.
A colony of penguins waddling on an Antarctic ice sheet, 4K, ultra HD.
We propose a novel inference technique for text-conditional video generation based on a pretrained diffusion model. Our approach, called FIFO-Diffusion, can in principle generate infinitely long videos without training. It does so by iteratively performing diagonal denoising, which concurrently processes a series of consecutive frames with increasing noise levels in a queue: at each step, our method dequeues a fully denoised frame at the head and enqueues a new random noise frame at the tail. However, diagonal denoising is a double-edged sword: frames near the tail can take advantage of cleaner ones through forward reference, but this strategy introduces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. We demonstrate promising results and the effectiveness of the proposed methods on strong text-to-video generation baselines.
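The queue mechanics of diagonal denoising can be illustrated with a toy sketch. It is not the authors' implementation: frames are represented only as hypothetical (frame_id, noise_level) pairs, `toy_denoise_step` stands in for a call to the pretrained video diffusion model, and the queue is simply pre-filled with frames at levels 1 through `queue_len`, which is an assumption made to keep the example short.

```python
from collections import deque
from itertools import count


def toy_denoise_step(frame):
    """Hypothetical stand-in for the pretrained model: lower the noise level by one."""
    frame_id, noise_level = frame
    return (frame_id, noise_level - 1)


def fifo_generate(num_frames, queue_len, denoise_step=toy_denoise_step):
    """Return the ids of `num_frames` fully denoised frames, in output order."""
    ids = count()
    # Head holds the cleanest frame (level 1), tail the noisiest (level queue_len).
    queue = deque((next(ids), level) for level in range(1, queue_len + 1))
    finished = []
    while len(finished) < num_frames:
        # One diagonal step: every frame in the queue is denoised by one level
        # at once, even though each frame sits at a different noise level.
        queue = deque(denoise_step(frame) for frame in queue)
        # The head has now reached level 0, so it is fully denoised: dequeue it.
        frame_id, noise_level = queue.popleft()
        assert noise_level == 0
        finished.append(frame_id)
        # Enqueue a fresh pure-noise frame at the tail.
        queue.append((next(ids), queue_len))
    return finished


frames = fifo_generate(num_frames=5, queue_len=4)
```

Each loop iteration emits exactly one frame and admits exactly one new noise frame, so the queue length stays constant and the loop can in principle run for any number of frames.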
A vibrant underwater scene of a scuba diver exploring a shipwreck, 2K, photorealistic.
A dark knight riding on a black horse on the grassland, photorealistic, 4k, high definition.
An astronaut walking on the moon's surface, high-quality, 4K resolution.
A colorful macaw flying in the rainforest, ultra HD.
A majestic lion roaming the savannah, 4K, ultra HD.
A panoramic view of a peaceful Zen garden, high-quality, 4K resolution.
A high-speed motorcycle race on a track, ultra HD, 4K resolution.
The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene. The camera angle provides a bird's eye view of the waterfall, allowing viewers to appreciate the full height and grandeur of the waterfall. The video is a stunning representation of nature's power and beauty.
Slow pan upward of blazing oak fire in an indoor fireplace.
A quiet beach at dawn, the waves gently lapping at the shore and the sky painted in pastel hues.
a serene winter scene in a forest. The forest is blanketed in a thick layer of snow, which has settled on the branches of the trees, creating a canopy of white. The trees, a mix of evergreens and deciduous, stand tall and silent, their forms partially obscured by the snow. The ground is a uniform white, with no visible tracks or signs of human activity. The sun is low in the sky, casting a warm glow that contrasts with the cool tones of the snow. The light filters through the trees, creating a soft, diffused illumination that highlights the texture of the snow and the contours of the trees. The overall style of the scene is naturalistic, with a focus on the tranquility and beauty of the winter landscape.
A snowy forest landscape with a dirt road running through it. The road is flanked by trees covered in snow, and the ground is also covered in snow. The sun is shining, creating a bright and serene atmosphere. The road appears to be empty, and there are no people or animals visible in the video. The style of the video is a natural landscape shot, with a focus on the beauty of the snowy forest and the peacefulness of the road.
Sunset over the sea.
The dynamic movement of tall, wispy grasses swaying in the wind. The sky above is filled with clouds, creating a dramatic backdrop. The sunlight pierces through the clouds, casting a warm glow on the scene. The grasses are a mix of green and brown, indicating a change in seasons. The overall style of the video is naturalistic, capturing the beauty of the landscape in a realistic manner. The focus is on the grasses and their movement, with the sky serving as a secondary element. The video does not contain any human or animal elements.
The majestic beauty of a waterfall cascading down a cliff into a serene lake.
A school of colorful fish swimming in a coral reef, ultra high quality, 2K.
An exciting mountain bike trail ride through a forest, 2K, ultra HD.
A panoramic view of the Himalayas from a drone, high definition, 4K.
A spectacular fireworks display over Sydney Harbour, 4K, high resolution.
A beautiful cherry blossom festival, time-lapse, high quality.
A close-up of a tarantula walking, high definition.
A detailed macro shot of a blooming rose, 4K.
A panoramic view of the Milky Way, ultra HD.
A thrilling white water rafting adventure, high definition.
We compare FIFO-Diffusion with one training-based autoregressive method, LaVie (T2V) + SEINE (I2V), and two training-free techniques, FreeNoise and Gen-L-Video, applied to VideoCrafter2. The training-based autoregressive method (LaVie + SEINE) exhibits periodic discontinuities and quickly diverges from the input text, while the training-free methods (FreeNoise and Gen-L-Video) show weaker temporal consistency, lower visual quality, and a lack of motion.
A vibrant underwater scene of a scuba diver exploring a shipwreck, 2K, photorealistic.
An astronaut floating in space, high quality, 4K resolution.
A high-speed motorcycle race on a track, ultra HD, 4K resolution.
A panoramic view of a peaceful Zen garden, high-quality, 4K resolution.
A pair of tango dancers performing in Buenos Aires, 4K, high resolution.
A spooky haunted house, foggy night, high definition.
We conduct an ablation study to investigate the effectiveness of each component in FIFO-Diffusion. We compare the results of FIFO-Diffusion with diagonal denoising only (DD), with latent partitioning added using n=4 (LP), and with lookahead denoising added as well (LD). LP significantly improves the quality and temporal consistency of the generated videos, while LD further mitigates flickering artifacts.
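To give a rough intuition for why partitioning can narrow the training-inference gap, the sketch below compares the range of noise levels that a single model call sees with and without partitioning. It rests on assumptions not stated on this page: that latent partitioning splits a queue of n*f frames into n contiguous blocks of f frames, each processed by the model separately, that noise levels are evenly spaced along the queue, and that f = 16. All function names are hypothetical.

```python
from fractions import Fraction


def diagonal_levels(queue_len):
    """Normalized noise levels from head to tail: 1/queue_len, ..., 1."""
    return [Fraction(i, queue_len) for i in range(1, queue_len + 1)]


def partition(levels, n):
    """Split the queue into n contiguous blocks of equal size."""
    if len(levels) % n != 0:
        raise ValueError("queue length must be divisible by n")
    size = len(levels) // n
    return [levels[i * size:(i + 1) * size] for i in range(n)]


def spread(block):
    """Gap between the noisiest and cleanest frame seen in one model call."""
    return block[-1] - block[0]


f = 16  # assumed number of frames the model handles per call
n = 4   # number of partitions, as in the ablation above

# Without partitioning: one block of f frames covers the whole noise range.
single = diagonal_levels(f)
# With partitioning: n*f frames in the queue, split into n blocks of f frames.
partitioned = partition(diagonal_levels(n * f), n)

single_spread = spread(single)                   # Fraction(15, 16)
block_spreads = [spread(b) for b in partitioned]  # Fraction(15, 64) for each block
```

Under these assumptions, each model call with n=4 sees a noise-level range one quarter as wide as without partitioning, which is closer to the training setting where all frames share a single noise level.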
A beautiful cherry blossom festival, time-lapse, high quality.
A panoramic view of the Milky Way, ultra HD.
A detailed macro shot of a blooming rose, 4K.
A beautiful cathedral interior with stained glass, high quality.
A mysterious foggy forest at dawn, high quality, 4K.