v0.30.0: New Pipelines (Flux, Stable Audio, Kolors, CogVideoX, Latte, and more), New Methods (FreeNoise, SparseCtrl), and New Refactors

@sayakpaul sayakpaul released this 07 Aug 07:47
· 318 commits to main since this release

New pipelines

Image taken from Lumina’s GitHub.

This release features many new pipelines. Below, we provide a list:

Audio pipelines 🎼

Video pipelines 📹

Image pipelines 🎇

Be sure to check out the respective docs to learn more about these pipelines. Some additional pointers for curious minds:

  • Lumina introduces a new DiT architecture that is multilingual in nature.
  • Kolors is inspired by SDXL and is also multilingual in nature.
  • Flux introduces the largest (more than 12B parameters!) open-sourced DiT variant available to date. For efficient DreamBooth + LoRA training, we recommend @bghira’s guide here.
  • We have worked on a guide that shows how to quantize these large pipelines for memory efficiency with optimum.quanto. Check it out here.
  • CogVideoX introduces a novel and truly 3D VAE into Diffusers.

Perturbed Attention Guidance (PAG)

Figure: outputs without PAG and with PAG.

We already had community pipelines for PAG, but given its usefulness, we decided to make it a first-class citizen of the library. We have a central usage guide for PAG here, which should be the entry point for a user interested in understanding and using PAG for their use cases. We currently support the following pipelines with PAG:

  • StableDiffusionPAGPipeline
  • StableDiffusion3PAGPipeline
  • StableDiffusionControlNetPAGPipeline
  • StableDiffusionXLPAGPipeline
  • StableDiffusionXLPAGImg2ImgPipeline
  • StableDiffusionXLPAGInpaintPipeline
  • StableDiffusionXLControlNetPAGPipeline
  • PixArtSigmaPAGPipeline
  • HunyuanDiTPAGPipeline
  • AnimateDiffPAGPipeline
  • KolorsPAGPipeline

If you’re interested in helping us extend our PAG support for other pipelines, please check out this thread.
Special thanks to Ahn Donghoon (@sunovivid), the author of PAG, for helping us with the integration and adding PAG support to SD3.
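
As a rough sketch of what enabling PAG looks like, the helper below loads a pipeline through AutoPipelineForText2Image with enable_pag=True and passes pag_scale at call time. The function name, the SDXL base checkpoint, the choice of the "mid" layer, and the scale values are assumptions picked for illustration; the usage guide remains the reference for recommended settings.

```python
def generate_with_pag(prompt, pag_scale=3.0, guidance_scale=7.0, seed=0):
    """Generate one image with PAG enabled (needs a GPU and a model download)."""
    # Heavy dependencies are imported lazily, so defining the helper is cheap.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipeline = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",  # assumed checkpoint
        enable_pag=True,              # selects the PAG variant of the pipeline
        pag_applied_layers=["mid"],   # assumed choice of attention layers to perturb
        torch_dtype=torch.float16,
    )
    pipeline.enable_model_cpu_offload()

    generator = torch.Generator(device="cpu").manual_seed(seed)
    result = pipeline(
        prompt=prompt,
        num_inference_steps=25,
        guidance_scale=guidance_scale,
        pag_scale=pag_scale,          # setting this to 0.0 turns PAG off
        generator=generator,
    )
    return result.images[0]


# Example call (commented out because it downloads several GB of weights):
# image = generate_with_pag("an insect robot preparing a delicious meal, anime style")
# image.save("pag.png")
```

Keeping pag_scale as a call-time argument makes it easy to compare outputs with and without PAG from the same loaded pipeline.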

AnimateDiff with SparseCtrl

SparseCtrl adds controllability to text-to-video diffusion models using signals such as line/edge sketches, depth maps, and RGB images. It incorporates an additional condition encoder, inspired by ControlNet, to process these signals within the AnimateDiff framework. It can be applied to a diverse set of applications such as interpolation or video prediction (filling in the gaps between a sequence of images for animation), personalized image animation, sketch-to-video, depth-to-video, and more. It was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models.

The authors have made two SparseCtrl-specific checkpoints and a Motion LoRA available.

Scribble Interpolation Example:

import torch

from diffusers import AnimateDiffSparseControlNetPipeline, AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image

# Assumes a CUDA-capable GPU is available.
device = "cuda"

motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
# The scheduler is configured on the pipeline right after loading, so it is not passed here.
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to(device)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, beta_schedule="linear", algorithm_type="dpmsolver++", use_karras_sigmas=True)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-v1-5-3", adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)

prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"

image_files = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")
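
To make the role of controlnet_frame_indices more concrete, here is a minimal, model-free sketch of how a few conditioning frames could be scattered over a clip together with a binary mask marking which frames carry a signal. The helper name, the 16-frame clip length, and the use of plain lists are assumptions for illustration; this is not the pipeline's internal code.

```python
def build_sparse_conditioning(num_frames, conditioning_frames, frame_indices, empty=None):
    """Place each conditioning frame at its index and mark it in a 0/1 mask."""
    if len(conditioning_frames) != len(frame_indices):
        raise ValueError("need exactly one frame index per conditioning frame")

    signal = [empty] * num_frames  # per-frame condition, `empty` where none is given
    mask = [0] * num_frames        # 1 where a condition is present, 0 elsewhere

    for frame, index in zip(conditioning_frames, frame_indices):
        if not 0 <= index < num_frames:
            raise IndexError(f"frame index {index} is outside a {num_frames}-frame clip")
        signal[index] = frame
        mask[index] = 1
    return signal, mask


# Three scribbles pinned to the first, middle, and last frame of a 16-frame clip.
signal, mask = build_sparse_conditioning(
    num_frames=16,
    conditioning_frames=["scribble-1", "scribble-2", "scribble-3"],
    frame_indices=[0, 8, 15],
)
```

The frames whose mask entry is 0 are the ones the model has to fill in, which is what makes the interpolation use case possible.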

📜 Check out the docs here.

FreeNoise for AnimateDiff

FreeNoise is a training-free method that allows extending the generative capabilities of pretrained video diffusion models beyond their existing context/frame limits.

Instead of initializing noise for all frames, FreeNoise reschedules a sequence of noise tensors for long-range correlation and performs temporal attention over them using a window-based function. We have added FreeNoise to the AnimateDiff family of models in Diffusers, allowing them to generate videos beyond their default 32-frame limit.

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerAncestralDiscreteScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = EulerAncestralDiscreteScheduler(
    beta_schedule="linear",
    beta_start=0.00085,
    beta_end=0.012,
)

pipe.enable_free_noise()
pipe.vae.enable_slicing()

pipe.enable_model_cpu_offload()
frames = pipe(
    "An astronaut riding a horse on Mars.",
    num_frames=64,
    num_inference_steps=20,
    guidance_scale=7.0,
    decode_chunk_size=2,
).frames[0]

export_to_gif(frames, "freenoise-64.gif")
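
The noise rescheduling idea can be sketched without any model. The toy function below only tracks which base-noise index each frame reuses: the first window gets fresh noise, and every later block reuses a shuffled slice of noise from one window earlier. The function name, the window of 16 frames, and the stride of 4 are assumptions chosen for illustration; this is a simplified sketch, not the Diffusers implementation.

```python
import random


def reschedule_noise_indices(num_frames, context_length=16, context_stride=4, seed=0):
    """Return, for each frame, the index of the base noise sample it reuses."""
    rng = random.Random(seed)

    # Frames inside the first window each get their own fresh noise sample.
    indices = list(range(min(num_frames, context_length)))

    # Every later block of `context_stride` frames copies the noise used one
    # window earlier, shuffled so the frames are correlated but not identical.
    for start in range(context_length, num_frames, context_stride):
        window_start = start - context_length
        window = indices[window_start:window_start + context_stride]
        rng.shuffle(window)
        remaining = num_frames - start
        indices.extend(window[:remaining])
    return indices


schedule = reschedule_noise_indices(num_frames=64)
```

Because every frame beyond the first window maps back to one of the original 16 noise samples, distant frames stay correlated even though the model only ever attends within a window.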

LoRA refactor

We have significantly refactored the loader classes associated with LoRA. Going forward, this will help in adding LoRA support for new pipelines and models. We now have a LoraBaseMixin class which is subclassed by the different pipeline-level LoRA loading classes such as StableDiffusionXLLoraLoaderMixin. This document provides an overview of the available classes.

Additionally, we have increased the coverage of methods within the PeftAdapterMixin class. This refactoring allows all the supported models to share common LoRA functionalities such as set_adapter(), add_adapter(), and so on.

To learn more details, please follow this PR. If you see any LoRA-related issues stemming from these refactors, please open an issue.
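
The layering described above can be pictured with a toy class hierarchy: a base mixin holds the shared loading logic, and each pipeline-level loader only declares which sub-models accept LoRA weights. Every class, attribute, and key name below is hypothetical and exists only to illustrate the pattern; none of it is the library's actual code.

```python
class ToyLoraBaseMixin:
    """Hypothetical base class with logic shared by all pipeline-level loaders."""

    # Sub-models that may receive LoRA weights; subclasses override this.
    lora_loadable_modules = ()

    def load_lora_weights(self, state_dict, adapter_name="default"):
        """Route keys such as "unet.down.lora_A" to the sub-model they belong to."""
        routed = {}
        for key, value in state_dict.items():
            module, _, param = key.partition(".")
            if module not in self.lora_loadable_modules:
                raise ValueError(
                    f"{module!r} cannot load LoRA weights in {type(self).__name__}"
                )
            routed.setdefault(module, {})[param] = value

        # Keep adapters by name so several can coexist on one pipeline.
        if not hasattr(self, "adapters"):
            self.adapters = {}
        self.adapters[adapter_name] = routed
        return routed


class ToySDXLLoraLoaderMixin(ToyLoraBaseMixin):
    lora_loadable_modules = ("unet", "text_encoder", "text_encoder_2")


class ToySD3LoraLoaderMixin(ToyLoraBaseMixin):
    lora_loadable_modules = ("transformer", "text_encoder", "text_encoder_2")


class ToySDXLPipeline(ToySDXLLoraLoaderMixin):
    """A pipeline gains LoRA loading simply by inheriting the right mixin."""


pipe = ToySDXLPipeline()
routed = pipe.load_lora_weights(
    {"unet.down.lora_A": 1, "text_encoder.q.lora_B": 2}, adapter_name="style"
)
```

With this shape, supporting a new pipeline mostly means adding one small subclass rather than duplicating the loading logic.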

🚨 Fixing attention projection fusion

We discovered that the implementation of fuse_qkv_projections() was broken. This was fixed in this PR. Additionally, this PR added fusion support to AuraFlow and PixArt Sigma. An explanation of where this kind of fusion might be useful is available here.
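
The idea behind projection fusion can be shown with plain Python lists: concatenating the query, key, and value weight matrices lets a single matrix multiplication produce all three projections, which are then split apart. The helper names and the tiny 2x2 weights are assumptions for illustration, and this sketch says nothing about how fuse_qkv_projections() is implemented internally.

```python
def matmul(x, w):
    """Multiply an (n x d) matrix by a (d x m) matrix, both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def fuse_columns(*weights):
    """Concatenate weight matrices that share a row count along the column axis."""
    return [sum((list(w[i]) for w in weights), []) for i in range(len(weights[0]))]


def split_columns(matrix, sizes):
    """Split every row of `matrix` into consecutive chunks of the given widths."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append([row[start:start + size] for row in matrix])
        start += size
    return chunks


x = [[1.0, 2.0], [3.0, 4.0]]    # two tokens with two features each
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 1.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [1.0, 2.0]]

# One fused projection replaces three separate ones.
w_qkv = fuse_columns(w_q, w_k, w_v)
q, k, v = split_columns(matmul(x, w_qkv), [2, 2, 2])
```

The fused path returns the same values as three separate projections while launching one larger matrix multiplication instead of three smaller ones.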

All commits

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @DN6
    • [Tests] Fix precision related issues in slow pipeline tests (#8720)
    • Remove legacy single file model loading mixins (#8754)
    • Enforce ordering when running Pipeline slow tests (#8763)
    • Fix warning in UNetMotionModel (#8756)
    • Fix indent in dreambooth lora advanced SD 15 script (#8753)
    • Fix mistake in Single File Docs page (#8765)
    • [Single File] Allow loading T5 encoder in mixed precision (#8778)
    • Fix saving text encoder weights and kohya weights in advanced dreambooth lora script (#8766)
    • Add VAE tiling option for SD3 (#8791)
    • Add single file loading support for AnimateDiff (#8819)
    • Add option to SSH into CPU runner. (#8884)
    • SSH into cpu runner fix (#8888)
    • SSH into cpu runner additional fix (#8893)
    • Update pipeline test fetcher (#8931)
    • Fix name when saving text inversion embeddings in dreambooth advanced scripts (#8927)
    • [CI] Skip flaky download tests in PR CI (#8945)
    • [CI] Slow Test Updates (#8870)
    • [CI] Fix parallelism in nightly tests (#8983)
    • [CI] Nightly Test Runner explicitly set runner for Setup Pipeline Matrix (#8986)
    • Updates deps for pipeline test fetcher (#9033)
    • Fix Nightly Deps (#9036)
    • update
    • [Docs] Add community projects section to docs (#9013)
    • [Single File] Add single file support for Flux Transformer (#9083)
    • Freenoise change vae_batch_size to decode_chunk_size (#9110)
  • @shauray8
    • add PAG support for SD architecture (#8725)
  • @gnobitab
    • [Tencent Hunyuan Team] Add HunyuanDiT-v1.2 Support (#8747)
    • [Tencent Hunyuan Team] Add checkpoint conversion scripts and changed controlnet (#8783)
  • @yiyixuxu
    • [doc] add a tip about using SDXL refiner with hunyuan-dit and pixart (#8735)
    • [hunyuan-dit] refactor HunyuanCombinedTimestepTextSizeStyleEmbedding (#8761)
    • correct attention_head_dim for JointTransformerBlock (#8608)
    • fix loading sharded checkpoints from subfolder (#8798)
    • Revert "[LoRA] introduce LoraBaseMixin to promote reusability." (#8976)
    • fix load sharded checkpoint from a subfolder (local path) (#8913)
    • add sentencepiece as a soft dependency (#9065)
  • @PommesPeter
    • [Alpha-VLLM Team] Add Lumina-T2X to diffusers (#8652)
  • @IrohXu
    • Add pipeline_stable_diffusion_3_inpaint.py for SD3 Inference (#8709)
  • @maxin-cn
    • Latte: Latent Diffusion Transformer for Video Generation (#8404)
  • @ustcuna
    • [Community Pipelines] Accelerate inference of AnimateDiff by IPEX on CPU (#8643)
  • @tuanh123789
    • add PAG support sd15 controlnet (#8820)
  • @Snailpong
    • 🌐 [i18n-KO] Translated docs to Korean (added 7 docs and etc) (#8804)
  • @asfiyab-nvidia
    • Update TensorRT img2img community pipeline (#8899)
    • Update TensorRT txt2img and inpaint community pipelines (#9037)
  • @ylacombe
    • Stable Audio integration (#8716)
    • Fix Stable Audio repository id (#9016)
  • @sunovivid
    • add PAG support for Stable Diffusion 3 (#8861)
  • @zRzRzRzRzRzRzR
    • Add CogVideoX text-to-video generation model (#9082)