Compressing images and audio with diffusion latents

Did you know that you can use diffusion latents to compress images, sound, and other media?

This post from 2022 by Matthias Bühlmann goes into a lot of detail for images: Stable Diffusion Based Image Compression

Compressing audio with Stable Audio!

I wrote a "compression" script using the VAE from Stable Audio Open 1.0 and tried it on some samples.

That's All Folks:

Original (1,321.1 KB WAV, 102.8 KB MP3):
VAE-encoded (41.2 KB, assuming 32-bit float):
VAE-encoded, 2-bit quantized (3.2 KB):
VAE-encoded, 1-bit quantized (1.9 KB):

What's New Pussycat:

Original (727.2 KB WAV, 105.8 KB MP3):
VAE-encoded (22.5 KB, assuming 32-bit float):
VAE-encoded, 2-bit quantized (1.9 KB):
VAE-encoded, 1-bit quantized (1.2 KB):

* Note: The audio files above are encoded as MP3, to avoid casually sending large WAV files over the internet. IMO they still demonstrate the quality difference well. If you'd like the raw WAV files, you can download an archive here: audio_compressed.7z

The latent shape for "That's All Folks" is [1 x 64 x 161], which comes to 10304 values. Stored as 32-bit floating point values, that takes up 41216 bytes, or 41.2 kilobytes. Stored at one bit per value (1-bit quantization), it takes up 1288 bytes.
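To make the arithmetic easy to re-check, here is a small sketch that computes the raw payload size for a latent at a given bit depth. The helper name `latent_size_bytes` is made up for illustration and is not taken from the script.

```python
from math import prod

def latent_size_bytes(shape, bits_per_value):
    """Raw payload size in bytes for a latent of `shape` (hypothetical helper).

    Rounds up to a whole number of bytes.
    """
    n_values = prod(shape)
    return (n_values * bits_per_value + 7) // 8

shape = (1, 64, 161)  # latent shape reported for "That's All Folks"
print(prod(shape))    # 10304 values

for bits in (32, 2, 1):
    print(f"{bits:>2}-bit: {latent_size_bytes(shape, bits)} bytes")
# 32-bit: 41216 bytes
#  2-bit: 2576 bytes
#  1-bit: 1288 bytes
```

These are payload sizes only. The 2-bit and 1-bit files listed above are somewhat larger than these numbers, so the files evidently hold more than just the packed values.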

For reference, all of these are smaller than the lowest quality OGG export I could manage from Audacity, which came out at 52.1 kilobytes.

The script I wrote is a Gist on GitHub here: https://gist.github.com/nukep/dfdd06a7d3ceabbf0f8b9a23972f04c6
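The Gist is the place to look for what the script actually does. Purely as an illustration of what 1-bit quantization of a latent can mean, here is a minimal sketch that keeps only the sign of each value and packs eight signs per byte. The sign-based scheme, the function names, and the `magnitude` parameter are all assumptions for this example, not a description of the Gist.

```python
def quantize_1bit(values):
    """Keep only the sign of each value, packed 8 per byte (MSB first).

    Assumed scheme for illustration: bit 1 means value >= 0, bit 0 means value < 0.
    """
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v >= 0:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

def dequantize_1bit(packed, n_values, magnitude=1.0):
    """Map each stored bit back to +magnitude or -magnitude.

    `magnitude` is a hypothetical reconstruction level chosen by the caller.
    """
    return [
        magnitude if packed[i // 8] & (0x80 >> (i % 8)) else -magnitude
        for i in range(n_values)
    ]

# A tiny stand-in for a flattened latent (made-up numbers).
latent = [0.7, -1.2, 0.1, -0.3, 2.0, -0.5, -0.9, 0.4, 1.1]

packed = quantize_1bit(latent)
print(len(packed), packed.hex())   # 2 a980
print(dequantize_1bit(packed, len(latent)))
# [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
```

Nine values fit in two bytes here. The decoder needs to know the value count (and the latent shape) separately, since the last byte is padded with zero bits.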

What is a latent, anyway?

Well, in short, it is compressed data. The motivation for using latents during diffusion is to process less data at a time, in a more semantically aware manner.

For something like Stable Diffusion 1.5, each 8x8 block of pixels is represented as 4 floating point values.

In ComfyUI, an image of shape [B,H,W,3] (the diffusers library orders the same data as [B,3,H,W]) becomes an SD 1.5 latent of shape [B,4,H/8,W/8]. An 8x8 block goes from 192 values (8 x 8 x 3) down to 4.
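The shape bookkeeping can be written out as a small sketch. It only does the arithmetic described above and does not load or run any model; the function name `sd15_latent_shape` is made up for this example, and it assumes the height and width are multiples of 8.

```python
from math import prod

def sd15_latent_shape(image_shape):
    """Map an image shape [B, H, W, 3] to the SD 1.5 latent shape [B, 4, H/8, W/8].

    Hypothetical helper: shape arithmetic only, no VAE involved.
    """
    b, h, w, c = image_shape
    if c != 3:
        raise ValueError("expected 3 colour channels")
    if h % 8 or w % 8:
        raise ValueError("height and width must be multiples of 8")
    return (b, 4, h // 8, w // 8)

image_shape = (1, 512, 512, 3)
latent_shape = sd15_latent_shape(image_shape)

print(latent_shape)                              # (1, 4, 64, 64)
print(prod(image_shape), prod(latent_shape))     # 786432 16384
print(prod(image_shape) // prod(latent_shape))   # 48, i.e. 192 values down to 4
```

The ratio of 48 counts values, not bytes: pixels are usually stored as 8-bit integers, while each latent value is a float.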

Note that the shape of a latent depends on the model's VAE. SD 3, for example, uses 16 latent channels where SD 1.5 uses 4.

For a breakdown on images, I highly recommend reading this article by Matthias: Stable Diffusion Based Image Compression

Figure: Matthias Bühlmann's attempt at compressing with Stable Diffusion. Source: Matthias Bühlmann