RGB no more: Minimally-decoded JPEG Vision Transformers
CVPR 2023


Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data loading. Existing works have studied this aspect, but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which, to our knowledge, has not been explored in depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.

CVPR 2023 Video


Decoding JPEG is expensive. Typical neural networks for computer vision receive RGB pixels as inputs. However, RGB images are often stored on disk as compressed JPEG files. Decoding JPEG requires an expensive inverse discrete cosine transform (DCT); avoiding it by training directly from the DCT coefficients can theoretically reduce compute by up to 87.6%. This eases the data loading bottleneck and accelerates the entire pipeline.
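To make the cost concrete, here is a minimal sketch of the 8x8 transform pair that sits at the core of JPEG. It is an illustration only, not the decoder used in this work: it uses the orthonormal DCT-II, ignores quantization, level shifting and entropy coding, and all names (`dct2`, `idct2`, `BASIS`) are hypothetical. An RGB pipeline has to run something like `idct2` on every block, while a DCT pipeline can hand the coefficients to the model as they are.

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def _alpha(k):
    # Normalization that makes the DCT-II basis orthonormal.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# BASIS[k][n] = alpha(k) * cos((2n + 1) * k * pi / (2N))
BASIS = [
    [_alpha(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
    for k in range(N)
]


def dct2(block):
    """Forward 2D DCT-II of an 8x8 pixel block (encoder side)."""
    return [
        [
            sum(BASIS[u][m] * BASIS[v][n] * block[m][n]
                for m in range(N) for n in range(N))
            for v in range(N)
        ]
        for u in range(N)
    ]


def idct2(coeffs):
    """Inverse 2D DCT of an 8x8 coefficient block (the step an RGB decoder must run)."""
    return [
        [
            sum(BASIS[u][m] * BASIS[v][n] * coeffs[u][v]
                for u in range(N) for v in range(N))
            for n in range(N)
        ]
        for m in range(N)
    ]


# Round trip on a synthetic block: the pixels are recovered up to float error.
pixels = [[float((3 * m + 5 * n) % 16) for n in range(N)] for m in range(N)]
coeffs = dct2(pixels)
restored = idct2(coeffs)
max_error = max(abs(restored[m][n] - pixels[m][n])
                for m in range(N) for n in range(N))
print(f"max round-trip error: {max_error:.2e}")
```

This naive form costs 64 multiply-adds per output value; real decoders use fast factorizations, but the inverse transform still has to run for every block of every image.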


We improve upon prior works in two key ways: model architecture and data augmentation.

Model Architecture

ViTs are well suited to train from DCT. JPEG represents an RGB image as a grid of 8x8 blocks of DCT coefficients, stored separately for brightness and color. The color channels are also typically downsampled by a factor of 2. Prior works using CNNs require nontrivial modifications to the architecture, limiting their adaptability to existing networks.
  We instead use Vision Transformers (ViTs) to overcome this issue. ViTs work by encoding a grid of image patches into embeddings. This is a natural match for JPEG, which also represents images as a grid. We show that ViTs can be easily adapted to DCT by modifying only the patch embedding layer; the rest of the model can remain untouched.
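As a rough illustration of why only the patch embedding has to change, the sketch below builds one token per 16x16-pixel patch directly from DCT blocks. It assumes the color channels are downsampled by 2 in both directions, so each patch covers 2x2 brightness blocks plus one block from each color channel, i.e. 6 x 64 = 384 coefficients. The 16x16 patch size, the plain concatenation, and the names `dct_patch_tokens` and `linear_embed` are assumptions made for this example; the embedding layout used in the paper may differ.

```python
BLOCK = 64  # coefficients in one flattened 8x8 DCT block


def dct_patch_tokens(y, cb, cr):
    """Group DCT blocks into one flat token per 16x16-pixel patch.

    y:      grid [2*ph][2*pw] of flattened brightness blocks (64 values each)
    cb, cr: grids [ph][pw] of flattened color blocks (downsampled by 2)
    Returns ph*pw tokens, each 6*64 = 384 values long.
    """
    ph, pw = len(cb), len(cb[0])
    assert len(y) == 2 * ph and len(y[0]) == 2 * pw, "expected 2x color downsampling"
    tokens = []
    for i in range(ph):
        for j in range(pw):
            token = []
            for di in (0, 1):          # the 2x2 brightness blocks under this patch
                for dj in (0, 1):
                    token.extend(y[2 * i + di][2 * j + dj])
            token.extend(cb[i][j])     # one block per color channel
            token.extend(cr[i][j])
            tokens.append(token)
    return tokens


def linear_embed(tokens, weight, bias):
    """Plain linear projection, standing in for the ViT patch embedding layer."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


# A 32x32 image: 4x4 brightness blocks and 2x2 blocks per color channel.
y = [[[float(10 * i + j)] * BLOCK for j in range(4)] for i in range(4)]
cb = [[[100.0 + 2 * i + j] * BLOCK for j in range(2)] for i in range(2)]
cr = [[[200.0 + 2 * i + j] * BLOCK for j in range(2)] for i in range(2)]

tokens = dct_patch_tokens(y, cb, cr)
embed_dim = 4  # tiny toy width, chosen only for the demo
weight = [[0.001 * (k + 1)] * (6 * BLOCK) for k in range(embed_dim)]
bias = [0.0] * embed_dim
embedded = linear_embed(tokens, weight, bias)
print(len(tokens), len(tokens[0]), len(embedded[0]))
```

From this point on, the token sequence has the same shape as in an RGB ViT, so the transformer blocks that follow need no changes.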

Data Augmentation



We directly augment DCT instead of converting to RGB. Data augmentation is critical for training accurate networks. However, augmenting in DCT has not been well explored. Prior works therefore convert DCT to RGB, augment in RGB, and convert the result back to DCT. This incurs multiple expensive DCT transforms, negating the benefits of using DCT during training.
  We instead directly augment DCT by analyzing the relationship between RGB and DCT. We implement all augmentations used by RandAugment and introduce several augmentations that are natural for DCT. This way, we can benefit from faster data loading even during training.
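Two standard DCT identities show how a pixel-space augmentation can be expressed on coefficients alone. With the orthonormal 8x8 DCT, adding a constant to every pixel changes only the DC coefficient (by 8 times that constant), and mirroring a block horizontally flips the sign of every odd horizontal frequency. The sketch below is a hedged illustration of those two identities, not the authors' implementation: it works on unquantized coefficients (with quantized data the DC offset would need to be divided by the quantization step), ignores pixel clipping, and all function names are hypothetical. The `dct2` helper is included only to check the result against pixel space.

```python
import math

N = 8  # JPEG block size


def adjust_brightness(coeffs, delta):
    """Equivalent to adding `delta` to every pixel: only the DC term changes."""
    out = [row[:] for row in coeffs]
    out[0][0] += N * delta
    return out


def hflip_block(coeffs):
    """Equivalent to mirroring the pixel block left-right.

    coeffs[u][v]: u is the vertical frequency, v the horizontal frequency.
    """
    return [[-c if v % 2 else c for v, c in enumerate(row)] for row in coeffs]


def hflip_image(blocks):
    """Flip a grid of blocks: reverse the block order in each row, then flip each block."""
    return [[hflip_block(b) for b in reversed(row)] for row in blocks]


# --- reference DCT, used only to verify the identities against pixel space ---
def _alpha(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


BASIS = [
    [_alpha(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
    for k in range(N)
]


def dct2(block):
    return [
        [
            sum(BASIS[u][m] * BASIS[v][n] * block[m][n]
                for m in range(N) for n in range(N))
            for v in range(N)
        ]
        for u in range(N)
    ]


def max_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(N) for j in range(N))


pixels = [[float((7 * m + 3 * n * n) % 23) for n in range(N)] for m in range(N)]
coeffs = dct2(pixels)

# Augment in pixel space and transform, versus augment the coefficients directly.
flip_error = max_diff(dct2([row[::-1] for row in pixels]), hflip_block(coeffs))
bright_error = max_diff(
    dct2([[p + 10.0 for p in row] for row in pixels]),
    adjust_brightness(coeffs, 10.0),
)
print(f"flip error: {flip_error:.2e}, brightness error: {bright_error:.2e}")
```

Both operations touch each coefficient at most once and never invoke a forward or inverse DCT, which is the property that keeps data loading fast during training.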


Our ViT-Ti models show up to 39.2% faster training and 17.9% faster evaluation without loss in accuracy. Directly augmenting DCT improves data loading speed by up to 93.2% compared to prior works. We also adapt SwinV2 using our method and show 12% faster evaluation.

