Image Captioning

Select an image and a neural network generates a natural-language description of it. All inference runs in your browser.

Original assignment: VGG16 encoder + LSTM decoder trained on MS-COCO captions using Apache Spark. Live demo uses ViT-GPT2, a Vision Transformer captioning model loaded on demand.

Load the captioning model

The first load downloads approximately 100 MB of model weights from Hugging Face. After that, captions are generated directly in your browser with no server involved.

Sample captions from training data

These captions were generated by the original VGG16+LSTM model trained on the Flickr8k dataset during the university assignment.

How it works

The original university assignment trained a VGG16 convolutional encoder (pretrained on ImageNet) paired with an LSTM caption decoder on a subset of MS-COCO captions, using Apache Spark for distributed preprocessing.

The live demo runs ViT-GPT2, a pretrained image-to-text model from Hugging Face. A Vision Transformer (ViT) encoder splits the image into patches and encodes them as patch embeddings, which a GPT-2 language model decoder then converts into a natural-language caption. The quantized ONNX model runs locally in your browser via Transformers.js.
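To make the patch-embedding step concrete, here is a minimal sketch of how many tokens a ViT encoder hands to the decoder. It assumes a 224x224 input and 16x16 patches, which is the common ViT-Base configuration; this page does not state the exact sizes, and `patchGrid` is a hypothetical helper written for illustration only.

```typescript
// Hypothetical helper: counts the patch tokens a ViT encoder produces.
// The sizes used below (224 and 16) are assumptions, not values from this page.
interface PatchGrid {
  perSide: number; // patches along one edge of the image
  patches: number; // total number of image patches
  tokens: number;  // patches plus one classification ([CLS]) token
}

function patchGrid(imageSize: number, patchSize: number): PatchGrid {
  if (imageSize <= 0 || patchSize <= 0 || imageSize % patchSize !== 0) {
    throw new Error("imageSize must be a positive multiple of patchSize");
  }
  const perSide = imageSize / patchSize;
  const patches = perSide * perSide;
  // ViT prepends a single [CLS] token to the patch sequence.
  return { perSide, patches, tokens: patches + 1 };
}

const grid = patchGrid(224, 16);
console.log(grid); // { perSide: 14, patches: 196, tokens: 197 }
```

Under these assumptions the decoder attends over 197 encoder outputs while it generates the caption one token at a time.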

No images are uploaded to any server. Everything runs in WebAssembly inside your browser tab.
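The in-browser flow described above could be wired up roughly as follows. This is a sketch, not the demo's actual source: `captionImage` and the `Captioner` type are hypothetical names, and the package name and model id in the comments (`@xenova/transformers`, `Xenova/vit-gpt2-image-captioning`) are assumptions about which Transformers.js build and checkpoint are in use.

```typescript
// One result item, in the shape an image-to-text pipeline returns.
interface CaptionResult {
  generated_text: string;
}

// Hypothetical type: anything that maps an image URL to caption results.
type Captioner = (image: string) => Promise<CaptionResult[]>;

// Hypothetical helper: runs the captioner and returns the first caption.
async function captionImage(captioner: Captioner, imageUrl: string): Promise<string> {
  const results = await captioner(imageUrl);
  const first = results[0];
  if (!first) {
    throw new Error("captioner returned no results");
  }
  return first.generated_text.trim();
}

// In the browser, a captioner could be created with Transformers.js
// (assumed package and model id, not confirmed by this page):
//
//   import { pipeline } from "@xenova/transformers";
//   const captioner = await pipeline("image-to-text", "Xenova/vit-gpt2-image-captioning");
//   const caption = await captionImage(captioner, URL.createObjectURL(file));
//
// The object URL points at the file in the tab's memory, so the image
// itself is never sent over the network.
```

Passing the captioner in as an argument keeps the helper independent of how the model is loaded, so the roughly 100 MB download happens once and the same pipeline instance can be reused for every image.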