04. Machine Learning

Image Captioning

Upload an image and a neural network generates a natural-language description. All inference runs in your browser.

Original assignment: a VGG16 encoder and LSTM decoder trained on MS-COCO captions, with Apache Spark handling distributed preprocessing. The live demo uses ViT-GPT2, a Vision Transformer encoder paired with a GPT-2 decoder, loaded on demand.

Load the captioning model

The first load downloads approximately 100 MB of model weights from Hugging Face. After that, captions are generated locally in your browser with no server involved.

Upload an image

Click to select an image, or drag and drop
JPEG, PNG, WebP supported

How it works

The original university assignment paired a VGG16 convolutional encoder (pretrained on ImageNet) with an LSTM caption decoder and trained the model on a subset of MS-COCO captions, using Apache Spark for distributed preprocessing.

The live demo runs ViT-GPT2, a pretrained image-to-text model from Hugging Face. A Vision Transformer (ViT) encodes the image into a sequence of patch embeddings, and a GPT-2 decoder attends to those embeddings to generate the caption one token at a time. The model is exported to ONNX and quantized, and it runs locally in your browser via Transformers.js.
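The encode-then-decode flow described above can be illustrated with a toy sketch. Nothing here is the demo's actual code: the 224-pixel input and 16-pixel patch size are assumed (common ViT defaults, not stated on this page), greedy decoding is assumed rather than confirmed, and `patchCount`, `greedyDecode`, and `toyDecoder` are hypothetical names standing in for the real encoder and decoder.

```typescript
// Illustrative only: toy stand-ins for the ViT encoder and GPT-2 decoder.

/** Number of patch embeddings a ViT produces for a square image. */
function patchCount(imageSize: number, patchSize: number): number {
  if (imageSize % patchSize !== 0) {
    throw new Error("imageSize must be a multiple of patchSize");
  }
  const perSide = imageSize / patchSize;
  return perSide * perSide;
}

/** One decoder step: given the encoder output and the tokens so far, pick the next token. */
type NextToken = (patches: number[][], tokens: string[]) => string;

/** Greedy autoregressive decoding: append tokens until the end marker or the length cap. */
function greedyDecode(
  patches: number[][],
  next: NextToken,
  endToken: string = "<eos>",
  maxTokens: number = 16,
): string {
  const tokens: string[] = [];
  while (tokens.length < maxTokens) {
    const token = next(patches, tokens);
    if (token === endToken) break;
    tokens.push(token);
  }
  return tokens.join(" ");
}

// Toy decoder that replays a fixed caption; a real decoder would attend over `patches`.
const canned = ["a", "dog", "on", "a", "beach", "<eos>"];
const toyDecoder: NextToken = (_patches, tokens) => canned[tokens.length] ?? "<eos>";

// Assumed sizes: a 224x224 image cut into 16x16 patches (placeholder 1-d embeddings).
const patches = Array.from({ length: patchCount(224, 16) }, () => [0]);
console.log(patches.length); // 196
console.log(greedyDecode(patches, toyDecoder)); // "a dog on a beach"
```

The point of the sketch is the control flow: the image is encoded once into a fixed set of patch embeddings, and the decoder is then called repeatedly, each call seeing the same embeddings plus every token produced so far.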

Your images never leave your device. Inference runs in WebAssembly inside your browser tab, and nothing is sent to a server.
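A minimal sketch of how on-demand, in-browser loading might be wired up, assuming the Transformers.js `pipeline(task, model)` call with the `image-to-text` task. The model id `Xenova/vit-gpt2-image-captioning` is an assumption (this page does not name the checkpoint it loads), and `createCaptioner` is a hypothetical helper. The pipeline factory is passed in as a parameter so the sketch stays self-contained.

```typescript
// Shape of the result an image-to-text pipeline returns.
type CaptionResult = Array<{ generated_text: string }>;
type Captioner = (image: string) => Promise<CaptionResult>;
// Mirrors how Transformers.js's `pipeline(task, model)` is called.
type PipelineFactory = (task: string, model: string) => Promise<Captioner>;

// Assumed model id; the page does not say which checkpoint it loads.
const MODEL_ID = "Xenova/vit-gpt2-image-captioning";

/** Returns a caption function that loads the model on first use and reuses it afterwards. */
function createCaptioner(loadPipeline: PipelineFactory, modelId: string = MODEL_ID) {
  let pending: Promise<Captioner> | null = null;

  return async function caption(imageUrl: string): Promise<string> {
    // Start the (large) download only once, even if several calls arrive together.
    if (pending === null) {
      pending = loadPipeline("image-to-text", modelId);
    }
    const captioner = await pending;
    const results = await captioner(imageUrl);
    return results.length > 0 ? results[0].generated_text.trim() : "";
  };
}
```

In the browser, `loadPipeline` would be the `pipeline` function exported by Transformers.js, and `imageUrl` could be an object URL created from the selected file with `URL.createObjectURL(file)`, so the image bytes are read locally rather than sent anywhere.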