04. Machine Learning
Upload an image and a neural network generates a natural-language description. All inference runs in your browser.
The first load downloads approximately 100 MB of model weights from Hugging Face. After that, captions are generated locally in your browser, with no server involved.
These captions were generated by the original VGG16+LSTM model trained on the Flickr8k dataset during the university assignment.
The original university assignment trained a VGG16 convolutional encoder (pretrained on ImageNet) paired with an LSTM caption decoder on a subset of MS-COCO captions, using Apache Spark for distributed preprocessing.
The live demo runs ViT-GPT2, a pretrained image-to-text model from Hugging Face. A Vision Transformer (ViT) encodes the image into patch embeddings, which a GPT-2 language model decoder then converts into a natural-language caption. The quantized ONNX model runs locally in your browser via Transformers.js.
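To make the "patch embeddings" step more concrete, here is a minimal TypeScript sketch of how a ViT-style encoder might cut an image into fixed-size patches before embedding them. It is an illustration only, not the demo's actual code: the function name `extractPatches`, the interleaved RGB pixel layout, and the 224x224 input with 16x16 patches (a common ViT-Base configuration) are all assumptions. In a real model, each flattened patch is then passed through a learned linear projection and combined with a position embedding, which this sketch leaves out.

```typescript
// Split a square image into flattened, non-overlapping patches, ViT-style.
// Assumed layout: row-major pixels with interleaved channels (R, G, B, R, G, B, ...).
// `extractPatches` is a hypothetical helper, not part of Transformers.js.
function extractPatches(
  pixels: Float32Array,
  imageSize: number,
  patchSize: number,
  channels: number = 3,
): Float32Array[] {
  if (imageSize % patchSize !== 0) {
    throw new Error("imageSize must be divisible by patchSize");
  }
  if (pixels.length !== imageSize * imageSize * channels) {
    throw new Error("pixel buffer does not match imageSize and channels");
  }

  const perSide = imageSize / patchSize; // patches along each edge
  const patches: Float32Array[] = [];

  // Walk the patch grid left to right, top to bottom.
  for (let py = 0; py < perSide; py++) {
    for (let px = 0; px < perSide; px++) {
      const patch = new Float32Array(patchSize * patchSize * channels);
      let k = 0;
      // Copy the pixels of this patch, row by row, into one flat vector.
      for (let y = 0; y < patchSize; y++) {
        for (let x = 0; x < patchSize; x++) {
          const src =
            ((py * patchSize + y) * imageSize + (px * patchSize + x)) * channels;
          for (let c = 0; c < channels; c++) {
            patch[k++] = pixels[src + c];
          }
        }
      }
      patches.push(patch);
    }
  }
  return patches;
}

// Example (assumed sizes): a blank 224x224 RGB image with 16x16 patches.
const image = new Float32Array(224 * 224 * 3);
const patches = extractPatches(image, 224, 16);
console.log(patches.length, patches[0].length); // 196 768
```

Under these assumed sizes the encoder sees a sequence of 196 patch vectors (a 14 by 14 grid), each with 768 raw values, and the decoder attends to the embeddings derived from that sequence when generating the caption.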
No images are uploaded to any server. Everything runs in WebAssembly inside your browser tab.