Video Lesson on Advanced Multimodal AI Concepts
Low-level details about the Video Format, Contrastive Learning, the CLIP Model, how VLMs work, Transformers vs. CNNs, and Context Learning
Welcome to Neural Bits. Each week, get one deep-dive article covering advanced, production-ready AI/ML development.
Subscribe to join 5900+ AI/ML Engineers learning how to build production-ready AI Systems.
This article is an extra module to the Open Source Kubrick course, which I’ve built in collaboration with The Neural Maze.
For this article, I’ve recorded two videos on advanced topics in Multimodal AI, which will help you understand video formats as well as Multimodal Models, such as CLIP and Vision Language Models (VLMs), alongside other insights.
Find the full Kubrick Video Course below:
If you’re not familiar with the course, my recommendation is to watch the full video walkthrough first, and then learn from these two extra deep dives, which focus on more advanced Deep Learning and Multimodal Data concepts.
Happy learning!
Introduction
How Video Format Works (7m)
Here you’ll learn low-level details about the Video Format and how any video player (QuickTime Player, VLC, OpenCV, etc.) reads a video file, decodes it, and displays the frames and plays the audio at the right time.
Summary of the topics:
Opening an MP4 video in Hex Format (see the box-walking sketch after this list)
Reading and explaining the Video Header and Encoded Packets
Learning how Video Re-Encoding works (see the ffmpeg sketch after the redactions below)
Learning about different codecs: H.264 (AVC) vs. H.265 (HEVC)
Learning about ISO Multimedia Standards
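To make the hex-level walkthrough concrete, here is a minimal Python sketch (my own illustration, not the course code) that walks the top-level ISO BMFF boxes of an MP4 file and prints each box’s type and size; the sample.mp4 path is a placeholder.

```python
import struct

def list_mp4_boxes(path: str) -> None:
    """Print the top-level ISO BMFF boxes of an MP4 file.

    Each box starts with a 4-byte big-endian size followed by a
    4-byte ASCII type, e.g. 'ftyp', 'moov' (metadata), 'mdat' (encoded packets).
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:
                # A 64-bit "largesize" field follows the box type
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            elif size == 0:
                # Box extends to the end of the file
                print(f"{box_type.decode('ascii', 'replace')}  (to end of file)")
                break
            print(f"{box_type.decode('ascii', 'replace')}  {size} bytes")
            # Jump over the payload to land on the next top-level box
            f.seek(size - header_len, 1)

list_mp4_boxes("sample.mp4")  # placeholder path
```

On a typical MP4 you will usually see ftyp first, followed by moov (the header with track and timing metadata) and mdat (the encoded packets), though their exact order can vary.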
Redactions:
1/ In the video, I said H.265 keeps better lighting and softer shadows.
- That’s true in the context of the H.265 codec being compatible with HDR (High Dynamic Range, e.g., Dolby Vision). The codec itself is only a better compression method than H.264.
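As an illustration of re-encoding and codec choice (not code from the course), the sketch below shells out to the ffmpeg CLI from Python. libx264 and libx265 are the common encoder names and CRF controls the quality/size trade-off, though exact availability depends on your ffmpeg build.

```python
import subprocess

def reencode(src: str, dst: str, codec: str = "libx265", crf: int = 28) -> None:
    """Re-encode a video with ffmpeg, copying the audio stream untouched.

    codec: 'libx264' (H.264/AVC) or 'libx265' (H.265/HEVC).
    crf:   constant rate factor -- lower means higher quality and bigger files.
    """
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-c:v", codec, "-crf", str(crf),
         "-c:a", "copy",
         dst],
        check=True,
    )

# Placeholder paths; H.265 usually reaches similar visual quality at a noticeably lower bitrate.
reencode("input.mp4", "output_hevc.mp4", codec="libx265", crf=28)
```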
How the CLIP Model and VLMs Work in General
Here you’ll learn about Contrastive Learning, the loss objective CLIP optimizes, how it was trained, and other low-level architecture and workflow details. You’ll also learn about Image Encoders, Patching, CNN Receptive Fields, the Vision Transformer, and how CLIP can be used as part of VLMs for the VQA (Visual Question Answering) task, going through a VLM architecture step by step.
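To make the loss objective concrete, here is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE-style) loss CLIP optimizes. The function name, batch size, embedding dimension, and fixed temperature are illustrative; CLIP actually learns the temperature (logit scale) as a parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over N matching image/text embedding pairs.

    Matching pairs sit on the diagonal of the NxN similarity matrix;
    every off-diagonal entry acts as a negative.
    """
    # L2-normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.T / temperature            # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)              # image -> text
    loss_t2i = F.cross_entropy(logits.T, targets)            # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings: batch of 8 pairs, 512-dim embeddings
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```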
Summary of the topics:
CLIP Model Card and the Model Scope
How it was trained: Contrastive Learning and the Loss Function
Vision Transformer, Image Patching, Positional Embeddings (see the patch-embedding sketch after this list)
How ViT compares to CNNs’ Receptive Fields when learning Image Features
Using interactive 3D Vectors in Desmos UI to showcase Contrastive Loss
Explaining CLIP as part of the Image Encoder in a VLM Architecture
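For the patching and positional-embedding part, here is a small, self-contained sketch of a ViT-style patch-embedding layer. The numbers (224-pixel images, 16-pixel patches, 768-dim tokens) are typical ViT-Base settings used for illustration, not CLIP’s exact implementation, and the learnable [CLS] token that ViT prepends is omitted for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT-style patch embedding: split an image into non-overlapping
    patches, project each one to the model dimension, and add a learned
    positional embedding per patch token.

    A conv with kernel_size == stride == patch_size is the standard trick
    for doing 'split into PxP patches + linear projection' in one step.
    """
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, dim) -- one token per patch
        return x + self.pos_emb               # inject position information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```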
Redactions:
1/ In the video, I mention that ViT is better at learning Global Image Context compared to CNNs.
- That’s true if we compare networks of similar size. Vision Transformers can learn global features because of how Attention works: every patch token can attend to every other token from the very first layer. CNNs are still capable of that, but we need to increase the network depth so that the Receptive Fields capture more context as they grow through the stacked layers.
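A quick way to see the depth argument is to compute the receptive field of a stack of convolutions. The helper below (my own illustration, not course code) uses the standard recurrence for kernel size k and stride s:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    Recurrence: rf += (k - 1) * jump, then jump *= stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten stride-1 3x3 convs still only 'see' a 21x21 window of the input,
# whereas a single self-attention layer already relates every patch to every other.
print(receptive_field([(3, 1)] * 10))  # -> 21
```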
Ending Notes
If you’ve enjoyed these and want to stay updated when I post more similar content focused on End-to-End projects, make sure to also follow me on:
1/ Daily Content on AI Engineering
2/ Code Resources & OSS Courses
3/ Advanced AI Deep Dives (>1h, coming soon)