11 Comments
mohamed sheded:

Thanks, this is very helpful! How do you handle the rate-limit challenge for longer videos with many frames?

For example, with 120 frames, GPT-4o-mini will exceed the rate limit, right?

Also, where is Pixeltable storing this data? Is it persistent, so I can get back to it?

Alex Razvant:

Pixeltable handles rate-limiting automatically; check the OpenAIRateLimitsInfo class in the library implementation here: https://github.com/pixeltable/pixeltable/blob/main/pixeltable/functions/openai.py

And yes, Pixeltable caches all processed tables either locally or in remote storage (if you specify a connector when setting up Pixeltable), so you can get back to them by simply calling pxt.get_table(<name>).

By default, Pixeltable creates a .pixeltable folder in your home directory (~), where it stores:

- each extracted image frame (png, jpg)

- each audio clip (as mp4)

- the embedding indexes (with pgvector)

- the text chunks, JSONs, and other metadata.

So yes, once you process something with Pixeltable, all the data is persisted.
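For example, here's a minimal sketch of that round-trip (the table name 'video_frames' and the schema are hypothetical, just for illustration):

import pixeltable as pxt

# First run: create a table; Pixeltable persists it under PIXELTABLE_HOME
# (~/.pixeltable by default), together with any computed columns.
# frames = pxt.create_table('video_frames', {'frame': pxt.Image})

# Any later run: reload the same table by name, no reprocessing needed.
frames = pxt.get_table('video_frames')
print(frames.count())  # the rows survive across restarts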

mohamed sheded:

I'm using Google Colab and I didn't see any folder/files for Pixeltable's persistent storage. Should I specify any attributes for that?

Alex Razvant:

You can configure the PIXELTABLE_HOME env variable to set the cache path.

Please see the docs here: https://docs.pixeltable.com/docs/overview/configuration
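For example, in Colab a common pattern is to point PIXELTABLE_HOME at a mounted Google Drive folder so the data survives runtime resets (the Drive path below is just an example, and the variable must be set before the first pixeltable import):

import os
from google.colab import drive

drive.mount('/content/drive')
# Set the cache path before importing pixeltable for the first time.
os.environ['PIXELTABLE_HOME'] = '/content/drive/MyDrive/.pixeltable'  # example path

import pixeltable as pxt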

Valentin Jimenez:

Nice! You are digging deep into the papers. Awesome! I have a question: in Pixeltable, we are using the CLIP model, particularly for caption creation. However, one of the articles mentions that BLIP has more powerful capabilities, right? BLIP is currently not available in Pixeltable; the available models can be found here:

https://docs.pixeltable.com/docs/integrations/models#clip-models

To my understanding, BLIP is more powerful, at least in the sense that if we provide noisier data, it will still produce good results, if I understood the article correctly. So, if we used BLIP, I guess we could reduce the video quality even more to lower the costs of calling the gpt-4o-mini model for captions. Is my understanding correct?

Alex Razvant:

Hey Valentin,

You got it mostly right!

CLIP and BLIP are similar, but they differ in the tasks they're designed for.

In this project, we indeed use CLIP, but not for captioning. We use it to generate embeddings of entire frames, so that we can search our image index based on an image provided by the user.

CLIP is a contrastive model; its main strength is producing high-level embeddings that match an image with its text description. The pre-trained variant is better at classifying than at describing.

BLIP, on the other hand, is fine-tuned downstream for VQA (Visual Question Answering), meaning we can prompt it and it returns an image caption aligned to the prompt.

For example, if we pass CLIP an image of a dog, the best it can do is match it to something like "a photo of a dog"; we can't prompt it with "Describe what's in the image, in rich detail".
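To make that concrete, here's a minimal zero-shot classification sketch with the Hugging Face transformers CLIP classes (the model name, image path, and candidate labels are illustrative, not taken from the course code):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('dog.jpg')  # hypothetical local frame
labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
inputs = processor(text=labels, images=image, return_tensors='pt', padding=True)
logits = model(**inputs).logits_per_image  # image-to-text similarity scores
print(labels[logits.softmax(dim=-1).argmax().item()])  # picks the best-matching label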

For BLIP, we could pass in the image plus the prompt, and it'll give us a caption that's aligned with the prompt. This is possible because BLIP was specifically adapted for VQA.

Important: I explained BLIP as an example of how VLMs for image captioning (VQA) work, but we're not using the BLIP model for that in this project. We use GPT-4o-mini.

To reduce the cost of image captioning, we resize each image from its original size, which is mostly Full HD (1920x1080) for videos, to a fixed resolution of 1024x768. That means fewer tokens for GPT-4o-mini to process, which equals less cost.
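As a standalone illustration of that resizing step (a plain PIL sketch with a hypothetical file name; in the pipeline this happens on the extracted frames before they're sent to the API):

from PIL import Image

frame = Image.open('frame_0001.jpg')  # e.g. a 1920x1080 extracted frame
small = frame.resize((1024, 768))     # fewer pixels -> fewer image tokens -> lower cost
small.save('frame_0001_small.jpg')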

To remove the image captioning costs altogether, we could switch to doing that locally (see the sketch after this list):

- Implement a class that loads the BLIP model.

- Define a predict() method that takes in a PIL.Image and returns a string.

- Add this predict() method as a computed column to our table, replacing the openai.vision call we're currently making.
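A minimal sketch of what that could look like, using the Hugging Face BLIP captioning checkpoint and a Pixeltable UDF (the model name, the table name 'frames', and the column names are assumptions; double-check the add_computed_column signature against your Pixeltable version):

import PIL.Image
import pixeltable as pxt
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')
blip = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')

@pxt.udf
def blip_caption(img: PIL.Image.Image) -> str:
    # BLIP accepts an optional text prefix to condition the caption on.
    inputs = processor(images=img, text='a detailed description of', return_tensors='pt')
    out = blip.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True)

# Replace the openai.vision computed column with the local UDF (no API cost).
frames = pxt.get_table('frames')  # hypothetical table/column names
frames.add_computed_column(caption=blip_caption(frames.frame))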

Hope it helped ;)

Valentin Jimenez:

Awesome, Alex! Thanks for taking the time to read and reply. This clarified a lot. I get what you mean now.

I think I missed some points because I was rushing between your article, the CLIP and BLIP papers 😅. I probably need to slow down a bit 😆, but these multimodal topics are really exciting.

By the way, I noticed there's so much research and so many approaches in the multimodal space, which is expected! For example, in your course, you present the idea of creating separate indexes. I started wondering if, instead, there could be a way to create a single latent space that combines audio, image, and text. While looking into this, I came across this (very recent) paper:

https://arxiv.org/pdf/2502.03897

I haven't read it yet, but it seems to explore that idea. All in all, what I wanted to say is that another idea for your posts, one that I think would also help everyone, is a course or article on how to go from an idea → to a paper → to a GitHub repo → to your own implementation, including how to choose the right tools.

Anyway, just thought I’d share the idea, and you might already have thought of it.

Thanks again for your replies; they were super insightful! 🙏🙏🙏

Bharadwaj:

Wow, that's an impressive explanation!!!

Alex Razvant:

Hey, very happy to hear that! Thanks 🙏

Dhanya MD:

amazing!! 🙌

Alex Razvant:

Thank you, Dhanya, glad you liked it! Stay tuned for the next lessons too ;)
