Why Multimodal LLMs Still Struggle with Video — And How AI Agents Can Help

Written by hacker76882811 | Published 2025/08/01
Tech Story Tags: ai | ai-agent | video-analysis | multimodal-ai | ai-architecture | video-understanding-ai | multimodal-llms | ai-video

TL;DR: Multimodal LLMs alone still struggle with complex video. This article walks through the architecture of a video-understanding AI agent: a Master Agent orchestrates the entire workflow, and each part plays a key role in structuring the mess before handing it off to the reasoning engine. The goal is to improve data quality before feeding anything into the LLM for reasoning.

When I first started developing an AI agent for video analysis, I thought it would be easy-peasy; what an optimist I was. My belief was that multimodal large language models (LLMs) like GPT-4o would be powerful enough to analyse any video on their own. The reality hit harder than expected. Despite the hype, we’re still quite far from “single-model magic” when it comes to complex unstructured data such as video, or documents that mix text, images, and graphs. Even the best multimodal LLMs struggle with complex videos, and things get more complicated still when those videos include visual clutter, ambient sound, and spontaneous human speech. I am hesitant to call this a shortfall of intelligence; it is more likely a problem of input readiness.

Let me walk you through what I learned while building an AI agent that can gather high-quality insights from a complex video, and why we need a full pipeline of pre-processing and reasoning modules to make it work.

Why Video Is So Hard for AI

Video is an unstructured, data-rich, chaotic mess of signals spanning multiple modalities. Each video file typically contains two major ones:

  1. Visual data: sequences of frames (images), which may be blurry, poorly lit, or full of rapid motion.
  2. Audio data: a mix of voices, environmental noise, and sometimes music.

Multimodal LLMs are incredibly good at reasoning across modalities, but only after you’ve extracted and cleaned the signal. Without proper pre-processing, even the best models will struggle, and the output will likely be vague or inaccurate.

The Architecture of a Video-Understanding AI Agent

Here’s the agent design I developed (see diagram below). Each part plays a key role in structuring the mess before handing it off to the reasoning engine; it not only structures the data but also cleans and enhances it where necessary.

Master Agent

The Master Agent is the orchestrator of the entire workflow. It coordinates the pre-processing, feature extraction, enhancement, and reasoning tasks, ensuring data moves efficiently from raw input to final output.

Video Pre-processing

This is where we roll up our sleeves. The goal is to improve data quality before feeding anything into the LLM for reasoning.

Frame & Audio Extraction

The first step is separating the frame and audio components of a video by extracting them with a specialised tool. I used ffmpeg for extraction, as it is efficient, flexible, and reliable for handling raw video and audio.
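Here is a minimal sketch of this step, assuming ffmpeg is installed and on the PATH. The output paths, 1 fps frame rate, and 16 kHz mono audio are illustrative choices of mine, not values from the pipeline described above.

```python
import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path: str, out_dir: str = "work") -> tuple[list[str], str]:
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)

    # Dump one frame per second as JPEGs.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1",
         str(out / "frames" / "frame_%05d.jpg")],
        check=True,
    )

    # Extract the audio track as 16 kHz mono WAV, a convenient format for speech recognition.
    audio_path = str(out / "audio.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )
    return sorted(str(p) for p in (out / "frames").glob("*.jpg")), audio_path
```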

Signal Enhancement

The second step is signal enhancement, to improve the quality of both the visual and the audio data.

i) Visual Enhancement: I improved frame quality with OpenCV, upscaling low-resolution and blurry frames and stabilising shaky footage. OpenCV is also a strong tool for frame resizing and format conversions.
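A minimal per-frame sketch of this enhancement with OpenCV follows. The 2x upscale factor, denoising strengths, and sharpening kernel are illustrative defaults rather than tuned values, and frame stabilisation is left out for brevity.

```python
import cv2
import numpy as np

def enhance_frame_file(path: str) -> str:
    frame = cv2.imread(path)

    # Upscale low-resolution frames with bicubic interpolation.
    frame = cv2.resize(frame, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

    # Reduce sensor and compression noise.
    frame = cv2.fastNlMeansDenoisingColored(frame, None, 10, 10, 7, 21)

    # Light sharpening to counteract blur.
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    frame = cv2.filter2D(frame, -1, kernel)

    cv2.imwrite(path, frame)
    return path
```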

ii) Audio Enhancement: Denoising and improved audio clarity can be achieved with RNNoise. Additionally, I used Librosa for audio feature extraction and for ensuring clean, consistent speech data, so that speech from the video can be transcribed accurately.
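Here is a minimal sketch of the Librosa clean-up, assuming the denoising itself (e.g. RNNoise) is applied as a separate step. The 16 kHz sample rate and 30 dB silence-trim threshold are illustrative choices.

```python
import librosa
import soundfile as sf

def clean_audio(audio_path: str) -> str:
    # Load as mono and resample to 16 kHz, which most speech-recognition models expect.
    y, sr = librosa.load(audio_path, sr=16000, mono=True)

    # Trim leading and trailing silence so the transcriber isn't fed dead air.
    y, _ = librosa.effects.trim(y, top_db=30)

    cleaned_path = audio_path.replace(".wav", "_clean.wav")
    sf.write(cleaned_path, y, sr)
    return cleaned_path
```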

Feature Extraction

i) Frame Embedding: To generate frame embeddings, I used OpenAI CLIP to extract meaningful, language-ready features from the video. Basically, it takes each frame as a pixel matrix and embeds it into a semantic vector space where it can be meaningfully described.
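A minimal sketch of this step using the Hugging Face transformers port of CLIP is below; the checkpoint name and L2-normalisation are my illustrative choices, not details from the pipeline above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frames(frame_paths: list[str]) -> torch.Tensor:
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = _processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = _clip.get_image_features(**inputs)
    # L2-normalise so the embeddings sit on the unit sphere, handy for similarity search.
    return features / features.norm(dim=-1, keepdim=True)
```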

ii) Speech Recognition: Using Whisper v3, I converted speech to text, transcribing the spoken content. This is another important input to the LLM.
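A minimal transcription sketch using the open-source openai-whisper package follows; treat the exact "large-v3" checkpoint name and the helper's shape as assumptions on my part.

```python
import whisper

_asr_model = whisper.load_model("large-v3")

def transcribe_audio(audio_path: str) -> str:
    result = _asr_model.transcribe(audio_path)
    # result["segments"] also carries per-segment timestamps if you need alignment.
    return result["text"]
```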

Multimodal LLM Reasoning

After all the stages above, it is finally GPT-4o's turn to reason. Once we've prepared frame embeddings and transcripts, GPT-4o consumes them to gather insights about what's happening in the video: whether someone is speaking, what is being said, what objects are visible, and how everything connects within a context.
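Here is a minimal sketch of that reasoning call. One assumption on my part: since the public chat API consumes images and text rather than raw CLIP vectors, this sketch sends sampled key frames as base64 images alongside the transcript; the prompt wording and helper names are also illustrative.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def _to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def analyse_video(frame_paths: list[str], transcript: str) -> str:
    content = [{
        "type": "text",
        "text": "Analyse this video. Key frames are attached in order.\n"
                f"Transcript of the audio:\n{transcript}",
    }]
    content += [
        {"type": "image_url", "image_url": {"url": _to_data_url(p)}}
        for p in frame_paths
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```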

Scene Classification (for more accuracy)

This is an optional step, but it does help to improve output accuracy, especially where compliance, safety, or business-logic checks are required. Using an XGBoost classifier, I added an additional layer to the AI agent design to improve the confidence and accuracy of detecting specific scenes or events in each video.
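A minimal sketch of this optional layer is below. The label scheme, file names, hyperparameters, and 0.9 confidence threshold are all illustrative assumptions; the features are the CLIP frame embeddings produced earlier.

```python
import numpy as np
from xgboost import XGBClassifier

X_train = np.load("frame_embeddings_train.npy")   # CLIP embeddings, one row per frame
y_train = np.load("scene_labels_train.npy")       # e.g. 0 = routine scene, 1 = incident

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)

# At inference time, score new frame embeddings and keep only confident detections.
X_new = np.load("frame_embeddings_new.npy")
incident_prob = clf.predict_proba(X_new)[:, 1]
flagged_frames = np.where(incident_prob > 0.9)[0]
```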

Output Analysis

After all the pre-processing, enhancement, and reasoning steps above, we're finally ready to extract structured insights from a given video. Our AI agent can now accurately detect scenes, summarize key moments, or tag items. It can also do this at scale, which is a very powerful capability in automation systems that take video as input. The key lesson? True video intelligence cannot be achieved with existing multimodal LLMs alone, at least for now. I hope they evolve quickly enough to save us from all these hurdles (though it is not as difficult as it seems!).

Sorry! It's a Pipeline, Not a Shortcut

Multimodal LLMs are powerful, but only when they're handed clean, structured data as input. Complex, noisy, low-quality video needs a pipeline of tools and pre-trained models before the LLM can work with high accuracy. True video understanding in AI still requires orchestration and deliberate system design. If you're working with unstructured data like video and expecting a plug-and-play AI solution, such as a multimodal LLM with some prompting, prepare to be surprised, just like I was.
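To make that orchestration concrete, here is a minimal sketch of how the Master Agent might chain the stages described above. It assumes the hypothetical helper functions from the earlier sketches are in scope, and the every-10th-frame key-frame sampling is an arbitrary illustrative choice.

```python
def run_video_agent(video_path: str) -> str:
    frame_paths, audio_path = extract_frames_and_audio(video_path)   # ffmpeg extraction
    frame_paths = [enhance_frame_file(p) for p in frame_paths]       # OpenCV enhancement
    audio_path = clean_audio(audio_path)                             # Librosa clean-up

    transcript = transcribe_audio(audio_path)                        # Whisper v3
    key_frames = frame_paths[::10]                                   # naive key-frame sampling

    # Optional: embed_frames(key_frames) could feed the XGBoost scene classifier here.
    return analyse_video(key_frames, transcript)                     # GPT-4o reasoning
```

A production agent would add error handling, batching, and the scene-classification layer, but the shape stays the same: clean and structure first, reason last.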


Written by hacker76882811 | Technical PM - experienced in FinTech, InsurTech, IoT, and AI/ML
Published by HackerNoon on 2025/08/01