We’re excited to announce that CocoIndex now supports native integration with ColPali, enabling multi-vector, patch-level image indexing with cutting-edge multimodal models. With just a few lines of code, you can embed and index images with ColPali’s late-interaction architecture, fully integrated into CocoIndex’s composable flow system.

## Why ColPali for Indexing?

ColPali (Contextual Late-interaction over Patches) is a powerful model for multimodal retrieval. It fundamentally rethinks how documents, especially visually complex or image-rich ones, are represented and searched. Instead of reducing each image or page to a single dense vector (as in traditional bi-encoders), ColPali breaks an image into many smaller patches, preserving local spatial and semantic structure. Each patch receives its own embedding, and together these embeddings form a multi-vector representation of the complete document.

- **Fine-Grained Visual Search:** Each image is split into a grid (commonly 32x32, generating 1,024 patches per page), and every patch is embedded with contextual awareness of both visual and textual cues. During search, user queries are also broken down into token embeddings, allowing specific textual tokens to be matched against the most relevant image patches. This supports matching at a much finer spatial and semantic granularity than single-vector models.
- **Preservation of Spatial and Semantic Structure:** Traditional methods collapse document images into global vectors, losing vital layout and region-based context. ColPali’s patch embeddings retain spatial relationships and can localize query matches (e.g., finding a diagram in a manual or a table on a form), making search results more accurate and interpretable.
- **High Recall Across Object-Rich Scenes:** Scenes with multiple objects, dense text, graphics, or mixed content benefit because the model does not “forget” small but important regions. Each patch is considered individually, reducing the likelihood of missing relevant information even in visually cluttered pages.
- **Advanced Search Strategies (Late Interaction & MaxSim):** ColPali uses the late-interaction (LI) paradigm, inspired by ColBERT, in which query tokens are compared against all patch embeddings of a document. The MaxSim operation keeps only the maximum similarity score for each query token, then sums these scores to produce the final relevance score (see the sketch after this list). This enables precise, interpretable, and efficient retrieval for both large-scale searches and nuanced queries. Late interaction also reduces the computational burden at query time by avoiding up-front dense cross-attention or joint encoding, making retrieval both fast and accurate.
- **Bypassing Traditional OCR Pipelines:** Because images are processed natively, there is no need for error-prone text extraction or segmentation steps, which boosts speed and end-to-end efficiency. This approach also captures visual elements that OCR skips, such as charts, drawings, or logos.
- **Scalability and Storage Efficiency:** Compression schemes such as quantization and pruning (Hierarchical Patch Compression, HPC-ColPali) can further shrink storage needs and speed up similarity computation, allowing ColPali-based systems to scale to billions of documents or to on-device retrieval without losing accuracy.
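To make the MaxSim scoring described above concrete, here is a minimal, illustrative sketch (independent of CocoIndex and Qdrant internals) of late-interaction scoring over patch embeddings. It assumes L2-normalized vectors so that dot products act as cosine similarities:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, image_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_tokens:  (num_query_tokens, dim) array, one embedding per query token
    image_patches: (num_patches, dim) array, one embedding per image patch
    Both are assumed L2-normalized, so dot products are cosine similarities.
    """
    # Similarity of every query token against every patch: (num_query_tokens, num_patches)
    sim = query_tokens @ image_patches.T
    # For each query token keep only its best-matching patch, then sum over tokens.
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens and 1,024 patches in a 128-dimensional space.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128));    q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(1024, 128)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim_score(q, p))
```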
## Declare an Image Indexing Flow with CocoIndex / Qdrant

This flow illustrates how we’ll process and index images using ColPali:

- Ingest image files from the local filesystem
- Use ColPali to embed each image into patch-level multi-vectors
- Optionally extract image captions using an LLM
- Export the embeddings (and optional captions) to a Qdrant collection

### 1. Ingest the Images

We start by defining a flow that reads `.jpg`, `.jpeg`, and `.png` files from a local directory using `LocalFile`.

```python
import datetime

import cocoindex


@cocoindex.flow_def(name="ImageObjectEmbeddingColpali")
def image_object_embedding_flow(flow_builder, data_scope):
    data_scope["images"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="img", included_patterns=["*.jpg", "*.jpeg", "*.png"], binary=True
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )
```

The `add_source` function sets up a table with fields like `filename` and `content`. Images are automatically re-scanned every minute.

### 2. Process Each Image and Collect the Embedding

#### 2.1 Embed the Image with ColPali

We use CocoIndex's built-in `ColPaliEmbedImage` function, which returns a multi-vector representation for each image. Each patch receives its own vector, preserving spatial and semantic information.

```python
colpali_embed = cocoindex.functions.ColPaliEmbedImage(model="vidore/colpali-v1.2")
```

Inside the flow:

```python
img_embeddings = data_scope.add_collector()
with data_scope["images"].row() as img:
    img["embedding"] = img["content"].transform(colpali_embed)
```

This transformation turns the raw image bytes into a list of vectors, one per patch, that can later be used for late-interaction search.
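The flow outline above also mentions optional LLM captioning, and the collector in the next step picks up a `caption` field when `ollama_model_name` is set. The snippet below is only a hedged sketch of how that step might look: it assumes CocoIndex's `ExtractByLlm` function can be pointed at a multimodal model served by Ollama and applied to the image bytes, and `ollama_model_name` is a placeholder you would configure yourself. Check the CocoIndex docs for the exact captioning helper in your version.

```python
# Hedged sketch, not the exact helper from this post: caption each image with a
# local multimodal model served by Ollama. `ollama_model_name` is a placeholder.
ollama_model_name = "gemma3"  # hypothetical; set to None to skip captioning

with data_scope["images"].row() as img:
    if ollama_model_name is not None:
        img["caption"] = img["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OLLAMA, model=ollama_model_name
                ),
                instruction="Describe this image in one detailed sentence.",
                output_type=str,
            )
        )
```

In practice this would live in the same `row()` block as the embedding step above, so each image is embedded and captioned in a single pass.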
### 3. Collect and Export the Embeddings

Once we’ve processed each image, we collect its metadata and embedding (still inside the per-image `row()` block) and send them to Qdrant.

```python
collect_fields = {
    "id": cocoindex.GeneratedField.UUID,
    "filename": img["filename"],
    "embedding": img["embedding"],
}
if ollama_model_name is not None:
    collect_fields["caption"] = img["caption"]
img_embeddings.collect(**collect_fields)
```

Then we export to Qdrant using the `Qdrant` target:

```python
img_embeddings.export(
    "img_embeddings",
    cocoindex.targets.Qdrant(collection_name="ImageSearchColpali"),
    primary_key_fields=["id"],
)
```

This creates a vector collection in Qdrant that supports multi-vector fields, which is required for ColPali-style late-interaction search.

### 4. Enable Real-Time Indexing

To keep the image index up to date automatically, we wrap the flow in a `FlowLiveUpdater`:

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    load_dotenv()
    cocoindex.init()
    image_object_embedding_flow.setup(report_to_stdout=True)
    app.state.live_updater = cocoindex.FlowLiveUpdater(image_object_embedding_flow)
    app.state.live_updater.start()
    yield
```

This keeps your vector index fresh as new images arrive.

## 🧬 What’s Actually Stored?

Unlike typical image search pipelines that store one global vector per image, ColPali stores a nested vector type:

```
Vector[Vector[Float32, N]]
```

where:

- the outer dimension is the number of patches
- the inner dimension is the model’s hidden size

This makes the index multi-vector ready and compatible with late-interaction query strategies such as MaxSim or learned fusion. The sketch below shows what an equivalent Qdrant collection configuration looks like.
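CocoIndex provisions the Qdrant collection for you when the flow is set up, so the following is only an illustrative sketch of the equivalent manual configuration. It assumes the named vector is called `embedding` (matching the collected field) and that ColPali v1.2 produces 128-dimensional patch vectors; verify both against your deployment.

```python
from qdrant_client import QdrantClient, models

# Assumes a local Qdrant instance; multi-vector support requires Qdrant >= 1.10.
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="ImageSearchColpali",
    vectors_config={
        "embedding": models.VectorParams(
            size=128,  # assumed per-patch dimension for vidore/colpali-v1.2
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    },
)
```

With `MAX_SIM` as the comparator, Qdrant computes the late-interaction score described earlier directly at query time, so no application-side re-ranking is needed for the basic case.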
## 🔌 Real-Time Indexing with Live Updater

You can also attach CocoIndex’s `FlowLiveUpdater` to your FastAPI app (or any Python app) to keep your ColPali index synced in real time:

```python
from contextlib import asynccontextmanager

from dotenv import load_dotenv
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    load_dotenv()
    cocoindex.init()
    image_object_embedding_flow.setup(report_to_stdout=True)
    app.state.live_updater = cocoindex.FlowLiveUpdater(image_object_embedding_flow)
    app.state.live_updater.start()
    yield
```

## 🌳 Retrieval and Application

Refer to this example on querying and application building: https://cocoindex.io/blogs/live-image-search#3-query-the-index

Make sure to use ColPali to embed the query (a sketch of the query-side transform and the full Qdrant search call follows at the end of this post):

```python
@app.get("/search")
def search(
    q: str = Query(..., description="Search query"),
    limit: int = Query(5, description="Number of results"),
) -> Any:
    # Get the multi-vector embedding for the query
    query_embedding = text_to_colpali_embedding.eval(q)
```

## Built with Flexibility in Mind

Whether you’re working on:

- Visual RAG
- Multimodal retrieval systems
- Fine-grained visual search tools
- Bringing image understanding to your AI agent workflows

CocoIndex + ColPali gives you a modular, modern foundation to build from.

We’re constantly adding more examples and improving our runtime. If you found this helpful, please ⭐ star CocoIndex on GitHub and share it with others.

Suggestions for more native ‘LEGO’ pieces? Just let us know! We are moving full speed ahead to support you!
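As referenced above, here is a hedged sketch of the query side. The `text_to_colpali_embedding` transform flow, the `ColPaliEmbedQuery` function, and the named vector `embedding` are assumptions based on the indexing flow in this post; check them against your CocoIndex and qdrant-client versions before relying on this.

```python
from typing import Any

from fastapi import FastAPI, Query
from qdrant_client import QdrantClient

import cocoindex

app = FastAPI()
qdrant = QdrantClient(url="http://localhost:6333")  # assumed local instance


@cocoindex.transform_flow()
def text_to_colpali_embedding(text: cocoindex.DataSlice[str]):
    # Assumed query-side counterpart of ColPaliEmbedImage; same model as indexing.
    return text.transform(
        cocoindex.functions.ColPaliEmbedQuery(model="vidore/colpali-v1.2")
    )


@app.get("/search")
def search(
    q: str = Query(..., description="Search query"),
    limit: int = Query(5, description="Number of results"),
) -> Any:
    # Multi-vector query embedding: one vector per query token.
    query_embedding = text_to_colpali_embedding.eval(q)
    # MaxSim scoring happens inside Qdrant for a multi-vector field (>= 1.10).
    results = qdrant.query_points(
        collection_name="ImageSearchColpali",
        query=query_embedding,
        using="embedding",
        limit=limit,
        with_payload=True,
    )
    return {
        "results": [
            {"filename": p.payload.get("filename"), "score": p.score}
            for p in results.points
        ]
    }
```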