NVIDIA Takes #1 in Document Search: Nemotron ColEmbed V2 Released

Achieved #1 Overall on ViDoRe V3 Benchmark

  • Scored an NDCG@10 of 63.42, ranking #1 overall on the ViDoRe V3 benchmark
  • Available in three model sizes: 3B, 4B, and 8B for diverse use cases
  • ColBERT-style late interaction scores text and visual content together at the token level

What Happened?

NVIDIA released Nemotron ColEmbed V2, a multimodal document search model.[Hugging Face] The model specializes in Visual Document Retrieval: finding documents that contain visual elements, such as tables and charts, using plain-text queries. It achieved #1 overall on the ViDoRe V3 benchmark with an NDCG@10 score of 63.42.[NVIDIA]

The model comes in three sizes. The 8B leads the benchmark at 63.42, the 4B ranks 3rd at 61.54, and the 3B ranks 6th at 59.79. All three use a ColBERT-style late-interaction mechanism that computes similarity precisely at the token level.[Hugging Face]
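
To make the late-interaction scoring concrete, here is a minimal sketch of the MaxSim operation it is based on. The embeddings are random stand-ins, and the token counts and vector dimensions are illustrative assumptions, not the model's published configuration.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized document token embeddings
    """
    # Cosine similarity between every query token and every document token
    sim = query_emb @ doc_emb.T                 # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token...
    max_per_query_token = sim.max(dim=1).values
    # ...and sum those maxima into a single relevance score
    return max_per_query_token.sum()

# Illustrative shapes only: 12 query tokens, 1024 page tokens, 128-dim vectors
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(q, d))
```

In practice, the document side is embedded once at indexing time and stored; only the query is embedded at search time.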

Why Does It Matter?

Enterprise documents are not just text. They contain tables, charts, and infographics. Traditional text-based search misses these visual elements. Nemotron ColEmbed V2 understands both images and text together, improving search accuracy.

This is particularly valuable for RAG (Retrieval-Augmented Generation) systems. Before an LLM generates a response, it needs to find the relevant documents, and the accuracy of this retrieval step largely determines the quality of the final answer. Key improvements over V1 include advanced model-merging techniques and training on multilingual synthetic data.

What Comes Next?

Multimodal search is becoming a necessity, not an option. NVIDIA plans to integrate this model into its NeMo Retriever product line, and competition over document search accuracy in enterprise RAG pipelines is about to intensify. However, the late-interaction approach requires storing token-level embeddings, which means markedly higher storage costs than single-vector retrieval, as the rough estimate below illustrates.
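
A back-of-the-envelope calculation shows the scale of the overhead. All numbers below are assumptions chosen for illustration, not the model's actual settings.

```python
# Back-of-the-envelope storage comparison (all numbers are assumptions).
BYTES_PER_VALUE = 2        # float16
SINGLE_VECTOR_DIM = 1024   # a typical dense-retrieval embedding size

TOKENS_PER_PAGE = 1024     # e.g., image patches + text tokens for one page
TOKEN_VECTOR_DIM = 128     # late-interaction models often use small per-token dims

dense_bytes = SINGLE_VECTOR_DIM * BYTES_PER_VALUE
late_bytes = TOKENS_PER_PAGE * TOKEN_VECTOR_DIM * BYTES_PER_VALUE

print(f"single-vector page:    {dense_bytes / 1024:.0f} KiB")   # ~2 KiB
print(f"late-interaction page: {late_bytes / 1024:.0f} KiB")    # ~256 KiB
```

At these assumed sizes, the multi-vector representation is roughly 128x larger per page, which is the trade-off buyers of that extra accuracy should budget for.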

Frequently Asked Questions (FAQ)

Q: What is Late-Interaction?

A: Traditional embedding models compress an entire document into a single vector. Late interaction instead keeps a separate vector for every token: each query token is matched against all document tokens, its maximum similarity is taken, and those maxima are summed into the relevance score (the MaxSim operation sketched above). It is more precise but requires more storage space.

Q: Which model size should I choose?

A: Use the 8B model if accuracy is the top priority. The 4B offers a good balance between cost and speed. The 3B remains near the top of the leaderboard while fitting resource-constrained environments. All are available for free on Hugging Face.

Q: Can I apply this to existing RAG systems?

A: Yes. Load it via Hugging Face Transformers and swap it in for the embedding model in your existing pipeline. Because late interaction produces many vectors per document rather than one, you may also need a vector database and indexing scheme that support multi-vector (MaxSim-style) retrieval. NVIDIA NGC also provides containers.
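
As a rough illustration of what the swap could look like, here is a minimal sketch. The model identifier, the processor call, and the way token embeddings are read from the output are placeholders, not confirmed API; the Hugging Face model card documents the real usage.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Hypothetical model ID -- check the actual Hugging Face model card for the
# released name and the exact usage it documents.
MODEL_ID = "nvidia/nemotron-colembed-v2-3b"

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Unlike a single-vector model, the output per query is a whole matrix:
# one embedding per token, shape (num_query_tokens, dim).
inputs = processor(text=["quarterly revenue by region"], return_tensors="pt")
query_emb = model(**inputs).last_hidden_state[0]  # schematic output access

# The integration point for an existing RAG pipeline: rank stored pages by
# MaxSim instead of a single dot product per document.
def maxsim(q: torch.Tensor, page: torch.Tensor) -> torch.Tensor:
    return (q @ page.T).max(dim=1).values.sum()

# Each stored page is itself a (num_page_tokens, dim) matrix from the DB;
# a random stand-in is used here.
stored_pages = [torch.randn(1024, query_emb.shape[-1]).to(query_emb)]
scores = [maxsim(query_emb, page) for page in stored_pages]
```

The main pipeline change is in the storage layer: each page maps to a matrix of token embeddings rather than a single row, so indexing and scoring must be MaxSim-aware.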


If you found this article useful, please subscribe to AI Digester.

