RAG Text Retrieval with Cosine Similarity

hard · Python

Lesson

Vector Similarity and Information Retrieval

Information retrieval systems help us find relevant documents in large collections of text. At the heart of modern retrieval systems lies vector similarity: the idea that we can represent text as mathematical vectors and measure how "similar" they are.

From Text to Vectors

The first challenge is converting text into numbers that computers can work with. One simple approach is bag-of-words: we count how many times each word appears in a document, ignoring word order. For example, "Python is great" becomes {"python": 1, "is": 1, "great": 1}. More sophisticated methods such as neural embeddings capture semantic meaning, but the core principle remains the same: each text becomes a vector, and texts with similar content should end up with similar vectors.
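As a minimal sketch of the bag-of-words idea, the snippet below counts lowercase, whitespace-separated tokens. The function name `bag_of_words` is hypothetical, and splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token appears."""
    return Counter(text.lower().split())

counts = bag_of_words("Python is great")
print(dict(counts))  # {'python': 1, 'is': 1, 'great': 1}
```

Words that never appear in a document are simply absent from the dictionary, which is equivalent to a count of zero in the full vector.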

Once we have vectors, we normalize them by dividing every component by the vector's magnitude (its Euclidean length). This adjusts for document length: a long document might have higher word counts simply because it's longer, not because it's more relevant. After normalization every vector has magnitude 1.0, so we're comparing content, not length.
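To make the length adjustment concrete, here is a small sketch that normalizes a short document and a copy of it repeated ten times. The helper name `l2_normalize` and the sample texts are assumptions for illustration only.

```python
import math
from collections import Counter

def l2_normalize(counts):
    """Scale a word-count dict so its Euclidean length is 1.0."""
    magnitude = math.sqrt(sum(c * c for c in counts.values()))
    if magnitude == 0:
        return {}  # empty document: nothing to scale
    return {word: c / magnitude for word, c in counts.items()}

short_doc = Counter("python is great".split())
long_doc = Counter(("python is great " * 10).split())  # same content, 10x the counts

short_unit = l2_normalize(short_doc)
long_unit = l2_normalize(long_doc)

# Raw counts differ (1 vs 10), but the normalized weights agree.
print(short_doc["python"], long_doc["python"])                       # 1 10
print(math.isclose(short_unit["python"], long_unit["python"]))      # True
```

Both documents end up with the same unit vector, so the longer one gains no advantage from repetition alone.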

Measuring Similarity

Cosine similarity is the standard measure for comparing text vectors. It computes the cosine of the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction (similar content) have a cosine similarity of 1.0, while perpendicular vectors (no shared content) score 0.0. Word-count vectors have no negative components, so their scores always fall between 0.0 and 1.0.

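The formula is: cosine_similarity = dot_product(A, B) / (magnitude(A) × magnitude(B)). When both vectors are already normalized to magnitude 1.0, the denominator is 1 and the score is simply the dot product. Because the magnitudes are divided out, cosine similarity reflects the relative pattern of words rather than their absolute frequencies.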
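The formula can be sketched directly on raw (unnormalized) count dictionaries. The function name `cosine_from_counts` and the sample vectors are hypothetical; returning 0.0 for an empty vector is a convention chosen here, not part of the formula.

```python
import math

def cosine_from_counts(a, b):
    """Cosine similarity of two sparse vectors stored as word -> count dicts."""
    dot = sum(value * b.get(word, 0) for word, value in a.items())
    mag_a = math.sqrt(sum(v * v for v in a.values()))
    mag_b = math.sqrt(sum(v * v for v in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (mag_a * mag_b)

base = {"learning": 1, "algorithms": 1}
scaled = {"learning": 5, "algorithms": 5}  # same direction, larger magnitude
unrelated = {"cooking": 1}                 # no words in common

print(cosine_from_counts(base, scaled))     # approximately 1.0
print(cosine_from_counts(base, unrelated))  # 0.0
```

Multiplying every count by five leaves the score essentially unchanged, which is the magnitude-invariance the formula is designed to provide.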

Retrieval Systems

RAG (Retrieval-Augmented Generation) systems use this concept to find relevant information before generating a response. The process is straightforward: embed the user's query, embed the candidate documents (in practice usually done ahead of time and stored in an index), calculate the similarity between the query and each document, and return the top matches. This "retrieve then generate" approach grounds responses in specific, relevant content instead of relying only on what the model memorized during training.
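The retrieval steps described here can be sketched as follows. This is a toy illustration, not a production design: `embed` is a stand-in that uses normalized word counts instead of a neural embedding model, and `retrieve`, `top_k`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    magnitude = math.sqrt(sum(c * c for c in counts.values()))
    if magnitude == 0:
        return {}
    return {word: c / magnitude for word, c in counts.items()}

def retrieve(query, documents, top_k=2):
    """Return the top_k (score, document) pairs, highest similarity first."""
    query_vec = embed(query)
    scored = []
    for doc in documents:
        doc_vec = embed(doc)
        # Both vectors have unit length, so the dot product is the cosine similarity.
        score = sum(w * doc_vec.get(word, 0.0) for word, w in query_vec.items())
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

documents = [
    "cosine similarity compares text vectors",
    "bread recipes for beginners",
    "vectors are lists of numbers",
]

for score, doc in retrieve("how to compare text vectors", documents):
    print(f"{score:.2f}  {doc}")
```

The first document ranks highest because it shares two words with the query, the third shares one, and the bread recipe shares none. Note that "compare" and "compares" do not match under exact word counting, which is one limitation that semantic embeddings address.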

Example
```python
import math
from collections import Counter

def create_vector(text):
    """Turn text into a unit-length bag-of-words vector (dict of word -> weight)."""
    words = text.lower().split()
    counts = Counter(words)
    # Normalize by magnitude so all vectors have magnitude 1.0 for fair comparison
    magnitude = math.sqrt(sum(count**2 for count in counts.values()))
    if magnitude == 0:
        return {}  # Empty text: avoid dividing by zero
    return {word: count / magnitude for word, count in counts.items()}

def cosine_similarity(vec1, vec2):
    # Only words present in both vectors contribute to the similarity score
    common_words = set(vec1.keys()) & set(vec2.keys())
    dot_product = sum(vec1[word] * vec2[word] for word in common_words)
    return dot_product  # Inputs are already normalized, so no division is needed

# Example usage: these documents share 'learning' and 'algorithms',
# resulting in high similarity
doc1 = create_vector("machine learning algorithms")
doc2 = create_vector("learning new algorithms")
print(f"Similarity: {cosine_similarity(doc1, doc2):.3f}")  # Similarity: 0.667
```

Key Takeaways

  • Text similarity relies on converting documents to normalized vectors that can be mathematically compared
  • Cosine similarity measures the angle between vectors, focusing on content patterns rather than absolute word counts
  • Retrieval systems rank documents by similarity scores, enabling efficient search through large text collections