Information retrieval systems help us find relevant documents in large collections of text. At the heart of modern retrieval systems lies vector similarity: the idea that we can represent text as mathematical vectors and measure how "similar" they are.
The first challenge is converting text into numbers that computers can work with. One simple approach is bag-of-words: we count how many times each word appears in a document. For example, "Python is great" becomes {"python": 1, "is": 1, "great": 1}. More sophisticated methods like neural embeddings capture semantic meaning, but the core principle remains the same.
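As a minimal sketch of the bag-of-words idea, the snippet below counts words and then lays the counts out against a shared vocabulary. The function names `bag_of_words` and `to_dense`, the whitespace tokenizer, and the four-word vocabulary are assumptions made for illustration; a real system would use a proper tokenizer and a much larger vocabulary.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

def to_dense(counts, vocabulary):
    """Lay the counts out as a list, one position per vocabulary word."""
    return [counts.get(word, 0) for word in vocabulary]

counts = bag_of_words("Python is great")
vocabulary = ["great", "is", "java", "python"]  # hypothetical shared vocabulary

print(dict(counts))                  # {'python': 1, 'is': 1, 'great': 1}
print(to_dense(counts, vocabulary))  # [1, 1, 0, 1]
```

The dictionary form stores only the words that occur, which is convenient when the vocabulary is large and most counts are zero.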
Once we have vectors, we usually normalize them by dividing each component by the vector's Euclidean length. Think of this as adjusting for document length: a long document might have higher word counts simply because it's longer, not because it's more relevant. Normalization gives every vector the same "magnitude" (a length of 1), so we're comparing content, not length.
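To illustrate the effect of normalization, here is a small sketch that scales a word-count dictionary to unit length. The helper name `l2_normalize` and the two toy documents are assumptions for this example, and returning an empty dictionary for an empty document is just one possible convention.

```python
import math

def l2_normalize(counts):
    """Scale a {word: count} mapping to unit Euclidean length."""
    magnitude = math.sqrt(sum(c * c for c in counts.values()))
    if magnitude == 0:
        return {}  # empty document: nothing to scale
    return {word: c / magnitude for word, c in counts.items()}

short_doc = {"python": 1, "great": 1}
long_doc = {"python": 10, "great": 10}  # same proportions, ten times longer

a = l2_normalize(short_doc)
b = l2_normalize(long_doc)
print(a)  # both components are roughly 0.707
print(b)  # matches the short document up to floating-point rounding
```

After scaling, the two documents are indistinguishable, which is the point: only the proportions of the words survive.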
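Cosine similarity is the most widely used measure for comparing text vectors. It measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction (similar content) have a cosine similarity of 1.0, while perpendicular vectors (no words in common) score 0.0. In general the score can be as low as -1.0 for vectors pointing in opposite directions, but word-count vectors have no negative components, so their scores stay between 0.0 and 1.0.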
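The mathematical formula is: cosine_similarity = dot_product(A, B) / (magnitude(A) × magnitude(B)). When both vectors have already been normalized to unit length, the denominator equals 1 and the cosine similarity reduces to the dot product. The appeal of cosine similarity is that it focuses on the pattern of words rather than their absolute frequencies.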
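The formula can be applied directly to raw counts, without normalizing first. The sketch below does that for sparse word-count dictionaries; the name `cosine_similarity_raw`, the toy documents, and the choice to return 0.0 for an empty vector are assumptions made for this example.

```python
import math

def cosine_similarity_raw(a, b):
    """Cosine similarity of two {word: count} mappings that are NOT pre-normalized."""
    # Words missing from either side contribute zero to the dot product
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    mag_a = math.sqrt(sum(v * v for v in a.values()))
    mag_b = math.sqrt(sum(v * v for v in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (mag_a * mag_b)

a = {"python": 2, "great": 1}
b = {"python": 1, "slow": 1}
a_tripled = {"python": 6, "great": 3}  # same pattern as a, larger counts

print(round(cosine_similarity_raw(a, b), 3))          # 0.632, i.e. 2 / sqrt(10)
print(round(cosine_similarity_raw(a_tripled, b), 3))  # 0.632, magnitude does not matter
print(cosine_similarity_raw({"python": 1}, {"java": 1}))  # 0.0, no shared words
```

Tripling every count leaves the score unchanged, which is what "ignoring magnitude" means in practice.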
RAG (Retrieval-Augmented Generation) systems use this concept to find relevant information before generating responses. The process is straightforward: embed the user's query, embed all candidate documents (in practice, usually ahead of time), calculate similarities, and return the top matches. This "retrieve then generate" approach is widely used in AI applications because it grounds responses in specific, relevant content.
import math
from collections import Counter

def create_vector(text):
    """Build a unit-length bag-of-words vector from whitespace-separated text."""
    words = text.lower().split()
    counts = Counter(words)
    # Normalize by Euclidean magnitude so the vector has length 1
    magnitude = math.sqrt(sum(count**2 for count in counts.values()))
    if magnitude == 0:
        return {}  # empty text has no words to weight
    return {word: count / magnitude for word, count in counts.items()}

def cosine_similarity(vec1, vec2):
    """Cosine similarity of two vectors already normalized by create_vector."""
    # Only words present in both vectors contribute to the dot product
    common_words = set(vec1.keys()) & set(vec2.keys())
    dot_product = sum(vec1[word] * vec2[word] for word in common_words)
    return dot_product  # Inputs are unit vectors, so no division is needed

# Example usage
doc1 = create_vector("machine learning algorithms")
doc2 = create_vector("learning new algorithms")
print(f"Similarity: {cosine_similarity(doc1, doc2):.3f}")  # 0.667: two of three words shared
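
The retrieval steps described earlier (embed the query, embed the candidates, score them, keep the top matches) can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `embed`, `similarity`, and `retrieve` are hypothetical names, the bag-of-words `embed` stands in for a real embedding model, the three-document corpus is made up, and the generation step that would consume the results is left out.

```python
import math
from collections import Counter

def embed(text):
    """Unit-length bag-of-words vector; stands in for a real embedding model."""
    counts = Counter(text.lower().split())
    magnitude = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / magnitude for w, c in counts.items()} if magnitude else {}

def similarity(u, v):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(u[w] * v[w] for w in u.keys() & v.keys())

def retrieve(query, documents, k=2):
    """Return the k (score, document) pairs most similar to the query, best first."""
    q = embed(query)
    scored = [(similarity(q, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

documents = [  # hypothetical mini corpus
    "machine learning algorithms",
    "cooking pasta at home",
    "learning new algorithms",
]

for score, doc in retrieve("learning algorithms", documents):
    print(f"{score:.3f}  {doc}")
```

With this corpus, the two documents that mention both "learning" and "algorithms" are returned and the cooking document is dropped. A real system would typically embed the documents once and store the vectors, rather than re-embedding them for every query as this sketch does.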