Traditional search engines work by matching keywords. If you search for "ways to make a car go faster," a keyword search might miss a great article titled "How to improve automobile performance." This is where semantic search comes in.

Semantic search seeks to understand the intent and contextual meaning of a query to find the most relevant results, even if they don't share any keywords. The technology that powers this is embeddings.

1. What are Sentence Embeddings?

While models like Word2Vec create vectors for single words, modern Transformer models (like BERT) can create a single, rich vector representation for an entire sentence, paragraph, or document. This vector is called an embedding.

This embedding is a dense list of numbers (e.g., 768 numbers long) that captures the nuanced meaning of the text. Semantically similar sentences will have embeddings that are "close" to each other in high-dimensional space.

  • "The cat sat on the mat." -> [0.1, 0.9, 0.2, ...]
  • "A feline was resting on the rug." -> [0.12, 0.88, 0.21, ...] (Very similar vector)
  • "The stock market is up today." -> [0.8, -0.4, 0.6, ...] (Very different vector)
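Treating the truncated example vectors above as toy 3-D embeddings (real embeddings have hundreds of dimensions, but the geometry is the same), a quick NumPy sketch confirms that the first two sentences sit much closer together than either does to the third:

```python
import numpy as np

# Toy 3-D stand-ins for the example vectors above.
cat    = np.array([0.1, 0.9, 0.2])     # "The cat sat on the mat."
feline = np.array([0.12, 0.88, 0.21])  # "A feline was resting on the rug."
stocks = np.array([0.8, -0.4, 0.6])    # "The stock market is up today."

def cos(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(cat, feline))  # close to 1: nearly the same meaning
print(cos(cat, stocks))  # near (or below) 0: unrelated meaning
```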

2. The Semantic Search Pipeline

The process is straightforward:

  1. Indexing:
     a. Take your collection of documents (your knowledge base).
     b. For each document, use a sentence-transformer model to generate an embedding vector.
     c. Store all these vectors in a vector database (e.g., Pinecone or Weaviate) or a similarity-search library like FAISS. These systems are optimized for finding the "nearest neighbors" in a high-dimensional space.
  2. Querying:
     a. A user enters a search query.
     b. Use the same sentence-transformer model to generate an embedding for the query.
     c. Search the vector database for the document vectors closest to the query vector.
  3. Results: Return the original documents corresponding to the closest vectors found. These are your semantically relevant results.
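Under the hood, steps 1-3 reduce to a nearest-neighbor search over vectors. Here is a minimal sketch with NumPy, using hand-made 3-D vectors in place of real model embeddings and a plain matrix in place of a vector database:

```python
import numpy as np

# Hand-made 3-D "embeddings" for illustration only; a real pipeline
# would get these from a sentence-transformer model instead.
doc_embeddings = {
    "How to improve automobile performance": np.array([0.9, 0.1, 0.0]),
    "The stock market is up today":          np.array([0.0, 0.2, 0.9]),
    "Basics of bicycle maintenance":         np.array([0.6, 0.5, 0.1]),
}

# 1. Indexing: normalize each vector and stack them into a matrix
# (a stand-in for a vector database).
docs = list(doc_embeddings)
index = np.stack([v / np.linalg.norm(v) for v in doc_embeddings.values()])

# 2. Querying: the query is embedded with the same model. Here we pretend
# the model mapped "ways to make a car go faster" near the automobile doc.
query_vec = np.array([0.85, 0.2, 0.05])
query_vec = query_vec / np.linalg.norm(query_vec)

# 3. Results: on unit vectors, cosine similarity is just a dot product,
# so one matrix-vector multiply scores every document at once.
scores = index @ query_vec
best = int(np.argmax(scores))
print(docs[best])  # the automobile article wins despite sharing no keywords
```

Real vector databases use approximate nearest-neighbor indexes so this lookup stays fast even with millions of documents, but the logic is the same.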

3. Measuring Closeness: Cosine Similarity

How do we measure whether two vectors are "close" in a 768-dimensional space? Euclidean distance works in principle, but for text embeddings the standard choice is cosine similarity.

Cosine similarity measures the cosine of the angle between two vectors. It doesn't care about the magnitude of the vectors, only their orientation.

  • If two vectors point in the exact same direction, their cosine similarity is 1 (identical meaning).
  • If they are orthogonal (unrelated), their similarity is 0.
  • If they point in opposite directions, their similarity is -1.

In semantic search, we are looking for the document vectors with the highest cosine similarity to our query vector.
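The three cases above can be checked directly with a from-scratch implementation of the formula cos(theta) = (a . b) / (|a| |b|), sketched here with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    # The magnitudes cancel out, so only direction matters.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])

print(cosine_similarity(a, a))       # same direction -> 1.0
print(cosine_similarity(a, 2 * a))   # scaling changes nothing -> still 1.0
print(cosine_similarity(a, -a))      # opposite direction -> -1.0
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))  # orthogonal -> 0.0
```

The second call is the key property for embeddings: a vector twice as long but pointing the same way scores exactly the same, which is why cosine similarity compares meaning rather than magnitude.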

Code Example with sentence-transformers

The sentence-transformers library makes it incredibly easy to create high-quality embeddings.

Python


from sentence_transformers import SentenceTransformer, util
import torch

# 1. Load a pretrained sentence-transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our document corpus
corpus = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]

# 2. Generate embeddings for the corpus
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# 3. User query
query = 'A man on a horse'

# 4. Generate embedding for the query
query_embedding = model.encode(query, convert_to_tensor=True)

# 5. Find the most similar sentences in the corpus
# util.cos_sim computes the cosine similarity between the query and all corpus sentences
cosine_scores = util.cos_sim(query_embedding, corpus_embeddings)

# Get the top 3 results
top_results = torch.topk(cosine_scores, k=3)

print(f"Query: {query}\n")
print("Top 3 most similar sentences in corpus:")
for score, idx in zip(top_results[0][0], top_results[1][0]):
    print(f"- {corpus[idx]} (Score: {score:.4f})")