Building Semantic Search for 1M+ Documents
The Challenge
A legal tech startup needed to help lawyers find relevant case law across more than 1 million court documents. Traditional keyword search was failing: legal language is complex, and lawyers often can't recall the exact phrasing of the cases they need.
Key Results
Implemented a production-grade semantic search system using vector embeddings, achieving <100ms p99 latency and 3x better recall than keyword search.
The Problem: Keywords Don't Understand Context
The existing system used Elasticsearch with keyword matching. It worked like Google circa 1998.
The Failures
- Search "vehicular homicide" → Miss cases that say "death by dangerous driving"
- Search "breach of contract" → Get thousands of irrelevant results
- Can't search by concept: "cases where defendant claimed duress"
- No understanding of legal synonyms or related concepts
Lawyers were spending hours manually filtering results or missing critical precedents entirely.
The Solution: Semantic Embeddings + Hybrid Search
We replaced keyword-only search with a hybrid system that understands meaning.
Phase 1: Choosing the Right Embedding Model
Not all embedding models are equal, especially for legal documents.
Models Tested:
- Good general purpose, but expensive at scale
- Fast, but poor on legal jargon
- Domain-specific, but slow
- Winner: Good balance of speed (50ms per document) and accuracy
Why It Won
- Good balance of speed (50ms per document) and accuracy
- Fine-tuning on 50K legal document pairs improved recall by 40%
- 384 dimensions (smaller vector = faster search)
- Can run on CPU in production
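Here's a minimal sketch of the encoding setup, assuming a sentence-transformers model; "legal-embeddings-v1" below is a hypothetical placeholder for our fine-tuned checkpoint, not its real name:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical name standing in for the fine-tuned 384-dim model.
# Small enough to run on CPU in production.
model = SentenceTransformer("legal-embeddings-v1", device="cpu")

chunks = ["The defendant argued duress as an affirmative defense ..."]
embeddings = model.encode(chunks, batch_size=256)
assert embeddings.shape[1] == 384  # smaller vectors = faster search
```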
Phase 2: Chunking Strategy
1 million documents × average 50 pages = death by memory overflow.
The Strategy
- Split documents logically: By section headings, not arbitrary character counts
- Chunk size: 512 tokens (~400 words) - sweet spot for legal reasoning
- Overlap: 50 tokens between adjacent chunks to avoid losing context at boundaries
- Metadata preservation: Each chunk knows its parent document, page number, section
Total chunks: ~8 million (from 1M documents)
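To make that concrete, here's a minimal sketch of the sliding-window chunker, assuming documents are already split by section heading and using words as a stand-in for tokens (production code would use the embedding model's tokenizer):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str   # parent document
    section: str  # section heading the chunk came from
    page: int

def chunk_section(words: list[str], doc_id: str, section: str, page: int,
                  size: int = 512, overlap: int = 50) -> list[Chunk]:
    """Slide a window of `size` tokens, carrying `overlap` tokens forward."""
    chunks, start = [], 0
    while start < len(words):
        window = words[start:start + size]
        chunks.append(Chunk(" ".join(window), doc_id, section, page))
        if start + size >= len(words):
            break
        start += size - overlap  # the overlap preserves context across boundaries
    return chunks
```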
Phase 3: Vector Database Selection
Tested Pinecone, Weaviate, Milvus, and Qdrant.
We chose Qdrant because:
- Self-hosted (client requirement for data privacy)
- Fast filtering on metadata (critical for legal search)
- HNSW index for sub-100ms queries
- Handles 8M vectors without breaking a sweat
- Rust-based (stupid fast)
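The setup itself is small. A minimal sketch with qdrant-client; collection and field names are illustrative, not the production schema:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # self-hosted instance

client.create_collection(
    collection_name="legal_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # HNSW index by default
)

# Payload indexes keep metadata filtering (jurisdiction, date) fast at 8M vectors
client.create_payload_index(
    collection_name="legal_docs",
    field_name="jurisdiction",
    field_schema="keyword",
)
```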
Phase 4: Hybrid Search Architecture
Vector search alone isn't enough. We combined it with traditional search for best results.
```python
# `model` is the embedding model; `qdrant` is an AsyncQdrantClient and
# `elasticsearch` an AsyncElasticsearch; `rank_fusion` and `rerank` are
# our helpers (rank_fusion is sketched below).
async def hybrid_search(query: str, filters: dict):
    # 1. Generate query embedding
    query_embedding = model.encode(query).tolist()

    # 2. Vector search (semantic similarity)
    semantic_results = await qdrant.search(
        collection_name="legal_docs",
        query_vector=query_embedding,
        limit=100,
        query_filter=filters,  # metadata filter: jurisdiction, date range, etc.
    )

    # 3. Keyword search (for exact matches)
    keyword_results = await elasticsearch.search(
        index="legal_docs",
        query={"match": {"text": query}},  # "text" is the indexed body field
        size=100,
    )

    # 4. Reciprocal Rank Fusion (combine results)
    combined = rank_fusion(semantic_results, keyword_results)

    # 5. Re-rank top 20 with cross-encoder
    final_results = await rerank(combined[:20], query)
    return final_results[:10]
```
Why This Works
- Vector search catches semantic matches ("vehicular homicide" = "death by dangerous driving")
- Keyword search ensures exact citations aren't missed
- Re-ranking with cross-encoder improves precision
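For the curious, here's what those two steps can look like: a minimal sketch of Reciprocal Rank Fusion (each document scores the sum of 1/(k + rank) over both result lists; k=60 is the conventional constant) plus a generic public cross-encoder standing in for the production re-ranker, which differs (for one, the real `rerank` is async):

```python
from collections import defaultdict

from sentence_transformers import CrossEncoder

def rank_fusion(semantic_results, keyword_results, k: int = 60):
    """RRF: score(d) = sum over result lists of 1 / (k + rank(d)).
    Assumes both lists are normalized to objects with .id and .text."""
    scores, docs = defaultdict(float), {}
    for results in (semantic_results, keyword_results):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] += 1.0 / (k + rank)
            docs[doc.id] = doc
    return [docs[i] for i in sorted(scores, key=scores.get, reverse=True)]

# Generic public cross-encoder, not the production model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(candidates, query):
    # Score each (query, candidate) pair jointly, then sort by score
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked]
```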
Performance Optimization
Getting from "it works" to "it works in production" required optimization.
Challenge 1: Latency Spikes
P50 was a healthy 30ms, but P99 spiked to 800ms. Fixes:
- Index warming
- Query caching (Redis)
- Connection pooling
Result: P99 < 100ms
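Query caching is the easiest of the three fixes to show. A minimal sketch with redis-py, where `run_search` stands in for the hybrid pipeline above:

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def cache_key(query: str, filters: dict) -> str:
    # Stable hash of query + filters so identical searches share one entry
    payload = json.dumps({"q": query, "f": filters}, sort_keys=True)
    return "search:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_search(query: str, filters: dict, ttl: int = 300):
    key = cache_key(query, filters)
    if (hit := cache.get(key)) is not None:
        return json.loads(hit)                  # repeat queries skip the pipeline
    results = run_search(query, filters)        # stand-in for hybrid_search
    cache.setex(key, ttl, json.dumps(results))  # expire after `ttl` seconds
    return results
```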
Challenge 2: Indexing Speed
The initial full index build took 40 hours. Fixes:
- Batch embedding (256 docs)
- 8 parallel workers
- Bulk upload (1000 batches)
Result: 4 hours
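After those changes, the indexing loop looked roughly like this (a sketch; `model` is the embedding model from Phase 1, and in production each of the 8 workers ran this over its own shard of chunks):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")

def batches(seq, n):
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

def index_shard(chunks):
    for batch in batches(chunks, 256):  # embed 256 chunks per call
        vectors = model.encode([c.text for c in batch])
        points = [
            PointStruct(
                id=c.id,
                vector=vec.tolist(),
                payload={"doc_id": c.doc_id, "section": c.section, "page": c.page},
            )
            for c, vec in zip(batch, vectors)
        ]
        client.upsert(collection_name="legal_docs", points=points)  # bulk upload
```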
Challenge 3: Storage Costs
8M vectors × 384 dimensions × 4 bytes (float32) ≈ 12GB
- Scalar quantization
- 4x storage reduction
- <2% accuracy loss
Result: 3GB total
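In Qdrant, scalar quantization is a collection-level setting. A sketch, assuming a recent qdrant-client:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
)

client = QdrantClient(url="http://localhost:6333")

client.update_collection(
    collection_name="legal_docs",
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # float32 -> int8 = 4x smaller vectors
            always_ram=True,       # keep quantized vectors in RAM for fast search
        )
    ),
)
```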
The Outcome
Launched to 500 lawyers in beta, then full rollout.
Keyword vs Semantic Search
| Metric | Before | After |
|---|---|---|
| Relevant Results in Top 10 (avg) | 4.2 | 8.7 |
| Query Latency (P99) | 120ms | 95ms |
| User Satisfaction | 52% | 89% |
| Time to Find Case | 18 min | 6 min |
Business Impact
- 3x increase in platform usage
- 67% reduction in support tickets about "can't find case"
- Became primary competitive differentiator in sales
This isn't just better search - it's like having a junior associate who's read every case in the database.
The Lesson
Semantic search is table stakes for any document-heavy application in 2024. But production deployment isn't just `pip install sentence-transformers`.
What I Learned
- Domain-specific matters: Fine-tuning on legal data improved results more than switching to a "better" general model
- Hybrid > Pure: Combining semantic + keyword search beats either alone
- Latency is a feature: Users will tolerate 5% less accuracy for 5x faster results
- Quantization is free money: 4x storage reduction with negligible accuracy loss
If you're still using only keyword search for documents, you're leaving money on the table.
Want similar results?
Book a free 15-minute consultation to discuss your project, or get a $500 quick audit.
💳 No payment required to book • 📅 Free 15-min discovery call