
Building Semantic Search for 1M+ Documents

Client
Legal Tech Startup
Role
Senior Backend Engineer
Timeline
6 Weeks
Tech Stack
Python, FastAPI, Qdrant, Sentence Transformers, PostgreSQL, Elasticsearch

The Challenge

A legal tech startup needed to help lawyers find relevant case law from over 1 million court documents. Traditional keyword search was failing - legal language is complex, and lawyers often can't remember exact phrases from cases they need.

Key Results

  • 1M+ documents indexed
  • <100ms P99 latency
  • 3x better recall

Implemented a production-grade semantic search system using vector embeddings, achieving <100ms p99 latency and 3x better recall than keyword search.

The Problem: Keywords Don't Understand Context

The existing system used Elasticsearch with keyword matching. It worked like Google circa 1998.

⚠️ The Failures

  • Search "vehicular homicide" → misses cases that say "death by dangerous driving"
  • Search "breach of contract" → returns thousands of irrelevant results
  • Can't search by concept: "cases where defendant claimed duress"
  • No understanding of legal synonyms or related concepts

Lawyers were spending hours manually filtering results or missing critical precedents entirely.

The Solution: Semantic Embeddings + Hybrid Search

We replaced keyword-only search with a hybrid system that understands meaning.

Phase 1: Choosing the Right Embedding Model

Not all embedding models are equal, especially for legal documents.

Models Tested:

  • text-embedding-3-small (OpenAI): good general-purpose quality, but expensive at scale
  • all-MiniLM-L6-v2: fast, but weak on legal jargon
  • legal-bert-base-uncased: domain-specific, but slow
  • Fine-tuned all-mpnet-base-v2: the winner

Why It Won

  • Good balance of speed (50ms per document) and accuracy
  • Fine-tuning on 50K legal document pairs improved recall by 40%
  • 384 dimensions (smaller vector = faster search)
  • Can run on CPU in production
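
For illustration, a minimal sketch of what the fine-tuning step might look like with sentence-transformers, assuming the 50K legal pairs are available as (search phrase, relevant passage) tuples; the loss choice, batch size, and output path here are illustrative, not the production values.

Fine-Tuning Sketch (Python)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from the general-purpose base model
model = SentenceTransformer("all-mpnet-base-v2")

# Hypothetical training pairs: (search phrase, passage from a relevant case)
pairs = [
    ("death by dangerous driving", "The defendant was convicted of vehicular homicide..."),
    # ...~50K legal document pairs in practice
]
train_examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss treats the other passages in a batch as negatives,
# which suits datasets that only have positive (query, passage) pairs
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("models/legal-mpnet-finetuned")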

Phase 2: Chunking Strategy

1 million documents × average 50 pages = death by memory overflow.

💡 The Strategy

  1. Split documents logically: By section headings, not arbitrary character counts
  2. Chunk size: 512 tokens (~400 words) - sweet spot for legal reasoning
  3. Overlap: 50-token overlap between chunks to avoid context loss
  4. Metadata preservation: Each chunk knows its parent document, page number, section

Total chunks: ~8 million (from 1M documents)
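
A simplified sketch of the chunker under those rules, assuming documents arrive as plain text; the heading regex and the word-offset page lookup are illustrative, and words stand in for tokens (real token counts would come from the embedding model's tokenizer).

Chunking Sketch (Python)
import re

CHUNK_WORDS = 400     # ~512 tokens
OVERLAP_WORDS = 50

def chunk_document(doc_id: str, text: str, page_for_offset: dict[int, int]) -> list[dict]:
    """Split one document into overlapping, metadata-rich chunks."""
    # 1. Split logically on section headings (pattern is illustrative)
    sections = re.split(r"\n(?=[IVX]+\.\s|[A-Z][A-Z ]{5,}\n)", text)

    chunks = []
    step = CHUNK_WORDS - OVERLAP_WORDS
    for section_idx, section in enumerate(sections):
        words = section.split()
        # 2. Slide a ~400-word window with a 50-word overlap between chunks
        for start in range(0, len(words), step):
            window = words[start:start + CHUNK_WORDS]
            chunks.append({
                "text": " ".join(window),
                # 3. Metadata preserved so every chunk can be traced and filtered
                "doc_id": doc_id,
                "section": section_idx,
                "page": page_for_offset.get(start),  # hypothetical word-offset -> page lookup
            })
    return chunks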

Phase 3: Vector Database Selection

Tested Pinecone, Weaviate, Milvus, and Qdrant.

We chose Qdrant because:

  • Self-hosted (client requirement for data privacy)
  • Fast filtering on metadata (critical for legal search)
  • HNSW index for sub-100ms queries
  • Handles 8M vectors without breaking a sweat
  • Rust-based (stupid fast)

Phase 4: Hybrid Search Architecture

Vector search alone isn't enough. We combined it with traditional search for best results.

Hybrid Search Pipeline (Python)
async def hybrid_search(query: str, filters: dict):
    # 1. Generate query embedding
    query_embedding = model.encode(query).tolist()

    # 2. Vector search (semantic similarity)
    semantic_results = await qdrant.search(
        collection_name="legal_docs",
        query_vector=query_embedding,
        limit=100,
        query_filter=filters  # jurisdiction, date range, etc.
    )

    # 3. Keyword search (for exact matches and citations)
    keyword_results = await elasticsearch.search(
        index="legal_docs",
        query={"match": {"text": query}},  # assumes chunks are indexed with a "text" field
        size=100
    )

    # 4. Reciprocal Rank Fusion (combine both result lists)
    combined = rank_fusion(semantic_results, keyword_results)

    # 5. Re-rank the top 20 candidates with a cross-encoder
    final_results = await rerank(combined[:20], query)

    return final_results[:10]
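
The rank_fusion and rerank helpers referenced above might look roughly like this; the sketch assumes both result sets have first been normalized to dicts carrying an "id" and the chunk "text", and the cross-encoder checkpoint named here is a common public one, not necessarily the model actually deployed.

Rank Fusion and Re-Ranking Sketch (Python)
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # checkpoint is an assumption

def rank_fusion(semantic_results: list[dict], keyword_results: list[dict], k: int = 60) -> list[dict]:
    """Reciprocal Rank Fusion: merge two ranked lists into one."""
    scores: dict[str, float] = {}
    by_id: dict[str, dict] = {}
    for ranked in (semantic_results, keyword_results):
        for rank, doc in enumerate(ranked, start=1):
            # Documents near the top of either list accumulate the highest fused score
            scores[doc["id"]] = scores.get(doc["id"], 0.0) + 1.0 / (k + rank)
            by_id[doc["id"]] = doc
    return [by_id[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]

async def rerank(candidates: list[dict], query: str) -> list[dict]:
    """Score (query, chunk text) pairs jointly with a cross-encoder for precision."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = cross_encoder.predict(pairs)
    return [c for c, _ in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)]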
💭 Why This Works

  • Vector search catches semantic matches ("vehicular homicide" = "death by dangerous driving")
  • Keyword search ensures exact citations aren't missed
  • Re-ranking with cross-encoder improves precision

Performance Optimization

Getting from "it works" to "it works in production" required optimization.

Challenge 1: Latency Spikes

P50 was 30ms, but P99 was 800ms

  • Index warming
  • Query caching (Redis)
  • Connection pooling

Result: P99 < 100ms
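
Of those three fixes, the query cache is the easiest to show; a rough sketch with redis-py below, where the key scheme, the TTL, and the assumption that results are JSON-serializable are all illustrative.

Query Cache Sketch (Python)
import hashlib
import json

import redis.asyncio as redis

cache = redis.Redis(host="localhost", port=6379)

async def cached_search(query: str, filters: dict):
    # Hash the query + filters into a stable cache key
    raw = json.dumps([query, filters], sort_keys=True).encode()
    key = "search:" + hashlib.sha256(raw).hexdigest()

    if (hit := await cache.get(key)) is not None:
        return json.loads(hit)      # cache hit: skip embedding, vector search, and re-ranking

    results = await hybrid_search(query, filters)
    await cache.set(key, json.dumps(results), ex=300)  # 5-minute TTL (illustrative)
    return results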

Challenge 2: Indexing Speed

Initial indexing took 40 hours

  • Batch embedding (256 documents per batch)
  • 8 parallel workers
  • Bulk upload (batches of 1,000 points)

Result: 4 hours
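
A sketch of how the parallel indexing pipeline could be wired together, assuming chunks arrive as dicts with a unique chunk_id; the process-pool layout is one reasonable way to run 8 workers, not necessarily the exact production setup.

Parallel Indexing Sketch (Python)
from concurrent.futures import ProcessPoolExecutor

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

EMBED_BATCH = 256     # documents embedded per call
UPLOAD_BATCH = 1000   # points per bulk upsert
WORKERS = 8

def embed_batch(chunks: list[dict]) -> list[models.PointStruct]:
    # Each worker process loads the model (in practice, cached per worker)
    model = SentenceTransformer("models/legal-mpnet-finetuned")
    vectors = model.encode([c["text"] for c in chunks], batch_size=EMBED_BATCH)
    return [
        models.PointStruct(id=c["chunk_id"], vector=v.tolist(), payload=c)
        for c, v in zip(chunks, vectors)
    ]

def index_all(batches: list[list[dict]]) -> None:
    qdrant = QdrantClient(url="http://localhost:6333")
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        for points in pool.map(embed_batch, batches):
            # Bulk upload in slices of 1,000 points
            for i in range(0, len(points), UPLOAD_BATCH):
                qdrant.upsert(collection_name="legal_docs", points=points[i:i + UPLOAD_BATCH])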

Challenge 3: Storage Costs

8M vectors = 12GB

  • Scalar quantization
  • 4x storage reduction
  • <2% accuracy loss

Result: 3GB total
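
In Qdrant, scalar quantization is part of the collection configuration; a sketch of the relevant settings below, with the quantile and always_ram values as reasonable defaults rather than the exact production choices.

Scalar Quantization Config (Python)
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="legal_docs",
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE,
        on_disk=True,                      # keep full-precision originals on disk
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,   # 1 byte per dimension instead of 4 -> the 4x saving
            quantile=0.99,                 # clip extreme values before quantizing
            always_ram=True,               # serve searches from the small quantized copy in RAM
        )
    ),
)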

The Outcome

Launched to 500 lawyers in beta, then full rollout.

Keyword vs Semantic Search

Metric                        Before    After
Relevant Results in Top 10    4.2       8.7
Query Latency (P99)           120ms     95ms
User Satisfaction             52%       89%
Time to Find Case             18 min    6 min

Business Impact

  • 3x increase in platform usage
  • 67% reduction in support tickets about "can't find case"
  • Became the primary competitive differentiator in sales

"This isn't just better search - it's like having a junior associate who's read every case in the database."

- Product Lead

The Lesson

Semantic search is table stakes for any document-heavy application in 2024. But production deployment isn't just pip install sentence-transformers.

💭 What I Learned

  • Domain specificity matters: Fine-tuning on legal data improved results more than switching to a "better" general model
  • Hybrid > Pure: Combining semantic + keyword search beats either alone
  • Latency is a feature: Users will tolerate 5% less accuracy for 5x faster results
  • Quantization is free money: 4x storage reduction with negligible accuracy loss

If you're still using only keyword search for documents, you're leaving money on the table.

Want similar results?

Book a free 15-minute consultation to discuss your project, or get a $500 quick audit.

💳 No payment required to book • 📅 Free 15-min discovery call