Problem
Pure dense retrieval was too expensive to run on every query, while sparse-only retrieval missed semantic intent on harder long-tail searches.
Constraints
Strict sub-150 ms latency targets, billion-scale indexing pressure, and near real-time freshness without expensive full re-indexing cycles.
Approach
Combined BM25 + Block-Max WAND sparse retrieval with ANN dense retrieval and a query-budget router deciding when dense expansion is worth the cost.
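The routing idea can be illustrated with a minimal sketch. It is not the production router. The names (`RouterConfig`, `should_expand`, `rrf_merge`, `hybrid_search`), every threshold except the 150 ms budget, and the use of reciprocal rank fusion to merge sparse and dense results are assumptions made for illustration. The two retrievers are passed in as plain callables that stand in for the BM25 + Block-Max WAND and ANN stages.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (doc_id, score), best first


@dataclass
class RouterConfig:
    latency_budget_ms: float = 150.0  # end-to-end target from the constraints
    dense_cost_ms: float = 40.0       # hypothetical estimate of one ANN lookup
    min_hits: int = 10                # hypothetical: fewer sparse hits counts as weak
    min_top_score: float = 8.0        # hypothetical BM25 score floor for the top hit


def should_expand(sparse_hits: List[Hit], elapsed_ms: float, cfg: RouterConfig) -> bool:
    """Return True when dense expansion looks both affordable and useful."""
    if cfg.latency_budget_ms - elapsed_ms < cfg.dense_cost_ms:
        return False  # not enough budget left for the dense stage
    if not sparse_hits or len(sparse_hits) < cfg.min_hits:
        return True   # sparse recall looks thin
    return sparse_hits[0][1] < cfg.min_top_score  # weak best match


def rrf_merge(result_lists: List[List[Hit]], k: int = 60) -> List[str]:
    """Merge ranked lists with reciprocal rank fusion (an assumed fusion choice)."""
    scores: Dict[str, float] = {}
    for hits in result_lists:
        for rank, (doc_id, _) in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def hybrid_search(
    query: str,
    sparse_search: Callable[[str], List[Hit]],
    dense_search: Callable[[str], List[Hit]],
    cfg: RouterConfig,
) -> List[str]:
    """Sparse-first path; call the dense retriever only when the router says so."""
    start = time.perf_counter()
    sparse_hits = sparse_search(query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if not should_expand(sparse_hits, elapsed_ms, cfg):
        return [doc_id for doc_id, _ in sparse_hits]
    return rrf_merge([sparse_hits, dense_search(query)])
```

In this sketch the dense retriever is never called for queries where sparse results already look strong, which is what keeps the common path cheap. The thresholds are the values that would need tuning against real traffic.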
Outcome
Delivered a hybrid pipeline with fast sparse-first paths for most traffic, semantic recovery on difficult queries, and 1-2 second freshness via hot reload mechanisms.
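One common way to get this kind of freshness without a full re-index is to publish small deltas by swapping an immutable snapshot. The toy sketch below shows that pattern only. The class `HotSwapIndex` and its methods are hypothetical, and the actual hot reload mechanism may differ. Copying a whole dictionary would not work at billion scale, so a real system would swap segment-level structures instead.

```python
import threading
from typing import Dict, FrozenSet, Iterable


class HotSwapIndex:
    """Toy inverted index whose readers never block on writers."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, FrozenSet[str]] = {}  # term -> doc ids
        self._write_lock = threading.Lock()

    def search(self, term: str) -> FrozenSet[str]:
        # Grab the current snapshot once; it is never mutated after publication,
        # so the query sees a consistent view even if a reload happens mid-query.
        snapshot = self._snapshot
        return snapshot.get(term, frozenset())

    def apply_delta(self, delta: Dict[str, Iterable[str]]) -> None:
        # Build the next snapshot off to the side, then publish it by
        # rebinding a single reference. Writers are serialized by the lock.
        with self._write_lock:
            merged = dict(self._snapshot)
            for term, doc_ids in delta.items():
                merged[term] = merged.get(term, frozenset()) | frozenset(doc_ids)
            self._snapshot = merged


index = HotSwapIndex()
index.apply_delta({"retrieval": ["doc-1"]})
index.apply_delta({"retrieval": ["doc-2"], "hybrid": ["doc-2"]})
print(sorted(index.search("retrieval")))
```

Freshness in this pattern is bounded by how often deltas are built and published, not by the cost of rebuilding the whole index.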
What I’d redo
Add broader offline relevance benchmarking and query-cluster level dashboards earlier to tune routing thresholds with less manual iteration.