Flash-Cascade — Case study

Problem

Pure dense retrieval was too expensive for every query, while sparse-only retrieval missed semantic intent for harder long-tail searches.

Constraints

Strict sub-150 ms latency targets, billion-scale indexing pressure, and near real-time freshness without expensive full re-indexing cycles.

Approach

Combined BM25 + Block-Max WAND sparse retrieval with ANN dense retrieval and a query-budget router deciding when dense expansion is worth the cost.

Outcome

Delivered a hybrid pipeline with fast sparse-first paths for most traffic, semantic recovery on difficult queries, and 1-2 second freshness via hot reload mechanisms.

What I’d redo

Add broader offline relevance benchmarking and query-cluster level dashboards earlier to tune routing thresholds with less manual iteration.