MultiHub Forum

Full Version: Why does our recommender raise CTR but reduce conversions at scale?
I work as a lead data scientist for a large e-commerce platform, and our team has spent the last six months developing a new recommendation engine. We've achieved impressive offline metrics, with a significant lift in precision and recall compared to our old model during A/B testing on a small user segment. However, when we rolled it out to a larger percentage of traffic last week, we observed a counterintuitive and concerning trend: while click-through rates improved, the overall conversion rate for the affected user group slightly declined. We're now in a diagnostic phase, trying to understand if the model is creating a "filter bubble" effect, recommending items that are engaging but not ultimately purchasable, or if there's a deeper issue with how it interacts with other parts of the site like search and promotions. We're sifting through user session logs and running counterfactual analyses, but it's a complex puzzle. Has anyone else faced a similar situation where a model performed well in controlled tests but had unintended consequences at scale? What investigative approaches or specific metrics did you find most revealing in diagnosing the root cause of such a behavioral disconnect?
You're not alone—CTR lift with dipping conversions is a common signal that exposure and intent aren’t aligning. Start by breaking the funnel into view -> click -> add-to-cart -> purchase and track where the drop starts.
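A minimal sketch of that step-by-step funnel check, assuming you can pull per-variant counts for each stage (the stage names and numbers here are illustrative, not your schema):

```python
# Hypothetical per-variant counts at each funnel stage (illustrative numbers).
FUNNEL = ["view", "click", "add_to_cart", "purchase"]

control = {"view": 100_000, "click": 4_000, "add_to_cart": 1_200, "purchase": 600}
treatment = {"view": 100_000, "click": 5_000, "add_to_cart": 1_250, "purchase": 560}

def step_rates(counts):
    """Conversion rate of each step relative to the previous one."""
    return {
        f"{a}->{b}": counts[b] / counts[a]
        for a, b in zip(FUNNEL, FUNNEL[1:])
    }

def first_drop(control, treatment):
    """Return the first funnel step where treatment underperforms control."""
    c, t = step_rates(control), step_rates(treatment)
    for step in c:
        if t[step] < c[step]:
            return step, c[step], t[step]
    return None

step, c_rate, t_rate = first_drop(control, treatment)
print(f"drop starts at {step}: control={c_rate:.3f} treatment={t_rate:.3f}")
# → drop starts at click->add_to_cart: control=0.300 treatment=0.250
```

With these invented numbers the treatment wins on view→click (that's your CTR lift) but loses at click→add-to-cart, which is exactly the pattern that would explain higher engagement with lower purchases.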
Do a rigorous funnel analysis: compute exposure rate, CTR, add-to-cart rate, CVR, and revenue per impression at each step. If CTR goes up but CVR drops, you may be surfacing items that attract clicks but that users don't actually buy, or mixing in elements that discourage checkout. Run A/B tests with different ranking objectives (e.g., weighting revenue or order value) and compare performance across key segments (new vs. returning users, device, referrer).
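Here's one way to compute those per-segment metrics from aggregated rows; the field names and values are assumptions for illustration, not your actual schema:

```python
from collections import defaultdict

# Illustrative aggregated rows; field names are assumptions, not your schema.
rows = [
    {"segment": "new",       "impressions": 1000, "clicks": 50, "purchases": 4,  "revenue": 120.0},
    {"segment": "returning", "impressions": 1000, "clicks": 40, "purchases": 10, "revenue": 500.0},
]

def segment_metrics(rows):
    """Roll up counts per segment, then derive CTR, click-to-purchase CVR,
    and revenue per impression."""
    agg = defaultdict(lambda: {"impressions": 0, "clicks": 0, "purchases": 0, "revenue": 0.0})
    for r in rows:
        a = agg[r["segment"]]
        for k in ("impressions", "clicks", "purchases", "revenue"):
            a[k] += r[k]
    return {
        seg: {
            "ctr": a["clicks"] / a["impressions"],
            "cvr": a["purchases"] / a["clicks"],           # click-to-purchase
            "rev_per_impression": a["revenue"] / a["impressions"],
        }
        for seg, a in agg.items()
    }

for seg, m in segment_metrics(rows).items():
    print(seg, m)
```

Comparing these three numbers side by side per segment often localizes the problem fast: a segment where CTR rose but revenue per impression fell is where to dig first.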
Concerning the 'filter bubble' risk, consider diversity-aware ranking. A small diversification constraint can prevent all users from being shown only the same popular items, which sometimes harms purchase intent. Track novelty, catalog coverage, and conversion lift per cluster to decide how far to push diversity.
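One common way to add that diversification constraint is a Maximal Marginal Relevance (MMR) re-rank. This is a sketch under simplifying assumptions: relevance scores are taken as given, and "similarity" is just same-category membership, standing in for whatever item features you actually have:

```python
# MMR re-ranking sketch: greedily pick items, trading relevance against
# similarity to items already selected. Scores/categories are illustrative.
def mmr_rerank(items, k, lambda_=0.8):
    """items: list of (item_id, relevance, category). Returns k item ids."""
    selected = []
    pool = list(items)
    while pool and len(selected) < k:
        def score(it):
            # Penalty = 1 if any selected item shares the category, else 0.
            penalty = max(
                (1.0 if it[2] == s[2] else 0.0) for s in selected
            ) if selected else 0.0
            return lambda_ * it[1] - (1 - lambda_) * penalty
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [it[0] for it in selected]

items = [("a", 0.95, "shoes"), ("b", 0.94, "shoes"), ("c", 0.90, "bags"), ("d", 0.85, "shoes")]
print(mmr_rerank(items, k=3))  # → ['a', 'c', 'b']
```

Note how item "c" jumps ahead of "b" despite lower raw relevance, because "b" duplicates the category of the already-chosen "a". Tuning `lambda_` lets you A/B test how much diversity the business can tolerate before relevance loss hurts.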
Diagnostic approach: do a counterfactual analysis (what would conversions have been if you showed a baseline ranking?). Run a staged rollout with 2 variants plus control, collect 2–4 weeks of data, and predefine success criteria. Look for interactions with search, promotions, and price- or promo-driven behavior.
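For the counterfactual piece, an inverse propensity scoring (IPS) estimator is the standard starting point. This sketch assumes you log, per impression, the probability the deployed policy showed the item and can compute the probability the baseline would have shown it; the log rows below are invented for illustration:

```python
# IPS sketch: estimate the conversion rate a baseline policy would have
# achieved, using logs from the deployed policy. Requires logged propensities.
def ips_estimate(logs):
    """logs: dicts with 'reward' (1 if converted), 'p_logged' (probability
    the deployed policy showed this item), 'p_target' (probability the
    baseline policy would have shown it)."""
    total = sum(r["reward"] * r["p_target"] / r["p_logged"] for r in logs)
    return total / len(logs)

logs = [
    {"reward": 1, "p_logged": 0.5, "p_target": 0.25},
    {"reward": 0, "p_logged": 0.5, "p_target": 0.25},
    {"reward": 1, "p_logged": 0.2, "p_target": 0.4},
    {"reward": 0, "p_logged": 0.2, "p_target": 0.4},
]
print(ips_estimate(logs))  # → 0.625
```

In practice you'd want clipped or self-normalized variants to control variance when propensities get small, but even this plain estimator can tell you whether the baseline ranking would likely have converted better on the same traffic.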
Data sanity checks: verify there's no leakage between variants, align attribution windows, and confirm recommended-item impressions aren't double-counted. Inspect average order value and order size by recommendation exposure, and plot those metrics per cohort. Use offline-to-online correlation checks to validate that offline lift actually translates into online revenue.
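That offline-to-online check can be as simple as correlating, across past experiments, the offline metric lift against the observed online revenue lift. A minimal sketch with invented numbers (each pair is one historical experiment):

```python
import math

# Per-experiment pairs: offline metric lift vs. observed online revenue lift.
# All numbers invented for illustration.
offline_lift = [0.02, 0.05, 0.01, 0.08, 0.03]
online_lift  = [0.010, 0.020, -0.005, 0.030, 0.012]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"offline-online correlation: {pearson(offline_lift, online_lift):.3f}")
```

If this correlation is weak across your experiment history, offline precision/recall gains simply aren't predictive of revenue for your product, and that alone explains a lot of "great offline, disappointing online" surprises.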
Happy to help you sketch a concrete diagnostic playbook with success criteria, sample experiment designs, and a minimal dashboard. If you share high-level metrics or a rough rollout plan, I can tailor it for your stack.