70% Fewer Defects for AI-Powered Recommendations

Improving model stability, data quality, and accuracy to reverse a 20% drop in click-through rate and deliver reliable product suggestions.

About project

Solution

AI model testing, Data quality validation, Model drift detection, Regression testing, Bias & relevance evaluation, Performance testing, Security testing

Technologies

Python, OpenAI GPT-4, scikit-learn, MLflow, SQL-based data profiling, Postman, Jira

Country

United States

Industry

eCommerce

Project Duration

7 months

Team Composition

1 QA Lead

5 Manual QAs

Challenge

The client’s recommendation engine had once been a key revenue driver, boosting user engagement and average order value. But over time, its performance declined sharply. Customers began seeing irrelevant or repetitive product suggestions, leading to frustration and fewer clicks. Internal analytics showed a drop of more than 20% in click-through rates, which directly impacted sales and marketing campaigns built around personalized recommendations.

A deeper investigation revealed several intertwined issues:

  • Model drift: The model’s accuracy gradually degraded as shopping patterns, product catalogs, and customer behavior evolved.
  • Poor data quality: Missing attributes, outdated product descriptions, duplicate listings, and mislabeled categories disrupted training signals.
  • Lack of structured QA: There were no formal test cases, no relevance scoring, and no clear way to measure AI quality beyond high-level engagement metrics.
  • Delayed detection: Problems surfaced only after revenue impact became visible, slowing down response time.

With the holiday shopping season approaching, leadership needed to stabilize recommendations quickly, stop revenue decline, and create a sustainable way to monitor and maintain model quality.

Solutions

Our comprehensive testing strategy involved four key approaches:

AI model testing. We evaluated recommendation outputs against structured test cases and relevance scoring criteria, replacing the previous reliance on high-level engagement metrics with repeatable, measurable checks.

Data quality validation. We profiled the product catalog with SQL-based checks to surface missing attributes, outdated descriptions, duplicate listings, and mislabeled categories before they could disrupt training signals.
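As an illustration of what such SQL-based profiling can look like, the sketch below runs a few hypothetical checks against an assumed products table; the table name, columns, and SQLite connection are assumptions for the example, not the client’s actual schema.

import sqlite3  # illustrative only; the real catalog would live in the client's production database

# Hypothetical data-quality checks against an assumed `products` table.
CHECKS = {
    "missing_category": "SELECT COUNT(*) FROM products WHERE category IS NULL OR category = ''",
    "missing_description": "SELECT COUNT(*) FROM products WHERE description IS NULL OR description = ''",
    "duplicate_titles": (
        "SELECT COALESCE(SUM(cnt - 1), 0) FROM "
        "(SELECT COUNT(*) AS cnt FROM products GROUP BY title HAVING COUNT(*) > 1)"
    ),
}

def profile_catalog(db_path: str) -> dict:
    """Run each check and return the raw defect count per issue type."""
    with sqlite3.connect(db_path) as conn:
        return {name: conn.execute(query).fetchone()[0] for name, query in CHECKS.items()}

if __name__ == "__main__":
    print(profile_catalog("catalog.db"))  # e.g. {'missing_category': 42, ...}

A report like this can gate each retraining cycle: if defect counts exceed agreed thresholds, the data goes back for cleanup before the model sees it.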

Model drift detection. We tracked how recommendation quality shifted as shopping patterns, product catalogs, and customer behavior evolved, so degradation was caught well before it became visible in revenue.
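One lightweight way to quantify this kind of drift, shown here as a hedged sketch rather than the team’s exact method, is a population stability index that compares the model’s relevance-score distribution in a reference window against the current window; values above roughly 0.2 are commonly read as meaningful drift.

import numpy as np

def population_stability_index(reference, current, bins=10):
    # Bucket both score samples using the reference window's bin edges.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Convert to proportions; clip to avoid division by zero in empty buckets.
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Hypothetical relevance scores: last month's snapshot vs. this week's traffic.
rng = np.random.default_rng(0)
baseline = rng.beta(2, 5, size=10_000)
today = rng.beta(2, 8, size=10_000)
print(round(population_stability_index(baseline, today), 3))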

Regression, bias, and relevance testing. After each retraining cycle, we re-ran the evaluation suite to confirm stable performance across key product categories and to check suggestions for bias and relevance.
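For the relevance side, a ranking metric such as NDCG can turn reviewers’ graded judgments into a single score per cycle; the sketch below uses scikit-learn’s ndcg_score with made-up labels and scores purely for illustration.

import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical graded relevance labels from manual QA review (0 = irrelevant, 3 = highly relevant)
# for the top 5 recommendations shown to three test profiles, plus the model's ranking scores.
qa_labels = np.array([
    [3, 2, 0, 1, 0],
    [2, 2, 1, 0, 0],
    [3, 0, 0, 2, 1],
])
model_scores = np.array([
    [0.9, 0.7, 0.6, 0.4, 0.2],
    [0.8, 0.5, 0.5, 0.3, 0.1],
    [0.9, 0.8, 0.6, 0.5, 0.3],
])

# NDCG@5 near 1.0 means the model ranks what reviewers judged relevant near the top.
print(round(ndcg_score(qa_labels, model_scores, k=5), 3))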

Technologies

The chosen tools helped the team track AI quality over time, validate data, and run structured evaluations without relying on heavy automation. The stack supported flexible experiments, clear reporting, and collaboration between QA engineers and data teams.

  • Python
  • OpenAI GPT-4
  • scikit-learn
  • MLflow
  • SQL-based data profiling
  • Postman
  • Jira
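As an example of how MLflow can tie these pieces together, the sketch below logs evaluation metrics for one retraining candidate so quality can be compared across cycles; the experiment name, run name, and metric values are illustrative assumptions, not the project’s actual records.

import mlflow

# Hypothetical metrics produced by the evaluation suite for one retraining candidate.
evaluation = {"ndcg_at_5": 0.87, "relevance_score_psi": 0.08, "catalog_defect_rate": 0.012}

mlflow.set_experiment("recommendation-qa")             # assumed experiment name
with mlflow.start_run(run_name="retrain-candidate"):   # assumed run name
    mlflow.log_param("model_version", "candidate-42")  # hypothetical version tag
    for name, value in evaluation.items():
        mlflow.log_metric(name, value)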

Types of testing

Usability testing

Evaluating clarity, navigation, and user experience of recommendation widgets.

Security testing

Assessing data exposure and leakage risks to keep sensitive customer information protected.

Regression testing

Ensuring stable performance across key product categories after software updates.

API testing

Verifying various recommendation endpoints and integrations with the platform.

Performance testing

Measuring response time and scalability, especially during heavy shopping periods.

Compliance testing

Making sure that recommendations meet data privacy and fairness standards.
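To give a flavour of the API and performance checks listed above, the sketch below calls a hypothetical recommendations endpoint and asserts on status, duplicates, and a latency budget; the URL, parameters, and threshold are assumptions, not the client’s real service.

import time
import requests

ENDPOINT = "https://api.example-shop.com/v1/recommendations"  # hypothetical endpoint
LATENCY_BUDGET_MS = 300                                       # assumed response-time budget

def check_recommendations(user_id: str) -> None:
    start = time.perf_counter()
    response = requests.get(ENDPOINT, params={"user_id": user_id, "limit": 10}, timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert response.status_code == 200, f"unexpected status {response.status_code}"
    items = response.json().get("items", [])
    assert items, "empty recommendation list"
    assert len(items) == len({item["product_id"] for item in items}), "duplicate products returned"
    assert elapsed_ms <= LATENCY_BUDGET_MS, f"response took {elapsed_ms:.0f} ms"

check_recommendations("qa-test-user-001")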

Results

After seven months, the platform’s recommendation engine became reliable and able to generate substantial revenue again. By introducing structured AI QA and proactive data checks, TestFort helped the client increase engagement levels and establish long-term protection against drift.

Product managers now have a clear early warning system for relevance issues and a reliable way to validate data before each model retraining cycle.

  • 20%+ CTR drop reversed
  • 57% boost in mobile conversions
  • 25% cost saving on QA and support
  • 35% improvement in user satisfaction
