About project
Solution
AI model testing, Data quality validation, Model drift detection, Regression testing, Bias & relevance evaluation, Performance testing, Security testing
Technologies
Python, OpenAI GPT-4, scikit-learn, MLflow, SQL-based data profiling, Postman, Jira
Country
United States
Industry
E-commerce
Client
The client is a mid-sized US online retailer with a rapidly growing catalog of consumer goods and millions of monthly visitors. Its AI-driven recommendation engine is central to sales, but declining relevance and performance were hurting user engagement and revenue.
Project overview
Before
- 20% drop in CTR
- Irrelevant suggestions
- High defect count
- Customer frustration
After
- CTR fully restored
- 95% validated precision
- 70% fewer defects
- Improved user trust
Project Duration
7 months
Team Composition
1 QA Lead
5 Manual QAs
Challenge
The client’s recommendation engine had once been a key revenue driver, boosting user engagement and average order value. But over time, its performance declined sharply. Customers began seeing irrelevant or repetitive product suggestions, leading to frustration and fewer clicks. Internal analytics showed a drop of more than 20% in click-through rates, which directly impacted sales and marketing campaigns built around personalized recommendations.
A deeper investigation revealed several intertwined issues:
- Model drift: The model's accuracy gradually degraded as shopping patterns, the product catalog, and customer behavior evolved.
- Poor data quality: Missing attributes, outdated product descriptions, duplicate listings, and mislabeled categories disrupted training signals.
- Lack of structured QA: There were no formal test cases, no relevance scoring, and no clear way to measure AI quality beyond high-level engagement metrics.
- Delayed detection: Problems surfaced only after the revenue impact became visible, which slowed the team's response.
With the holiday shopping season approaching, leadership needed to stabilize recommendations quickly, stop revenue decline, and create a sustainable way to monitor and maintain model quality.
Solutions
Our comprehensive testing strategy involved four key approaches:
AI model testing. We designed structured test cases and a relevance scoring framework for recommendation outputs, including bias and relevance evaluation, so suggestion quality could be measured consistently instead of inferred from high-level engagement metrics.
Data quality validation. Using SQL-based profiling, we screened the product catalog for missing attributes, outdated descriptions, duplicate listings, and mislabeled categories before each retraining cycle.
Model drift detection. We tracked precision and relevance metrics across model versions in MLflow, catching degradation as soon as the numbers dipped rather than after revenue was affected (a simplified sketch follows below).
Regression testing. After each model update or data fix, we re-ran the evaluation suite to catch new relevance regressions and unintended side effects.
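To make the drift-detection idea concrete, here is a minimal sketch of a relevance check that logs precision@k to MLflow. The session data, metric name, and alert threshold are illustrative assumptions, not the client's actual pipeline:

```python
import mlflow

def precision_at_k(recommended, relevant, k=10):
    """Share of the top-k recommended items the user actually engaged with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision_at_k(sessions, k=10):
    """Mean precision@k over (recommended, relevant) pairs from user sessions."""
    scores = [precision_at_k(rec, rel, k) for rec, rel in sessions]
    return sum(scores) / len(scores)

# Hypothetical evaluation batch: recommended items vs. items the user clicked.
sessions = [
    (["sku_1", "sku_2", "sku_3"], {"sku_2", "sku_3"}),
    (["sku_4", "sku_5", "sku_6"], {"sku_4"}),
]

with mlflow.start_run(run_name="weekly_relevance_check"):
    score = average_precision_at_k(sessions, k=3)
    mlflow.log_metric("precision_at_3", score)
    # A fixed threshold is a simple stand-in for a real drift alert.
    if score < 0.9:
        print(f"precision@3 = {score:.2f} - below threshold, check for drift")
```

Logging each run to MLflow gives the team a metric history, so a downward trend is visible weeks before it shows up in click-through rates.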
Technologies
The chosen tools helped the team track AI quality over time, validate data, and run structured evaluations without relying on heavy automation. The stack supported flexible experiments, clear reporting, and collaboration between QA engineers and data teams.
- Python
- OpenAI GPT-4
- scikit-learn
- MLflow
- SQL-based data profiling
- Postman
- Jira
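As an example of the SQL-based profiling, the sketch below flags duplicate listings and missing attributes in a toy in-memory catalog. The table and column names are hypothetical; the real checks ran against the client's product database:

```python
import sqlite3

# In-memory stand-in for the product catalog.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (sku TEXT PRIMARY KEY, title TEXT, category TEXT);
    INSERT INTO products VALUES
        ('sku_1', 'Red Ceramic Mug', 'Kitchen'),
        ('sku_2', 'Red Ceramic Mug', 'Kitchen'),  -- duplicate listing
        ('sku_3', 'LED Desk Lamp', NULL);         -- missing category
""")

# Duplicate titles within a category distort the model's training signal.
duplicates = conn.execute("""
    SELECT title, category, COUNT(*) AS n
    FROM products
    GROUP BY title, category
    HAVING n > 1
""").fetchall()

# Missing attributes weaken the features the model learns from.
missing = conn.execute(
    "SELECT sku FROM products WHERE category IS NULL OR title IS NULL"
).fetchall()

print("Duplicate listings:", duplicates)
print("Rows with missing attributes:", missing)
```

Checks like these ran before each retraining cycle, so bad catalog records were fixed upstream instead of being learned by the model.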
Types of testing
Security testing
Assessing data exposure and leakage risks to keep sensitive customer information protected.
Performance testing
Measuring response time and scalability, especially during heavy shopping periods.
Compliance testing
Making sure that recommendations meet data privacy and fairness standards.
Results
After seven months, the platform's recommendation engine was stable and generating substantial revenue again. By introducing structured AI QA and proactive data checks, TestFort helped the client recover engagement and build long-term protection against drift.
Product managers now have a clear early warning system for relevance issues and a reliable way to validate data before each model retraining cycle.
20%+
CTR drop reversed
57%
boost in mobile conversions
25%
cost saving on QA and support
35%
improvement in user satisfaction
Ready to enhance your product’s stability and performance?
Schedule a call with our Head of Testing Department!
Bruce Mason
Delivery Director
