AI Assistant Quality Audit for a CI/CD Platform

Helping a leading CI/CD platform identify 25 critical defects, reduce hallucination frequency by 60%, and establish a continuous testing framework that improved reply accuracy from 65% to 82%.

About project

Solution

Functional testing, Usability testing, Localization testing, Performance testing, Security testing, Automation testing, Adaptability testing, Proactivity testing, Hallucination detection testing, Explainability testing, Transparency testing

Technologies

JMeter, Postman, Selenium, Jira, TestRail, OWASP ZAP

Country

United States

Industry

Technology

Project Duration

8 weeks

Team Composition

3 Manual QAs

1 Automation QA

Challenge

The AI assistant was launched to simplify CI/CD setup and troubleshooting, but it struggled to meet user expectations for intelligence and reliability. While developers appreciated its quick access to documentation and code snippets, they also encountered issues with adaptability, accuracy, and continuity. Without proactive guidance or substantial memory, the assistant often felt like a searchable help page rather than an intelligent collaborator. 

To understand and address these limitations, the client was looking for an independent QA review to evaluate the assistant’s real-world performance, identify key improvement areas, and build a standards-based roadmap for enhancing developer experience and trust.

Key challenges included:

  • Undefined quality metrics or performance baselines
  • Limited adaptivity to user expertise and workflow context
  • Lack of proactive guidance or next-step suggestions
  • No memory retention between sessions
  • Inconsistent accuracy and occasional hallucinations
  • Weak localization and accessibility readiness
  • Absence of automated regression testing
  • Missing security safeguards for sensitive or prompt-injected data

Solutions

TestFort conducted a comprehensive 8-week AI quality audit combining manual exploratory testing, automated regression, and usability validation. Our team evaluated the assistant’s adaptability, context awareness, and explanation quality using real developer workflows that mirrored production CI/CD environments. We compared the AI’s behavior across different user personas and technical scenarios to measure consistency, proactivity, and accuracy under realistic conditions.

Based on these findings, we implemented a data-driven improvement plan aligned with ISO/IEC 25010:2023 standards and AI evaluation best practices. The project delivered a full set of test assets, automation coverage, and actionable recommendations that helped the client enhance reliability, performance, and user trust in the assistant.

  • Exploratory testing across 15-20 real CI/CD scenarios. We recreated authentic developer workflows such as build setup, test optimization, and error resolution to evaluate how well the assistant supported end-to-end development.
  • Persona-based validation for beginner, intermediate, and expert users. Each testing persona was used to assess whether the AI adapted its communication style, technical depth, and clarity to the user’s level of expertise.
  • Hallucination and response-consistency testing across repeated prompts. We checked the AI’s stability by asking the same questions multiple times, identifying when it generated incorrect or fabricated responses.
  • Security and privacy validation against prompt injection and data leaks. Testers attempted to extract sensitive data and observed the assistant’s handling of unsafe or corrupt input to ensure user data remained secure.
  • Usability testing with 5-8 real users. Participants evaluated the assistant’s helpfulness, clarity, and responsiveness during typical setup and troubleshooting tasks, providing valuable qualitative insights.
  • ISO/IEC 25010:2023 and ISO 9241-110:2020 compliance analysis. The product’s quality and usability were benchmarked against recognized international standards to identify objective improvement areas.
  • Automated regression suite with 15-20 core test cases. We developed and implemented automated scripts to verify that future model updates wouldn’t reintroduce previously fixed defects.
  • Continuous monitoring dashboard for AI accuracy and response time. A metrics dashboard was set up to track response quality, latency, and consistency in real time, providing visibility into ongoing performance trends.

Technologies

The tools used on the project allowed the team to track AI quality trends, validate response data, and conduct structured evaluations. This tech stack supported flexible experimentation, transparent reporting, and efficient collaboration between QA engineers and data specialists throughout the audit.

  • JMeter
  • Postman
  • Selenium
  • Jira
  • TestRail
  • OWASP ZAP

Types of testing

Functional testing

Verifying that core AI features work correctly across real developer workflows.

Usability testing

Evaluating clarity, intuitiveness, and user satisfaction in practical scenarios.

Localization testing

Checking preparedness for multilingual and region-specific user experiences.

Performance testing

Measuring response speed, stability, and output accuracy under typical load.

Security testing

Detecting vulnerabilities, unsafe prompts, and potential data exposure risks.

Automation testing

Maintaining consistency through repeatable regression and monitoring scripts.

Results

The 4-week audit provided the client with a complete picture of the AI assistant’s current quality and its impact on developer workflows. By combining manual, automated, and user-based testing, TestFort identified key limitations in adaptability, proactivity, and security while delivering actionable improvements backed by measurable data. The resulting recommendations and test assets formed a foundation for continuous quality monitoring and faster iteration cycles.

Following the implementation of the proposed fixes and test automation, the platform achieved significant improvements in response quality, reliability, and user satisfaction, creating the groundwork for a more trusted, production-ready AI assistant.

arrow-up-right-round

25

critical AI defects identified

arrow-up-right-round

60%

reduction in hallucinations

arrow-up-right-round

17%

accuracy improvement

arrow-up-right-round

20%

boost in user satisfaction

Ready to enhance your product’s stability and performance?

Schedule a call with our Head of Testing Department! 

    Bruce Mason

    Delivery Director

    Thank you for your message!

    We'll get back to you shortly!

    QA gaps don’t close with the tab.

    Level up you QA to reduce costs, speed up delivery and boost ROI.

    Start with booking a demo call
 with our team.