About project
Solution
AI model testing, Prompt engineering validation, Bias detection, Regression testing, API testing, Integration testing, Security testing
Technologies
OpenAI GPT-4, LangChain, Python, MLflow, PyTorch, Postman, Jira
Country
United States
Industry
Client
The client is a mid-sized US SaaS company offering a sales enablement platform for enterprise B2B teams. They had recently launched an AI-powered sales copilot that drafts prospecting and follow-up emails. However, the tool’s inconsistent, sometimes biased content resulted in low user trust and poor adoption among sales reps.
Project overview
Let’s create a new success story together.
Before
- High hallucination rate
- Off-brand messaging
- Low user trust
- Slowed-down adoption
After
- 60% fewer hallucinations
- Consistent professional tone
- Satisfaction 6.5 → 8.7
- 40% more active users
Project Duration
6 months
Team Composition
1 QA Lead
4 Manual QAs
Challenge
The client’s sales copilot was intended to help business development representatives save time by drafting prospecting and follow-up emails. Instead, it quickly became a source of frustration and risk. Users reported that the AI invented product capabilities, misrepresented pricing, and sometimes sent off-brand or unprofessional messages, creating potential reputational harm.
Some responses also showed gender and racial bias, undermining trust in the tool. Because the company had no formal way to measure or review AI outputs, quality issues were discovered late, often only after customers flagged them. The existing QA team, experienced in functional testing, lacked the specialized methods needed to validate generative AI content and ensure safety, tone, and factual correctness.
Key challenges on the project included:
- The solution frequently fabricated product details or commitments that the company couldn’t honor.
- Messages lacked a consistent tone — some overly casual, some excessively formal — weakening the brand’s voice.
- Gender and racial bias surfaced in the generated text.
- There were no clear quality metrics or repeatable review methods for AI outputs.
The client’s existing QA team had experience with functional testing but lacked structured ways to measure the quality, safety, and trustworthiness of generative AI.
Solutions
TestFort introduced a focused, scenario-driven AI QA framework to bring control and consistency to the copilot’s performance. We began by reviewing existing prompts and outputs, then built a library of 2,500+ edge cases and real sales scenarios to test tone, accuracy, and compliance. A clear scoring system was created to rate factual correctness, professionalism, and inclusivity, giving the client a repeatable way to evaluate every AI response.
To keep improvements sustainable, we added regression checks to monitor model drift, reviewed CRM and email platform integrations, and performed security assessments to prevent prompt injection and data leakage. By formalizing AI validation and spotting issues early, the client was able to reduce rework and cut long-term QA and support costs by 25%.
Key activities included:
- Prompt library creation: Compiled 2,500+ edge-case and business-specific prompts to stress-test the model.
- Human-guided output review: Experts scored AI responses for factual accuracy, tone, compliance, and inclusivity.
- Bias detection workshops: Applied checklists for gender and racial fairness, with specialist review cycles.
- Regression review boards: Regular side-by-side comparison of responses from new model versions to spot drift.
- API & integration review: Validated copilot connections with CRM and email tools through exploratory sessions.
- Security review: Checked for prompt injection and data leakage risks.
- Quality dashboards: Consolidated evaluation scores into a simple reporting system to be easily read by humans.
This process gave the client clear, actionable insights into how the copilot performed, enabling safe model updates and a better user experience.
Technologies
Choosing the right tools was key to evaluating and improving AI performance while keeping the process flexible and aligned with project goals. We needed a stack that supported prompt management, experiment tracking, and detailed review of outputs without relying on full automation.
- OpenAI GPT-4
- LangChain
- MLflow
- PyTorch
- Postman
- Jira
Types of testing
Security testing
Exploring prompt injection and risks linked to sensitive data leakage.
Compliance testing
Ensuring the outputs meet internal and legal communication standards.
Results
In just six months, the copilot evolved from an often unreliable pilot feature to a trusted sales productivity tool. The structured, human-driven AI testing approach dramatically reduced hallucinations, removed bias, and gave the product team a clear quality baseline.
Sales representatives regained confidence, were able to spend less time rewriting AI drafts, and embraced the tool in daily outreach. Product managers could now ship updates knowing how quality and safety would be reliably measured.
60%
fewer hallucinations
35%
improvement in accuracy
6.5 → 8.7
user satisfaction growth
40%
rise in monthly active users
Ready to enhance your product’s stability and performance?
Schedule a call with our Head of Testing Department!
Bruce Mason
Delivery Director
