Accuracy & Quality Methodology
How we measure what we claim
Our 3-Tier Accuracy Framework
We don't have a single accuracy number. We have three distinct measurements — each serving a different purpose and representing a different level of rigor.
Tier 1
Human-Audited Accuracy — The Gold Standard
92.6%
Our most rigorous measurement. A human reviewer listens to the original audio and manually re-transcribes it word-for-word, then compares it to Retena's output.
- Sample size: 500+ voice notes manually re-reviewed
- Measurement: Word Error Rate (WER), the number of word errors divided by the total words in the human reference transcript; the accuracy figure is 100% minus WER
- Audio types: Noisy job sites, mixed languages, heavy accents, recordings over 1 minute
- Update cadence: Reviewed monthly with fresh samples
This is the number we stand behind when we talk about accuracy. It's not cherry-picked — reviewers are assigned random samples from real production audio.
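As a rough illustration of the two steps described above (drawing a random audit sample, then turning reviewer error counts into one accuracy figure), here is a minimal Python sketch. The function names, the seed parameter, and the choice to pool errors across all notes instead of averaging per-note WER are assumptions made for the example, not a description of Retena's actual pipeline.

```python
import random
from typing import Iterable, List, Tuple


def draw_audit_sample(note_ids: Iterable[str], sample_size: int, seed: int) -> List[str]:
    """Pick notes uniformly at random so no one hand-selects easy audio.

    Hypothetical helper: the seed only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(list(note_ids), sample_size)


def pooled_accuracy(audited: List[Tuple[int, int]]) -> float:
    """Convert (error_count, reference_word_count) pairs into one accuracy value.

    Assumption: errors are pooled over the whole sample, so long notes
    weigh more than short ones. Accuracy is 1 - WER, returned as a fraction.
    """
    total_errors = sum(errors for errors, _ in audited)
    total_words = sum(words for _, words in audited)
    if total_words == 0:
        raise ValueError("audited sample contains no reference words")
    return 1.0 - total_errors / total_words


if __name__ == "__main__":
    notes = [f"note-{i}" for i in range(1000)]
    sample = draw_audit_sample(notes, sample_size=5, seed=7)
    print(sample)
    # Two audited notes: 1 error in 10 words, 3 errors in 40 words.
    print(f"{pooled_accuracy([(1, 10), (3, 40)]):.1%}")
```

Pooling is only one reasonable choice; averaging each note's WER would give short clips the same weight as long recordings and can produce a different figure from the same data.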
Tier 2
Real-Life Production Quality
85.8/100
Measured across the live production stream using Retena's transcription quality scoring.
- Sample size: 4,422 scored production transcriptions
- Source: Actual WhatsApp voice notes from real users (anonymized)
- Coverage: All audio quality levels — not filtered or cherry-picked
- Update cadence: Continuously updated as volume grows
This score is currently in the Excellent band. It includes short clips, background noise, whispers, and every edge case that gets sent over WhatsApp.
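One simple way to keep an average like this current as new transcriptions arrive is an incremental mean, which updates the figure without re-reading earlier scores. The sketch below is illustrative only; the class name `RunningScore` is hypothetical, and nothing here implies the production score is computed this way.

```python
class RunningScore:
    """Incrementally maintained mean of per-transcription quality scores (0-100)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def add(self, score: float) -> None:
        if not 0.0 <= score <= 100.0:
            raise ValueError("score must be between 0 and 100")
        self.count += 1
        # Shift the mean toward the new score by 1/count of the difference.
        self.mean += (score - self.mean) / self.count


if __name__ == "__main__":
    tracker = RunningScore()
    for s in (80.0, 90.0, 100.0):
        tracker.add(s)
    print(tracker.count, tracker.mean)
```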
Tier 3
AI Quality Score — Per-Note, Internal
0–100
Every transcription receives an internal quality score generated by a secondary AI model. This is not published per-note to users — it drives our internal review pipeline.
- Signals used: Word count ratio, speech rate, repetition detection, punctuation presence, confidence values
- Bands: Excellent / Good / Review / Poor
- Purpose: Flag transcriptions that may need human review or should surface a low-confidence warning
- Relationship to WER: A complement, not a replacement. A high AI score does not guarantee a low WER
Notes flagged as "Review" or "Poor" are pulled into our human audit queue, which feeds back into Tier 1 data.
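The routing described above can be sketched in a few lines of Python. The band names come from this page, but the numeric cut-offs (85, 70, 50) are hypothetical: the real thresholds are not stated here, and the Excellent floor was chosen only so that the 85.8 production score lands in the Excellent band as reported.

```python
from enum import Enum


class Band(Enum):
    EXCELLENT = "Excellent"
    GOOD = "Good"
    REVIEW = "Review"
    POOR = "Poor"


# Hypothetical cut-offs for illustration; the actual thresholds are not published here.
BAND_FLOORS = [
    (85.0, Band.EXCELLENT),
    (70.0, Band.GOOD),
    (50.0, Band.REVIEW),
]


def band_for(score: float) -> Band:
    """Map a 0-100 quality score to its band, checking the highest floor first."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    for floor, band in BAND_FLOORS:
        if score >= floor:
            return band
    return Band.POOR


def needs_human_audit(score: float) -> bool:
    """Review and Poor notes are the ones sent to the human audit queue."""
    return band_for(score) in {Band.REVIEW, Band.POOR}


if __name__ == "__main__":
    for s in (91.0, 74.5, 55.0, 20.0):
        print(s, band_for(s).value, needs_human_audit(s))
```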
Why This Matters
Most ASR (automatic speech recognition) benchmarks are run on clean studio audio — datasets like LibriSpeech, which consists of read-aloud audiobooks in quiet environments. Those benchmarks routinely show 97–99% accuracy.
That is not your use case.
Real WhatsApp audio is:
- Recorded on the move — construction sites, vehicles, outdoor markets
- Multilingual — Spanish and English in the same sentence
- Accented — regional dialects, non-native speakers
- Short and clipped — 5-second messages with no context
- Sometimes voice-over-video with background audio bleeding in
We benchmark on what users actually send. That is why the current 85.8/100 production quality score reflects real-world audio rather than lab conditions.
"We publish what's real, not what's flattering."
How Word Error Rate (WER) Works
WER is the standard measurement for transcription quality. It compares what the model produced against a known-correct reference transcript.
The three error types
- Substitution (S): The wrong word was transcribed (e.g., "cement" → "segment")
- Insertion (I): An extra word was added that wasn't spoken
- Deletion (D): A spoken word was missed entirely
The formula
WER = (S + I + D) / N × 100%
Where N = total number of words in the reference transcript
Plain-language example
If a 10-word sentence has 1 substitution error: WER = 1/10 × 100% = 10%, which corresponds to 90% word accuracy (100% minus WER).
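The same calculation can be sketched in Python. This version aligns the two transcripts with a standard word-level edit distance and counts substitutions, insertions, and deletions along the cheapest alignment. The names `WerResult` and `word_error_rate` are hypothetical, and the text normalization (lower-casing and whitespace splitting only) is an assumption; a production scorer would also need rules for punctuation, numbers, and similar cases.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    insertions: int
    deletions: int
    ref_words: int

    @property
    def wer(self) -> float:
        # WER = (S + I + D) / N, returned as a fraction (0.10 means 10%).
        errors = self.substitutions + self.insertions + self.deletions
        return errors / self.ref_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = (total errors, S, I, D) for aligning ref[:i] with hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # hypothesis empty: every ref word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # reference empty: every hyp word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match, no new error
                continue
            t, s, n, d = cost[i - 1][j - 1]
            substitute = (t + 1, s + 1, n, d)
            t, s, n, d = cost[i][j - 1]
            insert = (t + 1, s, n + 1, d)
            t, s, n, d = cost[i - 1][j]
            delete = (t + 1, s, n, d + 1)
            cost[i][j] = min(substitute, insert, delete)

    _, s, n, d = cost[-1][-1]
    return WerResult(s, n, d, len(ref))


if __name__ == "__main__":
    result = word_error_rate(
        "pour the cement on the north side before noon tomorrow",
        "pour the segment on the north side before noon tomorrow",
    )
    print(result, f"WER = {result.wer:.0%}")
```

Because insertions count as errors but do not increase N, WER can exceed 100% when the output contains many extra words, so "100% minus WER" is a convenient accuracy summary and not a strict percentage of correct words.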
Things that increase WER in real-world audio:
- Accented speech — especially regional or non-native accents
- Technical terminology and proper nouns (brand names, street names)
- Background noise, wind, or competing voices
- Very short recordings with little acoustic context
- Code-switching (mixing languages mid-sentence)
Our 92.6% Tier 1 figure was measured on audio that includes all of these conditions, not on a clean or easy subset.
What We Don't Claim
- We don't claim 99% accuracy — that number comes from clean lab audio and doesn't reflect your WhatsApp voice notes.
- We don't compare ourselves to academic benchmarks (LibriSpeech, etc.) — those datasets don't represent your use case.
- We don't pad our numbers with only the best-performing samples. Our Tier 2 benchmark includes everything.
- We don't freeze our numbers. When the data changes — up or down — we update them.
Questions?
If you have questions about our methodology, want to understand a specific transcription result, or have accuracy-related feedback, we want to hear from you.
Email us at [email protected]. Telegram support is on the roadmap but is not available yet.