Accuracy & Quality Methodology

How we measure what we claim

Our 3-Tier Accuracy Framework

We don't have a single accuracy number. We have three distinct measurements — each serving a different purpose and representing a different level of rigor.

Tier 1

Human-Audited Accuracy — The Gold Standard

92.6%

Our most rigorous measurement. A human reviewer listens to the original audio and manually re-transcribes it word-for-word, then compares it to Retena's output.

Sample size: 500+ voice notes manually re-reviewed
Measurement: Word Error Rate (WER), the number of errors divided by the total words in the reference transcript; accuracy is reported as 100% minus WER
Audio types: Noisy job sites, mixed languages, heavy accents, recordings over 1 minute
Update cadence: Reviewed monthly with fresh samples

This is the number we stand behind when we talk about accuracy. It's not cherry-picked — reviewers are assigned random samples from real production audio.
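The page does not say how per-note results are combined into the single Tier 1 figure. As a hedged illustration only, the sketch below contrasts two common choices: pooling all errors and reference words across the audited notes, versus averaging each note's accuracy. The function names and the sample counts are hypothetical and are not Retena data.

```python
def pooled_accuracy(notes: list[tuple[int, int]]) -> float:
    """Accuracy from pooled counts: 100 - (total errors / total reference words) * 100.

    Each item in `notes` is (errors, reference_words) for one audited note.
    """
    total_errors = sum(errors for errors, _ in notes)
    total_words = sum(words for _, words in notes)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return 100.0 - 100.0 * total_errors / total_words


def mean_note_accuracy(notes: list[tuple[int, int]]) -> float:
    """Average of per-note accuracies; short notes weigh as much as long ones."""
    if not notes:
        raise ValueError("no notes to average")
    return sum(100.0 - 100.0 * errors / words for errors, words in notes) / len(notes)


# Made-up counts: one short note with 1 error in 10 words,
# one long note with 5 errors in 100 words.
sample = [(1, 10), (5, 100)]
print(round(pooled_accuracy(sample), 2))     # pooled over 110 words
print(round(mean_note_accuracy(sample), 2))  # mean of 90% and 95%
```

The two aggregates differ whenever note lengths vary, so a published figure is easier to interpret when the aggregation method is stated alongside it.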

Tier 2

Real-Life Production Quality

85.8/100

Measured across the live production stream using Retena's transcription quality scoring.

Sample size: 4,422 scored production transcriptions
Source: Actual WhatsApp voice notes from real users (anonymized)
Coverage: All audio quality levels — not filtered or cherry-picked
Update cadence: Continuously updated as volume grows

This score is currently in the Excellent band. It includes short clips, background noise, whispers, and every edge case that gets sent over WhatsApp.

Tier 3

AI Quality Score — Per-Note, Internal

0–100

Every transcription receives an internal quality score generated by a secondary AI model. This is not published per-note to users — it drives our internal review pipeline.

Signals used: Word count ratio, speech rate, repetition detection, punctuation presence, confidence values
Bands: Excellent / Good / Review / Poor
Purpose: Flag transcriptions that may need human review or should surface a low-confidence warning
Relationship to WER: A complement, not a replacement. A high AI score does not guarantee a low WER
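To make a few of the listed signals concrete, here is a minimal sketch of how speech rate, repetition, and punctuation presence could be derived from a transcript and its audio duration. These definitions are assumptions for illustration; the page does not describe how Retena's secondary model computes its signals, and word count ratio and confidence values are omitted because their inputs are not specified.

```python
def speech_rate(transcript: str, duration_seconds: float) -> float:
    """Words per second of audio (hypothetical definition)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(transcript.split()) / duration_seconds


def repetition_ratio(transcript: str) -> float:
    """Share of adjacent word pairs where a word repeats the one before it."""
    words = transcript.lower().split()
    if len(words) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)
    return repeats / (len(words) - 1)


def has_punctuation(transcript: str) -> bool:
    """True if the transcript contains any common punctuation mark."""
    return any(ch in ".,!?;:" for ch in transcript)


print(speech_rate("one two three four", 2.0))  # 2.0 words per second
print(repetition_ratio("pour the cement"))     # 0.0, no repeated words
print(has_punctuation("Pour the cement."))     # True
```

An implausibly high speech rate or a high repetition ratio can hint at a looping or hallucinated transcription, which is the kind of case a review pipeline would want to surface.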

Notes flagged as "Review" or "Poor" are pulled into our human audit queue, which feeds back into Tier 1 data.
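The routing described above can be sketched as follows. The four band names come from this page, but the numeric cut-offs are hypothetical: the page does not publish the band boundaries, so the values below are placeholders chosen only so that 85.8 falls in Excellent, as stated for Tier 2.

```python
# Hypothetical minimum scores per band, checked from highest to lowest.
# The real boundaries are not published on this page.
BAND_THRESHOLDS = [(85.0, "Excellent"), (70.0, "Good"), (50.0, "Review")]

AUDIT_BANDS = {"Review", "Poor"}


def band_for(score: float) -> str:
    """Map a 0-100 quality score to one of the four bands."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    for minimum, band in BAND_THRESHOLDS:
        if score >= minimum:
            return band
    return "Poor"


def needs_human_audit(score: float) -> bool:
    """True when the note's band sends it to the human audit queue."""
    return band_for(score) in AUDIT_BANDS


print(band_for(85.8))           # Excellent (under the placeholder cut-offs)
print(needs_human_audit(42.0))  # True
```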

Why This Matters

Most ASR (automatic speech recognition) benchmarks are run on clean studio audio — datasets like LibriSpeech, which consists of read-aloud audiobooks in quiet environments. Those benchmarks routinely show 97–99% accuracy.

That is not your use case.

Real WhatsApp audio is:

Recorded on the move — construction sites, vehicles, outdoor markets
Multilingual — Spanish and English in the same sentence
Accented — regional dialects, non-native speakers
Short and clipped — 5-second messages with no context
Sometimes voice-over-video with background audio bleeding in

We benchmark on what users actually send. That's how we know the current 85.8/100 production quality score is real.

"We publish what's real, not what's flattering."

How Word Error Rate (WER) Works

WER is the standard measurement for transcription quality. It compares what the model produced against a known-correct reference transcript.

The three error types

Substitution (S): The wrong word was transcribed (e.g., "cement" → "segment")
Insertion (I): An extra word was added that wasn't spoken
Deletion (D): A spoken word was missed entirely

The formula

WER = (S + I + D) / N × 100%

Where N = total number of words in the reference transcript

Plain-language example

If a 10-word sentence has 1 substitution error: WER = 1/10 × 100% = 10%, which corresponds to 90% accuracy (100% minus WER).
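The formula and the 10-word example can be reproduced with a short word-level alignment. This is a minimal sketch, not Retena's scoring code: it uses a standard edit-distance alignment, lowercases both transcripts, and splits on whitespace, all of which are assumptions. The names WerResult and word_error_rate are hypothetical, and the sample sentence is made up.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    insertions: int
    deletions: int
    reference_words: int

    @property
    def wer(self) -> float:
        """WER as a percentage: (S + I + D) / N * 100."""
        errors = self.substitutions + self.insertions + self.deletions
        return 100.0 * errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align two transcripts word by word and count S, I, and D."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = (total errors, S, I, D) for the best alignment of
    # the first i reference words with the first j hypothesis words.
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, n, d)          # match, no new error
            else:
                diagonal = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = cost[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            t, s, n, d = cost[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            # Tuples compare by total errors first, so this keeps a minimal alignment.
            cost[i][j] = min(diagonal, deletion, insertion)

    _, s, n, d = cost[len(ref)][len(hyp)]
    return WerResult(s, n, d, len(ref))


result = word_error_rate(
    "we need ten more bags of cement on site today",
    "we need ten more bags of segment on site today",
)
print(result.wer)          # 10.0
print(100.0 - result.wer)  # 90.0
```

Because insertions count as errors but do not add to N, WER can exceed 100% on badly garbled output, so 100% minus WER is only a meaningful accuracy figure when WER stays below 100%.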

Things that increase WER in real-world audio:

Accented speech — especially regional or non-native accents
Technical terminology and proper nouns (brand names, street names)
Background noise, wind, or competing voices
Very short recordings with little acoustic context
Code-switching (mixing languages mid-sentence)

Our 92.6% Tier 1 figure accounts for all of these conditions: the audited samples include the hardest audio in our corpus, not just the easiest.

What We Don't Claim

We don't claim 99% accuracy — that number comes from clean lab audio and doesn't reflect your WhatsApp voice notes.
We don't compare ourselves to academic benchmarks (LibriSpeech, etc.) — those datasets don't represent your use case.
We don't pad our numbers with only the best-performing samples. Our Tier 2 benchmark includes everything.
We don't freeze our numbers. When the data changes — up or down — we update them.

Questions?

If you have questions about our methodology, want to understand a specific transcription result, or have accuracy-related feedback, we want to hear from you.

Email us at [email protected]. Telegram support is on the roadmap but is not available yet.