Accuracy & Quality Methodology

How we measure what we claim

Our 3-Tier Accuracy Framework

We don't have a single accuracy number. We have three distinct measurements — each serving a different purpose and representing a different level of rigor.

Tier 1

Human-Audited Accuracy — The Gold Standard

92.6%

Our most rigorous measurement. A human reviewer listens to the original audio and manually re-transcribes it word-for-word, then compares it to Retena's output.

  • Sample size: 500+ voice notes manually re-reviewed
  • Measurement: Word Error Rate (WER), errors divided by total words in the reference transcript; the 92.6% figure is word accuracy (100% − WER)
  • Audio types: Noisy job sites, mixed languages, heavy accents, recordings over 1 minute
  • Update cadence: Reviewed monthly with fresh samples

This is the number we stand behind when we talk about accuracy. It's not cherry-picked — reviewers are assigned random samples from real production audio.
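As an illustration of the random assignment described above, here is a minimal sketch of drawing an audit sample. The function name, the note ID format, the pool size, and the seed are all hypothetical; this is not Retena's actual pipeline, only one simple way to sample uniformly without replacement.

```python
import random
from typing import Sequence


def draw_audit_sample(note_ids: Sequence[str], sample_size: int, seed: int) -> list[str]:
    """Pick `sample_size` notes uniformly at random, without replacement.

    A fixed seed makes the draw reproducible, so the sample can be re-checked later.
    """
    if sample_size > len(note_ids):
        raise ValueError("sample_size cannot exceed the number of available notes")
    rng = random.Random(seed)  # isolated generator; does not touch global random state
    return rng.sample(list(note_ids), sample_size)


# Hypothetical pool of anonymized production note IDs (assumption for illustration).
pool = [f"note-{i:05d}" for i in range(10_000)]
monthly_sample = draw_audit_sample(pool, sample_size=500, seed=2024)
```

Because every note has the same chance of being drawn, the reviewers cannot steer the sample toward easy or hard recordings.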

Tier 2

Real-Life Production Quality

85.8/100

Measured across the live production stream using Retena's transcription quality scoring.

  • Sample size: 4,422 scored production transcriptions
  • Source: Actual WhatsApp voice notes from real users (anonymized)
  • Coverage: All audio quality levels — not filtered or cherry-picked
  • Update cadence: Continuously updated as volume grows

This score is currently in the Excellent band. It includes short clips, background noise, whispers, and every edge case that gets sent over WhatsApp.

Tier 3

AI Quality Score — Per-Note, Internal

0–100

Every transcription receives an internal quality score generated by a secondary AI model. This is not published per-note to users — it drives our internal review pipeline.

  • Signals used: Word count ratio, speech rate, repetition detection, punctuation presence, confidence values
  • Bands: Excellent / Good / Review / Poor
  • Purpose: Flag transcriptions that may need human review or should surface a low-confidence warning
  • Relationship to WER: A complement, not a replacement. A high AI score does not guarantee a low WER

Notes flagged as "Review" or "Poor" are pulled into our human audit queue, which feeds back into Tier 1 data.
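The routing described above can be sketched as a small lookup. The band names come from this page, but the numeric cut-offs below are assumptions chosen only for illustration (the real thresholds are not published here), and the function names are hypothetical.

```python
from enum import Enum


class Band(Enum):
    EXCELLENT = "Excellent"
    GOOD = "Good"
    REVIEW = "Review"
    POOR = "Poor"


# ASSUMPTION: illustrative cut-offs only; the actual band boundaries are not stated.
BAND_FLOORS = [(85.0, Band.EXCELLENT), (70.0, Band.GOOD), (50.0, Band.REVIEW)]


def band_for(score: float) -> Band:
    """Map a 0-100 quality score to a band using the assumed floors above."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    for floor, band in BAND_FLOORS:
        if score >= floor:
            return band
    return Band.POOR


def needs_human_audit(score: float) -> bool:
    """Notes in the Review or Poor band go to the human audit queue."""
    return band_for(score) in {Band.REVIEW, Band.POOR}


scores = [91.2, 74.0, 58.5, 33.0]
audit_queue = [s for s in scores if needs_human_audit(s)]
```

With these assumed floors, only the last two scores in the example would be queued for human review.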

Why This Matters

Most ASR (automatic speech recognition) benchmarks are run on clean studio audio — datasets like LibriSpeech, which consists of read-aloud audiobooks in quiet environments. Those benchmarks routinely show 97–99% accuracy.

That is not your use case.

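Real WhatsApp audio is recorded on noisy job sites, in mixed languages, and with heavy accents. It includes short clips, background noise, and whispers.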

We benchmark on what users actually send. That's how we know the current 85.8/100 production quality score is real.

"We publish what's real, not what's flattering."

How Word Error Rate (WER) Works

WER is the standard measurement for transcription quality. It compares what the model produced against a known-correct reference transcript.

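The three error types

  • Substitution (S): a word in the reference is replaced by a different word
  • Insertion (I): the output contains a word that is not in the reference
  • Deletion (D): a word in the reference is missing from the output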

The formula

WER = (S + I + D) / N × 100%

Where N = total number of words in the reference transcript

Plain-language example

If a 10-word sentence has 1 substitution error: WER = 1/10 × 100% = 10%, which corresponds to 90% word accuracy.
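To make the formula concrete, here is a minimal sketch that counts S, I, and D with a standard word-level edit-distance alignment and then applies WER = (S + I + D) / N × 100%. The function and class names, the lowercase-and-split normalization, and the example sentences are assumptions for illustration; this is not necessarily how Retena's own scoring is implemented.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    insertions: int
    deletions: int
    reference_words: int  # N in the formula

    @property
    def wer(self) -> float:
        """WER as a percentage: (S + I + D) / N × 100."""
        errors = self.substitutions + self.insertions + self.deletions
        return 100.0 * errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    # ASSUMPTION: simple normalization (lowercase, whitespace split).
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dp[i][j] = (total errors, S, I, D) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)  # hypothesis empty: every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)  # reference empty: every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, ins, d = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, ins, d)          # words match: no new error
            else:
                diagonal = (t + 1, s + 1, ins, d)  # substitution
            t, s, ins, d = dp[i - 1][j]
            deletion = (t + 1, s, ins, d + 1)
            t, s, ins, d = dp[i][j - 1]
            insertion = (t + 1, s, ins + 1, d)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    _, s, ins, d = dp[len(ref)][len(hyp)]
    return WerResult(s, ins, d, len(ref))


# The plain-language example: 10 reference words, 1 substitution ("site" heard as "side").
result = word_error_rate(
    "please send the invoice to the site manager by friday",
    "please send the invoice to the side manager by friday",
)
accuracy = 100.0 - result.wer
```

Note that insertions count against the output even though N only counts reference words, so WER can exceed 100% when the output is much longer than the reference.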

Things that increase WER in real-world audio include background noise, mixed languages, heavy accents, and longer recordings (over 1 minute).

Our 92.6% Tier 1 figure accounts for all of these conditions — it was measured on the hardest audio in our corpus, not the easiest.

What We Don't Claim

Questions?

If you have questions about our methodology, want to understand a specific transcription result, or have accuracy-related feedback, we want to hear from you.

Email us at [email protected]. Telegram support is on the roadmap but is not available yet.