
Voice Passthrough: AI Voice Cloning for Translated Audio Output

What Voice Passthrough Does

Auto Voice Matching selects the closest pre-existing TTS voice in the target language to match the speaker’s characteristics. Voice Passthrough does something different: it uses the speaker’s actual voice as the source for TTS synthesis, creating translated audio that sounds like the same person speaking the target language.

The practical difference is significant in contexts where recognizing a familiar voice matters. A patient who has been speaking with a physician for 20 minutes knows the physician’s voice. If the translation of the physician’s words arrives in a clearly generic TTS voice, the patient loses the connection between the person they know and the words they are hearing. Voice Passthrough closes that gap.

Voice cloning is a capability with real-world implications — a cloned voice can produce audio that sounds like someone saying something they never said. Puente treats this seriously.

Voice Passthrough requires explicit two-step consent before activation:

  1. Consent checkbox: the user reads a disclosure explaining what voice cloning does, what data is used, and that the cloned voice model is stored only on-device, then checks the box
  2. “I Agree” tap: a separate confirmation button that must be pressed after the box is checked

The system hard-rejects any clone request that does not include consent: true in the request parameters. There is no way to enable Voice Passthrough for yourself or anyone else without completing both consent steps. This is enforced at the Worker level — it is not a UI gate that can be bypassed.
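
A minimal sketch of what a server-side consent gate of this kind could look like, in TypeScript. Only the consent: true parameter comes from the description above. The CloneRequest shape, the checkCloneConsent name, the 403 status, and the consent_required reason string are assumptions made for illustration, not Puente's actual Worker code.

```typescript
// Hypothetical request shape. Only `consent` is named in the article;
// `sampleId` is an assumed field included for illustration.
interface CloneRequest {
  consent?: unknown;
  sampleId?: string;
}

interface GateResult {
  ok: boolean;
  status: number; // assumed HTTP-style status code
  reason?: string;
}

// Accept only the boolean literal `true`. A missing field, the string
// "true", or any other truthy value is rejected, so a client cannot
// pass the gate by accident or by loose type coercion.
function checkCloneConsent(params: CloneRequest): GateResult {
  if (params.consent !== true) {
    return { ok: false, status: 403, reason: "consent_required" };
  }
  return { ok: true, status: 200 };
}
```

The strict comparison is the point of the sketch: because the check runs on the server side of the request, hiding or skipping the checkbox in the UI does not change the outcome.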

How It Works Technically

When Voice Passthrough is consented to and active:

  1. A lightweight voice sample is captured from the speaker’s first 10–15 seconds of natural speech in the session
  2. The sample is used to generate a voice synthesis model that captures the speaker’s key vocal characteristics: fundamental frequency range, formant distribution, and vocal energy envelope
  3. All subsequent translation output for that speaker is synthesized using this model rather than a pre-existing TTS voice
  4. The model is stored locally on-device only — never transmitted

If the voice sample is insufficient (too short, too noisy), or if synthesis times out, the system falls back automatically to Auto Voice Matching for that translation turn. The fallback is seamless — no notification appears, and translation output is never blocked.
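
The fallback behavior can be sketched roughly as follows, in TypeScript. This is an illustration under stated assumptions, not the shipped implementation: the function and type names, the signal-to-noise threshold, and the timeout value are all hypothetical, and the minimum sample length is simply taken from the lower end of the 10–15 second range mentioned above.

```typescript
type VoiceSource = "passthrough" | "auto-voice-matching";

interface SampleInfo {
  seconds: number; // length of captured speech
  snrDb: number;   // assumed noise measure (signal-to-noise ratio, dB)
}

interface SynthesisResult {
  source: VoiceSource;
  audio: Uint8Array;
}

type Synthesizer = (text: string) => Promise<Uint8Array>;

// Illustrative thresholds; the real values are not given in the article.
const MIN_SAMPLE_SECONDS = 10;
const MIN_SNR_DB = 15;
const SYNTHESIS_TIMEOUT_MS = 2000;

// Reject if `work` does not settle within `ms`; always clears the timer.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("synthesis timed out")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try the cloned voice first; on an unusable sample, an error, or a
// timeout, fall back to the matched voice for this turn only.
async function synthesizeTurn(
  text: string,
  sample: SampleInfo,
  cloneVoice: Synthesizer,
  matchedVoice: Synthesizer,
  timeoutMs: number = SYNTHESIS_TIMEOUT_MS,
): Promise<SynthesisResult> {
  const usable = sample.seconds >= MIN_SAMPLE_SECONDS && sample.snrDb >= MIN_SNR_DB;
  if (usable) {
    try {
      const audio = await withTimeout(cloneVoice(text), timeoutMs);
      return { source: "passthrough", audio };
    } catch {
      // Swallow the failure: the caller still gets audio from the fallback.
    }
  }
  return { source: "auto-voice-matching", audio: await matchedVoice(text) };
}
```

Because the fallback is decided per call, a failed turn does not disable Voice Passthrough for the rest of the session; the next turn tries the cloned voice again.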

Voice Passthrough vs. Auto Voice Matching

|                  | Auto Voice Matching                   | Voice Passthrough                  |
|------------------|---------------------------------------|------------------------------------|
| Source           | Pre-existing TTS voice library        | Speaker’s own voice                |
| Consent required | No                                    | Yes (two-step)                     |
| Setup time       | None (first 3–5 seconds of speech)    | ~10–15 seconds for initial sample  |
| Accuracy         | Closest available match               | Near-exact speaker match           |
| Fallback         | Lower-confidence voice selection      | Auto Voice Matching                |
| Best for         | All sessions by default               | Long sessions, known relationships |

Privacy

The voice model generated by Voice Passthrough is stored exclusively on the user’s device. It is not transmitted to any server, not used for any purpose outside of Puente’s translation output, and not retained after the user clears it. Voice models can be deleted in Settings → Privacy → Clear Voice Models.

Voice Passthrough is available with Pro.