What Voice Passthrough Does
Auto Voice Matching selects the closest pre-existing TTS voice in the target language to match the speaker’s characteristics. Voice Passthrough does something different: it uses the speaker’s actual voice as the source for TTS synthesis, creating translated audio that sounds like the same person speaking the target language.
The practical difference is significant wherever the listener already knows the speaker’s voice. A patient who has been speaking with a physician for 20 minutes recognizes the physician’s voice. If the physician’s translated words arrive in a clearly generic TTS voice, the connection between the person the patient knows and the words they’re hearing is broken. Voice Passthrough closes that gap.
The Consent Requirement
Voice cloning has real-world implications: a cloned voice can produce audio that sounds like someone saying something they never said. Puente treats this seriously.
Voice Passthrough requires explicit two-step consent before activation:
- Consent checkbox: the user reads and checks a box whose text explains what voice cloning does, what data is used, and that the cloned voice model is stored only on-device
- “I Agree” tap: a separate confirmation button that must be pressed after the checkbox is checked
The system hard-rejects any clone request that does not include consent: true in the request parameters. There is no way to enable Voice Passthrough for yourself or anyone else without completing both consent steps. This is enforced at the Worker level — it is not a UI gate that can be bypassed.
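The server-side check described above could look something like the following minimal sketch. Only the `consent: true` parameter comes from the description. The request shape, the function name `checkCloneConsent`, the 403 status, and the error message are assumptions made for illustration, not Puente’s actual Worker code.

```typescript
// Hypothetical request body. Only `consent` is taken from the text above;
// `speakerId` is an illustrative placeholder.
interface CloneRequest {
  consent?: unknown;
  speakerId?: string;
}

type ConsentCheck =
  | { ok: true }
  | { ok: false; status: 403; reason: string };

// Accept only the literal boolean `true`. Truthy stand-ins such as the
// string "true" or the number 1 are rejected, as is a missing field.
function checkCloneConsent(body: CloneRequest | null | undefined): ConsentCheck {
  if (body?.consent === true) {
    return { ok: true };
  }
  // The 403 status is an assumption; the source only says the request is rejected.
  return {
    ok: false,
    status: 403,
    reason: "consent: true is required for voice cloning",
  };
}
```

Checking for strict equality with `true` rather than truthiness means a client cannot pass the gate with a malformed or loosely typed value, which fits the idea of a hard rejection on the server rather than a UI-only gate.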
How It Works Technically
When Voice Passthrough is consented to and active:
- A lightweight voice sample is captured from the speaker’s first 10–15 seconds of natural speech in the session
- The sample is used to generate a voice synthesis model that captures the speaker’s key vocal characteristics: fundamental frequency range, formant distribution, and vocal energy envelope
- All subsequent translation output for that speaker is synthesized using this model rather than a pre-existing TTS voice
- The model is stored locally on-device only — never transmitted
If the voice sample is insufficient (too short, too noisy), or if synthesis times out, the system falls back automatically to Auto Voice Matching for that translation turn. The fallback is seamless — no notification appears, and translation output is never blocked.
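The per-turn fallback can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: the `Synthesizer` signature, the `sampleUsable` flag, the string stand-in for audio, and the 2000 ms default timeout are all hypothetical.

```typescript
type VoiceSource = "passthrough" | "auto-voice-matching";

interface TurnResult {
  source: VoiceSource;
  audio: string; // stand-in for synthesized audio data
}

// Hypothetical synthesizer signature: translated text in, audio out.
type Synthesizer = (text: string) => Promise<string>;

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("synthesis timed out")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try the cloned voice first; on an unusable sample, an error, or a timeout,
// fall back to Auto Voice Matching for this turn only.
async function synthesizeTurn(
  text: string,
  sampleUsable: boolean, // false when the sample is too short or too noisy
  passthrough: Synthesizer,
  autoMatch: Synthesizer,
  timeoutMs = 2000, // assumed budget; the source does not give a value
): Promise<TurnResult> {
  if (sampleUsable) {
    try {
      const audio = await withTimeout(passthrough(text), timeoutMs);
      return { source: "passthrough", audio };
    } catch {
      // Swallow the failure: the fallback is silent and never blocks output.
    }
  }
  return { source: "auto-voice-matching", audio: await autoMatch(text) };
}
```

Because the fallback decision is made inside `synthesizeTurn` on every call, a single slow or failed turn does not disable Voice Passthrough for the rest of the session.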
Voice Passthrough vs. Auto Voice Matching
| | Auto Voice Matching | Voice Passthrough |
|---|---|---|
| Source | Pre-existing TTS voice library | Speaker’s own voice |
| Consent required | No | Yes (two-step) |
| Setup time | None (matches within the first 3–5 seconds of speech) | ~10–15 seconds for initial sample |
| Accuracy | Closest available match | Near-exact speaker match |
| Fallback | Lower-confidence voice selection | Auto Voice Matching |
| Best for | All sessions by default | Long sessions, known relationships |
Privacy
The voice model generated by Voice Passthrough is stored exclusively on the user’s device. It is not transmitted to any server, not used for any purpose outside of Puente’s translation output, and not retained after the user clears it. Voice models can be deleted in Settings → Privacy → Clear Voice Models.