Puente's 6 Conversation Modes: Which One to Use and When

Overview: Why Mode Selection Matters

Puente’s six conversation modes are not aesthetic preferences — they are distinct audio routing and interaction architectures designed for fundamentally different physical and social contexts. Choosing the wrong mode for your setting doesn’t just feel awkward; it can reduce translation quality, slow the interaction down, or require participants to behave unnaturally.

The right mode depends on three variables: how many people are involved, what hardware is available, and whether the parties are in the same physical space. This reference covers each mode in full, including its technical basis, hardware requirements, optimal conditions, and known limitations.
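As a rough illustration of how those three variables narrow the choice, here is a minimal sketch of a selection helper. The `Setting` fields, the `suggest_mode` function, and the ordering of the checks are assumptions made for this example; they are not part of Puente and simply restate the guidance in this reference.

```python
from dataclasses import dataclass


@dataclass
class Setting:
    """Hypothetical description of a conversation setting."""
    parties: int                     # how many people are involved
    same_room: bool                  # are the parties physically together?
    has_earbuds: bool = False        # one pair of stereo earbuds available
    has_smart_glasses: bool = False  # compatible glasses connected
    needs_privacy: bool = False      # translation must not play aloud
    structured: bool = False         # seated, turn-based interaction


def suggest_mode(s: Setting) -> str:
    """Return a mode name based on party count, hardware, and location."""
    if s.parties > 8:
        return "Mesh Rooms"          # beyond the 8-person Group limit
    if s.parties > 2:
        return "Group"
    if not s.same_room:
        return "Remote"
    if s.has_smart_glasses:
        return "Smart Glasses"
    if s.needs_privacy and s.has_earbuds:
        return "Earbud"
    if s.structured:
        return "Tabletop"
    return "Auto-detect"             # the default mode
```

For example, `suggest_mode(Setting(parties=2, same_room=True, structured=True))` returns `"Tabletop"`.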

Mode 1: Auto-Detect

Hardware required: iPhone only
Best for: Casual one-on-one conversations, first-time users, any setting where a fully automated experience is preferred

Auto-detect is Puente’s default mode and the lowest-friction entry point. It monitors the audio input stream for silence thresholds — specifically, a pause of configurable duration (default: 0.8 seconds) — and uses that silence as the signal to trigger translation. When one party stops speaking, Puente translates what was said and delivers the output. When the other party begins speaking, Puente detects the new voice onset and begins capturing the next segment.
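The silence-threshold idea can be sketched in a few lines. This is not Puente's implementation: the per-frame loudness values, the 0.1-second frame length, and the 0.05 loudness threshold are assumptions for illustration. Only the 0.8-second default pause comes from the description above.

```python
def segment_by_silence(levels, frame_s=0.1, silence_s=0.8, threshold=0.05):
    """Split a stream of per-frame loudness values into speech segments.

    A segment closes once the input stays below `threshold` for at least
    `silence_s` seconds. Returns (start, end) frame indices, end exclusive.
    """
    needed = round(silence_s / frame_s)  # quiet frames that end a segment
    segments = []
    start = None        # first frame of the segment being captured
    last_voiced = None  # most recent frame at or above the threshold
    quiet = 0           # consecutive quiet frames seen so far

    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i            # voice onset: begin a new segment
            last_voiced = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= needed:      # pause long enough: close the segment
                segments.append((start, last_voiced + 1))
                start = None
                quiet = 0

    if start is not None:            # stream ended while someone was speaking
        segments.append((start, last_voiced + 1))
    return segments
```

Because this sketch looks only at loudness, steady background noise above the threshold would never register as a pause, so the segment would never close.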

This mode does not require the user to press any button, tap any control, or interact with the screen at all during a conversation. It is the mode most likely to feel “invisible” to both parties, which is valuable when the goal is minimizing the technological presence in a human interaction.

Limitations: Auto-detect can be confused by background noise that mimics the amplitude profile of speech — a loud restaurant, a busy construction site, or a room with a running air conditioner. In those environments, switching to Tabletop mode (which has dedicated input buttons) or using a lapel microphone to isolate voice from ambient noise is preferable.

Mode 2: Tabletop

Hardware required: iPhone only
Best for: Structured professional interactions, medical consults, intake interviews, any setting where a phone on a desk is natural

Tabletop mode is designed for interactions where both parties are seated across from each other with the iPhone flat on a surface between them. Each party speaks toward the phone in turn; on-screen controls let either party initiate their speaking segment by tapping a large, clearly labeled button. The translated output plays from the phone’s speaker toward the other party.

The deliberate, turn-based structure of Tabletop mode has a practical advantage in professional settings: it creates natural pauses that both parties understand as “your turn to speak.” In clinical encounters, legal intakes, and HR interviews, this rhythm is familiar — it mirrors how interpreters work — and participants adapt to it quickly.

Tabletop mode also pairs cleanly with lapel microphones. With a Røde Wireless GO II or Hollyland Lark M2 connected, each party’s microphone feeds a dedicated channel, eliminating bleed between the two voices and improving transcription accuracy in acoustically challenging rooms.

Limitations: Tabletop mode requires manual turn-taking (button taps). For conversations with a more natural, flowing rhythm (two friends, a parent and child), Auto-detect typically feels less intrusive.

Mode 3: Earbud

Hardware required: One pair of stereo earbuds (Bluetooth or wired)
Best for: Private conversations, clinical exam rooms, legal consultations, any scenario requiring confidentiality

Earbud mode splits the translation output by language across the left and right audio channels. Party A’s language is routed to the left channel; Party B’s language is routed to the right. Each person wears one earbud from the same pair, the left bud for one party and the right bud for the other, and hears only the translation of what the other person said, in their own language. See the Earbud Share Mode setup guide.

This creates a fully bilateral, nearly simultaneous experience: each person speaks naturally, and moments later, hears the translation privately in their ear without the other party hearing it repeated aloud. There is no loudspeaker output in Earbud mode, which is significant for privacy.
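The left/right split described above amounts to writing each translated clip into only one channel of a stereo stream. The sketch below assumes mono audio as a list of float samples and stereo audio as a list of `(left, right)` pairs; the function name and data layout are hypothetical and chosen only to show the routing rule.

```python
def route_translation(mono, target_language, left_language, right_language):
    """Place a mono translated clip on the channel for its target language.

    Audio translated into `left_language` plays only in the left earbud,
    audio translated into `right_language` only in the right earbud.
    """
    if target_language == left_language:
        return [(sample, 0.0) for sample in mono]   # right channel silent
    if target_language == right_language:
        return [(0.0, sample) for sample in mono]   # left channel silent
    raise ValueError(f"no channel assigned to {target_language!r}")
```

With Spanish on the left and English on the right, a clip translated into English reaches only the right earbud, so the Spanish speaker never hears it.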

In a clinical exam room, this means a physician and patient can discuss symptoms, medication history, and diagnosis privately, without translated audio broadcasting to adjacent beds or hallways. In a legal consultation, an attorney and client speak privately with no translated content audible to anyone else in the room.

Earbud mode also activates Puente’s Auto Voice Matching feature, routing a translated voice that reflects the speaker’s original vocal characteristics — so each party hears a voice that sounds like the person they’re speaking with, not a generic text-to-speech output.

Limitations: Requires sharing earbuds, which some participants may be reluctant to do in clinical or formal settings. Single-use earbud covers are an easy solution for clinical deployments.

Mode 4: Smart Glasses

Hardware required: Compatible smart glasses connected via A2DP Bluetooth
Best for: Hands-free professional use, any scenario where holding a phone is impractical, high-mobility environments

Smart Glasses mode activates when Puente detects a compatible device connected via the A2DP Bluetooth audio profile. Detection is automatic; no manual mode selection is needed. Translated audio is routed to the glasses’ built-in speakers rather than the iPhone’s speaker or connected earbuds. See the Smart Glasses setup guide.

Compatible devices:

Ray-Ban Meta (all generations) — open-ear speakers with full translation audio routing
Xreal Air / Air 2 / Air 2 Pro / Air 2 Ultra — AR display + audio routing
Even Realities G1 — audio routing with optional caption display
ActiveLook Engo 2 — optimized for sport and mobility contexts

Smart Glasses mode is particularly valuable for professionals who need both hands free and cannot look at a screen — a surgeon conducting a pre-op consult, a foreman walking a site, a technician troubleshooting equipment while communicating with a non-English-speaking counterpart.

The voice distinction benefit is important in Smart Glasses mode: because translated audio plays through the glasses speaker near one ear, the listener’s perception of who is speaking stays clear even when both parties are contributing to a rapid exchange.

Limitations: Smart Glasses mode depends on the acoustic quality of the glasses’ built-in speakers, which vary by model. Ray-Ban Meta speakers produce clearly audible audio at normal conversational distances; more compact models like the Engo 2 are better suited to quiet environments.

Mode 5: Remote

Hardware required: Two iPhones, each with Puente installed; internet connection
Best for: Phone and video calls, telehealth appointments, remote client consultations, any interaction where parties are not in the same physical space

Remote mode enables real-time translation between two Puente users at any distance. One user starts a session, which generates a 6-digit connection code. The other user enters that code in their Puente app. The session begins immediately: no accounts, no logins, no link to share, and no app other than Puente required on either end. See the full Remote Mode guide for setup.
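The code-based pairing flow can be sketched as follows. How Puente actually generates and validates codes is not documented here, so the `SessionRegistry` class, its in-memory storage, and the two-party cap are assumptions for illustration only.

```python
import secrets


class SessionRegistry:
    """Hypothetical in-memory store pairing two users by a 6-digit code."""

    def __init__(self):
        self._sessions = {}  # code -> list of party names

    def start(self, host):
        """Create a session and return a fresh zero-padded 6-digit code."""
        while True:
            code = f"{secrets.randbelow(10**6):06d}"
            if code not in self._sessions:   # avoid reusing a live code
                break
        self._sessions[code] = [host]
        return code

    def join(self, code, guest):
        """Join by code. Fails for unknown codes or sessions already paired."""
        parties = self._sessions.get(code)
        if parties is None or len(parties) >= 2:
            return False
        parties.append(guest)
        return True
```

A host would call `start("clinic")`, read the code aloud or send it over the existing call, and the other party would pass it to `join`.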

Each party speaks in their language and hears the other’s words translated in real time, with latency comparable to a standard phone call. Remote mode is designed for scenarios where a video or voice call is happening simultaneously (via FaceTime, WhatsApp, Zoom, or any platform) and Puente runs alongside it, handling the translation layer independently.

Remote mode is particularly useful for telehealth. A physician on a video call with a patient who speaks a different language can run Puente in Remote mode without asking the patient to install anything complex — just Puente (free, no account needed) and a 6-digit code. The session is end-to-end with no audio stored on any server.

Limitations: Remote mode requires an active internet connection on both devices. It does not support the offline Whisper AI processing available in other modes.

Mode 6: Group

Hardware required: iPhone for the host; any device with Puente for participants
Best for: Team meetings, safety briefings, classroom instruction, multi-party community gatherings

Group mode supports up to 8 simultaneous participants, each speaking and receiving translation in their own language. Speaker diarization, the process of identifying and labeling who is speaking, tags each translated segment by speaker, so participants reading a transcript or following along can tell who said what. See the Group Mode guide.

The host’s iPhone acts as the session hub. Other participants join via a group code. The host can configure which language each participant is assigned, or participants can self-select. Translated output can be delivered as audio to each participant’s device or as on-screen text captions.
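To make the speaker tagging and per-participant delivery concrete, here is a minimal sketch. The `Segment` type, the `deliver` function, and the `translate` callable are placeholders invented for this example, and the bracketed speaker label is one possible caption format, not Puente's.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized utterance: who spoke, in which language, and what."""
    speaker: str
    language: str
    text: str


def deliver(segment, participants, translate):
    """Return a speaker-labeled caption for every listener.

    `participants` maps a participant name to their assigned language.
    `translate(text, source, target)` stands in for the translation step.
    """
    captions = {}
    for name, language in participants.items():
        if name == segment.speaker:
            continue                          # speakers don't receive their own words
        if language == segment.language:
            text = segment.text               # same language: no translation needed
        else:
            text = translate(segment.text, segment.language, language)
        captions[name] = f"[{segment.speaker}] {text}"
    return captions
```

The label in each caption is what lets a listener tell who said what, even when several people speak in quick succession.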

Group mode is the correct choice for:

Morning safety briefings with multilingual crews (Spanish, Portuguese, Vietnamese, English simultaneously)
IEP or parent-teacher meetings with multiple family members who speak different languages
Community health clinics serving multiple language populations in a waiting room or group education session
Disaster relief coordination where a field commander needs to brief responders speaking four different languages at once

Limitations: Group mode requires an internet connection for the multi-party session infrastructure. Each participant needs Puente installed on their device, though the free tier (5 translations per day) is sufficient for participation in a session hosted by a Pro user.

Mode Comparison Table

Mode	Hardware needed	Max parties	Internet required	Best setting
Auto-detect	iPhone only	2	No (offline supported)	Casual, spontaneous
Tabletop	iPhone only	2	No (offline supported)	Seated professional
Earbud	1 pair stereo earbuds	2	No (offline supported)	Private, confidential
Smart Glasses	Compatible glasses	2	No (offline supported)	Hands-free, mobile
Remote	2 iPhones	2	Yes	Remote/distance
Group	iPhone + participants	Up to 8	Yes	Team, multi-party
Mesh Rooms	iPhone (host) + any browser	Classroom scale	Yes	Broadcast, events

Offline support (via Whisper AI on-device) is available for English, Spanish, French, German, Portuguese, Italian, Japanese, and Mandarin in Auto-detect, Tabletop, Earbud, and Smart Glasses modes. See how offline mode works for details.
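The offline rule above can be expressed as a simple lookup. The sketch assumes that both languages in a conversation must be on the offline list for a session to run without internet; that pairing rule is an assumption, and the function name is hypothetical.

```python
OFFLINE_LANGUAGES = {
    "English", "Spanish", "French", "German",
    "Portuguese", "Italian", "Japanese", "Mandarin",
}
OFFLINE_MODES = {"Auto-detect", "Tabletop", "Earbud", "Smart Glasses"}


def can_run_offline(mode, language_a, language_b):
    """True when the mode and both languages are listed as offline-capable."""
    return mode in OFFLINE_MODES and {language_a, language_b} <= OFFLINE_LANGUAGES
```

For instance, `can_run_offline("Tabletop", "English", "Spanish")` is `True`, while any Remote or Group session returns `False` regardless of language.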

Mode 7: Mesh Rooms

Hardware required: iPhone (host); any browser-capable device (participants)
Best for: Classrooms, community meetings, live events, corporate briefings with mixed-language audiences

Mesh Rooms extend Puente’s reach beyond the 8-person Group mode limit to any size audience. The host creates a room, which generates a QR code pointing to puente.chat/r/[roomID]. Participants scan the QR code — no app install required — and immediately see live captions in their own language as the host speaks. Each participant selects their language when they join, and translation fan-out delivers captions simultaneously across all selected languages.
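One way to picture translation fan-out is to translate each caption once per selected language rather than once per viewer, then hand the result to everyone who chose that language. The sketch below assumes exactly that; the function name and the `translate` callable are placeholders, and the real room server may work differently.

```python
def fan_out_captions(text, source_language, viewers, translate):
    """Return a caption for every viewer, translating once per language.

    `viewers` maps a viewer id to the language chosen when joining.
    `translate(text, source, target)` stands in for the translation step.
    """
    # Group viewers by language so each language is translated only once.
    by_language = {}
    for viewer, language in viewers.items():
        by_language.setdefault(language, []).append(viewer)

    captions = {}
    for language, members in by_language.items():
        if language == source_language:
            caption = text                    # host's own language: pass through
        else:
            caption = translate(text, source_language, language)
        for viewer in members:
            captions[viewer] = caption
    return captions
```

Under this design the translation cost grows with the number of distinct languages in the room, not with the size of the audience.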

Rooms auto-expire after 4 hours. The host’s local translation runs on a completely separate pipeline — if the room server has any issue, the host’s real-time translation continues without interruption. See the full Mesh Rooms knowledge base article for setup details and architecture.
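The 4-hour expiry can be sketched as a timestamp comparison. The function names and the use of the room's creation time as the reference point are assumptions; the source states only that rooms auto-expire after 4 hours.

```python
from datetime import datetime, timedelta, timezone

ROOM_LIFETIME = timedelta(hours=4)


def is_expired(created_at, now):
    """True once a room has existed for the full lifetime."""
    return now - created_at >= ROOM_LIFETIME


def time_remaining(created_at, now):
    """Time left before expiry, never negative."""
    return max(ROOM_LIFETIME - (now - created_at), timedelta(0))
```

A host interface could use `time_remaining` to warn before a long event outlasts its room.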

Limitations: In the current version, Mesh Rooms are host-broadcast only — participants receive captions but cannot speak back through the room. Two-way participant speech is on the roadmap.
