Why Indian doctors need an AI scribe that speaks their language

Walk into any clinic in Chennai, Coimbatore, or Madurai and listen. The doctor asks the patient, in Tamil, what's wrong. The patient describes chest pain using a mix of Tamil words and English medical terms they picked up from Google. The doctor responds with a diagnosis peppered with English drug names, dosage instructions in Tamil, and follow-up timelines that switch between both languages mid-sentence. This is how Indian medicine actually sounds. And not a single global medical scribe product can handle it.
The code-mixing problem no one talks about
When Indian doctors speak to patients, they don't pick one language and stick with it. They code-mix - weaving Tamil, Hindi, or Telugu with English medical terminology in a single breath. A cardiologist in Chennai might say: "Ungaluku mild hypertension irukku, so we'll start you on Amlodipine 5mg, daily morning food-ku appuram edukkanum." ("You have mild hypertension, so we'll start you on Amlodipine 5mg; take it daily, in the morning, after food.") That's Tamil structure, English diagnosis, English drug name, and Tamil instruction - all in one sentence. OpenAI Whisper and Google Speech-to-Text treat this as a monolingual stream. They pick one language model and force the entire utterance through it. The result is garbled medical terminology, lost dosage information, and clinical notes that no doctor would trust.
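You can see the single-language assumption directly. Here's a minimal sketch using the open-source openai-whisper package; the file name is a stand-in for any code-mixed recording.

```python
# Minimal sketch with the open-source openai-whisper package.
# "consultation.wav" is a stand-in for a code-mixed Tamil/English recording.
import whisper

model = whisper.load_model("small")
result = model.transcribe("consultation.wav")

# With no language argument, Whisper detects ONE language (from the first
# 30 seconds of audio) and decodes the entire file against it. A 60/40
# Tamil/English consultation gets forced down a single decoding path.
print(result["language"])  # e.g. "ta" - applied to the whole file
print(result["text"])
```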
Why Western speech-to-text fails for Indian consultations
- Accent mismatch: Models trained on American and British English consistently misrecognize Indian English pronunciation of drug names. Metformin becomes "met for men." Atorvastatin becomes unrecognizable noise.
- No code-switching support: Whisper's language detection picks one language per segment. When a doctor switches from Tamil to English for a drug name and back to Tamil for instructions, the model either drops the English or drops the Tamil.
- Medical vocabulary gaps: Generic STT models have no training on Indian pharmaceutical brand names (Dolo-650, Crocin, Pantop), regional disease terminology, or the specific way Indian doctors abbreviate clinical terms (a toy lexicon fix is sketched after this list).
- Context-free transcription: Without knowing the conversation is medical, these models can't distinguish "sugar" the food from "sugar" the disease - the colloquialism for diabetes used by patients across India.
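To make the vocabulary-gap point concrete, here is a toy correction pass - purely illustrative, not how any of these engines (or Larinova) work internally - that snaps misheard tokens back to a known drug lexicon using only Python's standard library.

```python
# Toy post-correction pass - purely illustrative, not how any production
# engine works. DRUG_LEXICON and the misheard inputs come from this article.
from difflib import get_close_matches

DRUG_LEXICON = ["Metformin", "Atorvastatin", "Amlodipine",
                "Dolo-650", "Crocin", "Pantop"]
_BY_KEY = {d.lower(): d for d in DRUG_LEXICON}

def snap_to_lexicon(phrase: str, cutoff: float = 0.75) -> str:
    """Snap a misheard phrase to the closest known drug name, if any."""
    key = phrase.replace(" ", "").lower()
    hit = get_close_matches(key, list(_BY_KEY), n=1, cutoff=cutoff)
    return _BY_KEY[hit[0]] if hit else phrase

print(snap_to_lexicon("met for men"))    # -> Metformin
print(snap_to_lexicon("am lo di pine"))  # -> Amlodipine
```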
What Sarvam AI's approach solves
Sarvam AI built Saaras - a speech-to-text engine designed from the ground up for Indian languages. Rather than bolting multilingual support onto an English-first model, Saaras treats code-mixing as a first-class input type. It was trained on real Indian speech: customer service calls, educational lectures, casual conversations - domains where technical English terminology appears inside vernacular sentence frames. When we built Larinova on top of Saaras, we added a medical vocabulary layer. This means the STT engine knows that when a Tamil-speaking doctor says a word that sounds like English, it's probably a drug name, diagnosis, or procedure - and it should be transcribed with clinical precision, not forced through a Tamil phonetic model.
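For a feel of the pipeline, here's roughly what the call into Saaras looks like. The endpoint path, header name, model string, and response key below are assumptions based on Sarvam's public REST docs at the time of writing and may have changed - check the current API reference before relying on any of it.

```python
# Rough sketch of calling Sarvam's hosted Saaras model with requests.
# Endpoint path, header name, model string, and response key are
# ASSUMPTIONS from Sarvam's public docs at the time of writing.
import requests

SAARAS_URL = "https://api.sarvam.ai/speech-to-text-translate"  # assumption

def transcribe_consultation(wav_path: str, api_key: str) -> str:
    with open(wav_path, "rb") as f:
        resp = requests.post(
            SAARAS_URL,
            headers={"api-subscription-key": api_key},  # assumption
            files={"file": (wav_path, f, "audio/wav")},
            data={"model": "saaras:v1"},  # assumption
        )
    resp.raise_for_status()
    # Larinova's medical vocabulary layer runs over this transcript next
    # (the toy snap_to_lexicon() pass above gives the flavor).
    return resp.json().get("transcript", "")
```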
We ran the same consultation through three engines
During our early testing, we recorded a 6-minute consultation between a cardiologist in T. Nagar and a 54-year-old patient with suspected hypertension. The conversation was roughly 60% Tamil, 40% English, with drug names and vitals in English. We ran the same audio through OpenAI Whisper, Google Cloud Speech-to-Text, and Sarvam Saaras. Whisper transcribed "Amlodipine" as "am lo di pine" and missed the dosage entirely. Google STT captured the English segments reasonably but dropped most of the Tamil connective tissue between clinical terms, producing a transcript that read like a keyword list. Saaras produced a coherent, readable transcript where the Tamil sentence structure was preserved and the English medical terms were correctly identified and spelled. The difference wasn't subtle. One output was usable for SOAP note generation. The other two weren't.
The gap is not accuracy percentages on a benchmark. It's whether a doctor looks at the transcript and says 'yes, that's what I said' or throws it away and writes notes by hand.
What this means for doctors on the ground
The average Indian doctor sees 30-50 patients per day. Many see more. Documentation takes 15-20 minutes per patient when done manually. That's hours of writing after the clinic closes - or worse, notes that never get written at all. When transcription works in the language the doctor actually speaks, documentation stops being the bottleneck. You talk to your patient exactly as you normally would. Larinova listens, understands the code-mixed speech, extracts the clinical information, and produces structured SOAP notes. No behavior change required. No speaking slowly. No switching to English for the AI's benefit.
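To make "structured SOAP notes" concrete: the output is organized into the four standard sections - Subjective, Objective, Assessment, Plan. The schema below is an illustrative sketch, not Larinova's actual data model, and the sample values echo the hypertension consultation described above.

```python
# Illustrative shape for a structured SOAP note - a sketch, not
# Larinova's actual data model; every field name is an assumption.
from dataclasses import dataclass, field

@dataclass
class SOAPNote:
    subjective: str                 # what the patient reports
    objective: str                  # vitals, exam findings
    assessment: str                 # working diagnosis
    plan: str                       # treatment, dosage, follow-up
    medications: list[str] = field(default_factory=list)

note = SOAPNote(
    subjective="Chest discomfort for two weeks, worse on exertion.",
    objective="BP 150/95 mmHg, HR 82 bpm.",
    assessment="Suspected essential hypertension.",
    plan="Start Amlodipine 5mg daily, morning after food; review in 2 weeks.",
    medications=["Amlodipine 5mg OD"],
)
```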
Tip
Larinova currently supports Tamil+English and Hindi+English code-mixed consultations. Telugu, Kannada, Malayalam, and Bengali are launching over the next two quarters.

