“What I’m trying to do is trying to model me. How I think. The way I generate similes, I teach that to a machine. The machine is now even more powerful, because it knows more words than me,” says Madhan Karky, who calls himself a “lyrics engineer”. We are at his office in Besant Nagar, Chennai, and Karky is describing one of his experiments: a simile generator that throws up, on the fly, figures of speech that make comparisons more vivid.
The son of Tamil poet and lyricist Vairamuthu, Karky has spent a decade using technology to write movie dialogues and lyrics. Even if you haven’t heard of him, you have perhaps come across his work in two blockbuster movies from the south. He created a new language called Kiliki for the movie Baahubali and was a dialogue writer and lyricist for the 2010 science-fiction film Enthiran (aka Robot).
A simile generator presents a unique set of challenges: can a machine understand and appreciate a good simile? Can it generate similes that no one has thought of before? He says he was able to accomplish this feat using a model based on graph theory. “We were able to generate thousands of them in a second. We had a lot of fun with that,” he says. Examples: Rail pondra neenda koondal (her hair, long as a railway train) or Thiruvizha pola sandosham kuduka ural (a relationship that gives happiness like a festival).
Graph theory is a field of mathematics, and refers to the study of graphs. As the Wikipedia page notes, graph-theoretic methods are used in a lot of STEM fields, and are particularly useful in linguistics and NLP (natural language processing).
Building software that can generate original Tamil similes is hard work, though. Karky says he had to feed the program tens of thousands of Tamil song lyrics so that it would generate only similes that others had never thought of before. Such experiments also need basic building blocks: tools like a dictionary, a compound word splitter, and a knowledge base of places. In computing terminology, a knowledge base is a set of facts, rules, exceptions, and assumptions that guide a program while it solves a problem.
“We’re trying to explore a lot of knowledge bases. So we wanted to feed knowledge to the machine, and figure out how to make it gather and appreciate knowledge, much like how we try to learn,” Karky says. “We’re developing algorithms and methodologies to feed in knowledge, and get the machine to ask questions to people, and try to learn and update its memory.”
To that end, Karky and team started building a knowledge base of places: continents, countries, states, capitals, rivers, mountains, places of interest etc. “Our lyric machine can now take this knowledge and generate new similes, based on countries, the food available in a country, taste of a food…” he says.
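The article does not describe Karky's graph model in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: a knowledge base is stored as a graph linking entities to properties, candidate comparisons are read off the graph, and any comparison already seen in a lyrics corpus is filtered out. The data in `EDGES` and `SEEN` and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: (entity, property) edges of a bipartite graph.
EDGES = [
    ("train", "long"),
    ("river", "long"),
    ("festival", "joyful"),
    ("monsoon", "joyful"),
]

# Hypothetical (property, entity) comparisons already found in a lyrics corpus.
SEEN = {("long", "river")}


def build_index(edges):
    """Map each property node to the entity nodes linked to it."""
    index = defaultdict(list)
    for entity, prop in edges:
        index[prop].append(entity)
    return index


def novel_similes(prop, index, seen):
    """Return 'prop as a entity' strings for pairs not already in the corpus."""
    return [
        f"{prop} as a {entity}"
        for entity in index.get(prop, [])
        if (prop, entity) not in seen
    ]


index = build_index(EDGES)
print(novel_similes("long", index, SEEN))
```

Here "long as a river" is dropped because the corpus already contains it, which mirrors the stated goal of producing only comparisons that have not been used before.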
Building the building blocks
“We have built a lot of other tools which are necessary for AI and to enable people to learn Tamil,” Karky says, describing tools built in-house by Karky Research Foundation. The non-profit, which he heads, is mostly self-funded and has been active since 2013. Sixteen people currently work at the foundation, which focuses primarily on language computing and language literacy and has published free language-focused tools and applications online. These include Chol, an online Tamil-English-Tamil dictionary; Piripori, a morphological analyser and compound word splitter for Tamil; Olingoa, a tool for transliteration; and Aadugalam, a portal for word games.
“We have around 3.5 lakh words in the dictionary. And it’s growing every day,” Karky says about Chol (pronounced “chollu” and translating to “say” in English). “I call this as a foundation for all the other tools that we are developing,” he says. The dictionary also ranks each word based on its popularity and how pleasant it sounds. The pleasantness score is based on where in the vocal tract the sound originates. “Many songs of mine where I have used these techniques have been huge hits. More than that, it saved me a lot of time,” he says. Some of his top hits are Google Google, Irumbile Oru Idhaiyam (a heart, encased in iron), Selfie Pulla (Selfie kid), and Megha Ragame (musical clouds), a song made entirely of palindromes.
In several of his songs, he has used Emoni, a Tamil rhyme finder, Karky says. “I’ll be stuck with a rhyme, when I want a new word,” he says. “In one song in a movie called 180, there was a song where the line which I actually wrote was ‘Sindhikaadhu Sindhidum, Megamai Peiyappogiraen’ (I’m going to rain inconceivable thoughts like a cloud),” he says. The music director was unhappy, as he didn’t want Megam (meaning cloud) there. He wanted a word with a short vowel. “I used Chol to find the equivalent words and then I was able to find this word – Kondal (meaning cloud),” he says.
If you have a Kindle, you can see English translations of Tamil words thanks to the foundation’s efforts. This feature was introduced in March 2017. “For Amazon’s Kindle device, we provide three dictionaries – Tamil to English, English to Tamil, and Tamil to Tamil,” he says.
Minding the language gap
Karky, who has a Ph.D. in information technology from the University of Queensland, Australia, has been working on NLP (natural language processing) problems since his college days. His obsession continued when he started work as an assistant professor at Anna University College of Engineering, Guindy, where he worked on language computing projects.
Now 38, he says that the foundation is building the tools for the future with an aim to enable Indian language support for voice recognition and voice generation. “I use Alexa, but I can only do it in English. My son actually changes his accent so that Siri can understand and deliver. There’s going to be a generation using these devices, they’re just going to be focused on English,” Karky says.
Karky Research Foundation has a ten-year plan: most of the tools it is working on are meant to prepare Indian languages for a voice-driven computing future. “When typing becomes obsolete, we want all our Indian languages to be really good with speech recognition,” he says.
As someone acquainted with the nuances of language, he says that contemporary translation tools like Google Translate work only 15% of the time in Tamil. He attributes it to Tamil’s morphological richness. “In English, a noun can take maximum of five to six different forms – like adding ‘s’ or ‘es’ or ‘ing’ to it. In Tamil, for a noun, we have 280 variations. A single verb can transform into more than thousand different variations. Capturing these nuances and understanding what the verb actually means is a very tough task for a machine,” he says.
He shared a few examples of how Google search fails at this. “Type a word like ‘Marangalkulle’ (மரங்கள்குள்ளே, which means ‘inside a tree’ in English), it will only show you results where the exact word is there. It won’t show you results with ‘Maram’ (tree) in it. Or the reverse. Google doesn’t know the difference between a keyword and a variation of it,” he says.
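One common way to make search match word variations is to reduce both the query and the indexed words to a shared root before comparing them. The sketch below illustrates that idea only; it is not how Google or the foundation's tools work. The two suffix rules operate on romanised text and are hypothetical, heavily simplified stand-ins for real Tamil morphology, and the sample documents are illustrative.

```python
# Hypothetical, heavily simplified suffix rules on romanised Tamil:
# each entry is (suffix to strip, text to put in its place).
SUFFIX_RULES = [
    ("kulle", ""),   # "inside"
    ("ngal", "m"),   # plural of a noun ending in -m, e.g. maram -> marangal
]


def normalise(word):
    """Strip known suffixes repeatedly until no rule applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in SUFFIX_RULES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)] + replacement
                changed = True
    return word


def search(query, documents):
    """Return documents containing any word that shares the query's root."""
    root = normalise(query)
    return [
        doc for doc in documents
        if any(normalise(w) == root for w in doc.split())
    ]


# Illustrative documents only.
docs = ["maram periyadhu", "marangalkulle paravai", "veedu azhagu"]
print(search("Marangalkulle", docs))
print(search("maram", docs))
```

Both queries return the same two documents, because "marangalkulle" and "maram" normalise to the same root. A real analyser would need far more rules to cover the hundreds of noun forms Karky describes.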
Tamil, a language spoken by some 80 million people in the world and dating back to the 5th century BCE, has a lot of compound words where you can merge two words together and form a new word, he says. For example, “Mor (buttermilk) and kuzambu (broth), you can merge it (Morkuzambu). That’s a totally new word that is not in the dictionary. Google won’t know what is Morkuzambu,” he says. True to his claim, we found that Google Translate failed to translate this term and simply transliterated it instead. This wasn’t a one-off instance. Google Translate also failed to translate பொற்குடம் (Porkudam or gold pot), transliterating it instead.
Karky Research Foundation’s Piripori tool is precisely for this purpose – it splits compound words to find the root words and can now handle nearly 350 million variations of Tamil words. This can improve search and translation efforts, he says.
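Piripori's internals are not described here, so what follows is just a minimal sketch of dictionary-based compound splitting, the simplest version of the idea: try each split point and accept the first one where both halves are known root words. The `LEXICON` and function names are hypothetical, and the sketch ignores the sound changes that occur when Tamil words join (as in Porkudam), which a real splitter has to handle.

```python
# Hypothetical mini-lexicon of romanised root words and their glosses.
LEXICON = {"mor": "buttermilk", "kuzambu": "broth", "maram": "tree"}


def split_compound(word, lexicon):
    """Return (left, right) if the word is two known roots joined, else None."""
    word = word.lower()
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


def gloss(word, lexicon):
    """Gloss a word directly, or via its parts if it is an unlisted compound."""
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    parts = split_compound(word, lexicon)
    if parts is None:
        return None
    return " ".join(lexicon[p] for p in parts)


print(split_compound("Morkuzambu", LEXICON))
print(gloss("Morkuzambu", LEXICON))
```

With the split available, a dictionary lookup that would fail on the unlisted compound can fall back to glossing its parts instead of transliterating the whole word.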
“Most of these companies work on statistical methods – that’s not something I’m a big fan of. Because we can’t predict how the system is going to behave. If there are correlations between two documents, it translates incorrectly,” Karky says. A single word in Tamil can have multiple meanings based on the context. “The word padi, in Tamil, for example has around 25-30 meanings. It means step, it means read, it’s also a measure. It depends on where it is occurring, what are the morphological variations. These are areas where a statistical machine translation system can trip up,” Karky says. “I prefer it to be rule-based. The system should be 70-80% rule based and 20% can be statistical. That’s where I am trying to move it towards,” he says.
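A rule-first system with a statistical fallback, of the kind Karky says he prefers, could be sketched roughly as follows. This is an illustration of the design choice, not his system: the context tags, the rules, and the sense counts are all hypothetical, and real rules would depend on Tamil morphology and position in the sentence.

```python
from collections import Counter

# Hypothetical hand-written rules: a context tag maps directly to a sense of "padi".
RULES = {
    "after_numeral": "measure",
    "imperative": "read",
}

# Hypothetical sense counts from a tagged corpus, used only as a fallback.
SENSE_COUNTS = Counter({"step": 50, "read": 30, "measure": 20})


def sense_of_padi(context_tag):
    """Apply a rule when one matches; otherwise use the most frequent sense."""
    if context_tag in RULES:
        return RULES[context_tag], "rule"
    return SENSE_COUNTS.most_common(1)[0][0], "statistical"


print(sense_of_padi("after_numeral"))
print(sense_of_padi("unknown"))
```

The second element of each result records which path produced the answer, which is one way to keep the behaviour predictable: rule-based decisions can be traced to a specific rule, while statistical guesses are clearly marked as such.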
Why haven’t the tech giants been able to bridge the Indian language gap yet? Devadath V V, Computational Linguist, Dheeyantra Research Labs, a Bengaluru-based chatbot startup, believes the prism of English or Indo-European languages is what is tripping up the giants. “They are not able to find how Indian languages are built. These are morphologically very rich languages. If we look at it from the perspective of linguistics, Indian languages do not fit into their theories most of the time. They usually go for machine learning-based systems, where they don’t really give much care to the linguistics part. They get a lot of data from various sources, it’s easy for them to scale up,” he says. Most Indian universities aren’t doing much work on Indian languages either, he adds. “As far as I know, only IIIT Hyderabad and IIT Bombay are trying to make resources for Indian languages.”
Karky says that the foundation plans to create language tools for south Indian languages first and then move on to north Indian languages. “Our algorithms are language independent. We are starting to build the Chol dictionary in Telugu and have created around 15,000 words, with examples and proper definitions for it,” he says.
A proper Tamil translation tool is still two to three years away, says Karky. “We’re not even thinking about translation. With whatever tools we have, if we start translation now, it’s going to be half-baked,” he says.
One problem area that the Karky Research Foundation is trying to solve is named entity recognition: being able to identify which words in a sentence are names. “When you say ‘Amaichar Ponnmodi anga vandhaar’ (Minister Ponnmodi went there), I don’t want to translate Ponnmodi… my machine should not translate it as ‘Minister Goldhair’,” he says.
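The failure Karky describes can be shown with a toy example. The sketch below assumes one very simple, hypothetical heuristic, namely that the word following a title such as "amaichar" is a name and should be left untranslated. Real named entity recognition uses much richer evidence; the `TITLES` and `LITERAL` tables and the function name are illustrative only.

```python
# Hypothetical title words that signal the next token is a person's name.
TITLES = {"amaichar"}  # "minister"

# Hypothetical word-for-word glosses, including the unwanted literal
# reading of the name.
LITERAL = {"amaichar": "Minister", "ponnmodi": "Goldhair"}


def translate_tokens(tokens, use_ner):
    """Gloss tokens word by word, optionally keeping names after a title as-is."""
    out = []
    prev_was_title = False
    for tok in tokens:
        key = tok.lower()
        if use_ner and prev_was_title:
            out.append(tok)  # treated as a name: copy through unchanged
        else:
            out.append(LITERAL.get(key, tok))
        prev_was_title = key in TITLES
    return out


phrase = ["Amaichar", "Ponnmodi"]
print(translate_tokens(phrase, use_ner=False))
print(translate_tokens(phrase, use_ner=True))
```

Without the name check the phrase comes out as "Minister Goldhair"; with it, the name is passed through untouched.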
Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures and Vijay Shekhar Sharma among its investors. Accel Partners is an early investor in Flipkart. Vijay Shekhar Sharma is the founder of Paytm. None of FactorDaily’s investors have any influence on its reporting about India’s technology and startup ecosystem.