A Tale of Two Languages: The Code-Mixing Story
by Arindam Chatterjee
“So long as they possessed symbols, ordering of the symbols, and meanings partially determined by those components in conjunction with context, they had language.”
-Daniel L. Everett
Language is arguably mankind's first invention, predating even the wheel, and it has been evolving ever since its inception. As human intellectual faculties advanced, the methods and modes of communication improved as well, driving the evolution of languages further. In their modern, evolved state, languages, especially in informal contexts, are frequently used in combination. This development is most evident on social media, blogging websites, and chat platforms, and has given rise to a global linguistic phenomenon called code-mixing or code-switching.
Code-mixing is a global phenomenon. Not only has it become a norm in multilingual countries, but it is also spreading across countries with one predominant language, like France and Germany, giving rise to new code-mixed languages such as Franglais (French + English) and Denglish (Deutsch + English). The image below illustrates the current span of code-mixed languages across the world.
According to a recent report by the International Data Corporation (IDC), 80% of online data is textual. A majority of this online text is informal, and given the rapid spread of code-mixed usage across the world, a large share of it is likely to be code-mixed. State-of-the-art NLP models and applications, however, are built for single languages (the monolingual case), especially English. This makes it very difficult to harness and process this huge chunk of online data, and makes it imperative to build models and applications for code-mixed languages.
In this article we discuss the nuances of code-mixing, the major challenges in this domain, and how we have overcome some of these challenges to build our novel code-mixed models and applications, achieving improvements of 82% and 43% over the existing state-of-the-art models for Hinglish and Spanglish respectively.
What is Code-mixing?
The contemporary, evolved state of informal language is the amalgamation of languages at a syntactic level, called code-mixing or language mixing. Code-mixing is a social norm observed mainly in multilingual societies, and it is quite prevalent in social media conversations in multilingual regions such as India, Europe, the U.S., Canada, and Mexico.
Multilingual people who are non-native English speakers tend to mix languages by typing their native language phonetically in the Roman script and inserting anglicisms into it. This happens largely because certain words in particular languages are more popular and better accepted. Although popularity and acceptance of phrases do not dictate every instance of code-mixing, they are the primary reason for mixing languages. Let us now look at some aspects of code-mixed languages:
1. Switching Points: Switching points, or code-switch points, are junctions in a code-mixed text where the language switches. For example:
i. Spanglish: El [SPA] sign [ENG] (the sign)
ii. Hinglish: Relax [ENG] karo [HIN] (do relax)
Switching points pose a major impediment when processing or building applications for code-mixed languages: they occur rarely, which makes it very difficult for AI engines to learn them.
2. Major Language: In a code-mixed text, the language which has the maximum number of words is the major language.
3. Matrix language: The matrix language in a code-mixed context is the primary language of the speaker. The grammar or structure of the code-mixed text follows that of the matrix language.
4. Code-Mixing Index (CMI): When languages mix, they mix to varying degrees. When comparing different NLP techniques for code-mixing, it is essential to have a measure of how mixed the data is, particularly since the error rates of language processing applications can be expected to rise as the level of code-mixing increases. To measure the level of mixing in a code-mixed corpus, the Code-Mixing Index (CMI) was defined in the paper “Comparing the Level of Code-Switching in Corpora”, using parameters such as the total number of words, the number of switching points, and the number of words in each language.
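To make the metric concrete, here is a minimal sketch following the utterance-level CMI formula of Gambäck and Das from the paper cited above; the per-token language tags and the "UNIV" label for language-independent tokens (punctuation, hashtags, etc.) are illustrative conventions.

```python
from collections import Counter

def cmi(language_tags):
    """Utterance-level Code-Mixing Index (Gambäck and Das).

    `language_tags` holds one language label per token, e.g.
    ["ENG", "HIN", "UNIV"]; "UNIV" marks language-independent tokens
    such as punctuation, hashtags, or numbers.
    """
    n = len(language_tags)
    counts = Counter(t for t in language_tags if t != "UNIV")
    u = n - sum(counts.values())      # number of language-independent tokens
    if n == u or not counts:          # no language-specific tokens at all
        return 0.0
    max_wi = max(counts.values())     # tokens of the dominant language
    return 100.0 * (1.0 - max_wi / (n - u))

# "Relax karo yaar, sab theek ho jayega" -> one ENG token, six HIN tokens
print(cmi(["ENG", "HIN", "HIN", "HIN", "HIN", "HIN", "HIN"]))  # ~14.3
```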
Code-mixed Language Models and Applications
Language models impart an understanding of a language and its intricacies to a machine. Traditionally they have been built using the statistical properties of word sequences; these are called Statistical Language Models (SLMs). In the recent past, language models trained using neural networks, called Neural Language Models (NLMs), have emerged as a major player in the AI domain. With the advent of the Transformer architecture (Google, 2017), NLMs such as BERT (Google, 2019) and GPT-3 (OpenAI, 2020) have powered exceptionally accurate AI systems. These models have completely changed the manner in which NLP applications are built, bringing about a revolution in the NLP-AI space.
The accuracy of a language model is measured in terms of how "confused" the model is when predicting the next word given a context. This metric is called perplexity (PPL); a better language model produces a lower perplexity score.
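Formally, for a held-out sequence of N tokens, perplexity is the inverse probability of the sequence normalized by its length, i.e. the exponential of the average negative log-likelihood per token:

```latex
\mathrm{PPL}(w_1, \dots, w_N)
  = P(w_1, \dots, w_N)^{-\frac{1}{N}}
  = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log P\bigl(w_i \mid w_1, \dots, w_{i-1}\bigr)\right)
```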
The state-of-the-art language models handle only single languages. Code-mixed language models therefore need to be constructed as a building block for developing applications in code-mixed languages. Although language models have evolved significantly over the past decade, code-mixed language modeling remains a sparsely explored domain.
Given the spectrum of multilingual societies across the world, we review the relevant work in code-mixing for some of the most widely spoken languages, viz. Mandarin Chinese, Spanish, and Hindi. For Mandarin-English code-mixed language models, Genta et al. report a perplexity of 127. For Spanglish (the Spanish-English language pair), the state of the art is the work by Gonen and Goldberg, who report a perplexity of 40. For Hinglish (the Hindi-English language pair), there are only a handful of significant contributions, among which the best perplexity, 772, is reported by Pratapa et al.
Once code-mixed language models are built, a plethora of AI applications can be built from these models. A few of such applications are outlined below:
- Chatbot — Chat interfaces across domains in code-mixed languages
- Speech Recognition — Converting audio signals to code-mixed text
- Machine Translation — Translating code-mixed languages to matrix language
- Natural Language Generation — Generating code-mixed text
- Text Summarization — Creating a summary of a large body of code-mixed text
- Intelligent Personal Assistant — Virtual assistants like Siri and Alexa which can understand and converse in code-mixed language and context
- Smart Keyboard — Code-mixed keyboards typically used for typing on smart devices, equipped with word completion and next word suggestions
- Spell/Grammar Checker — Automatic detection and correction of incorrect spelling and grammar for code-mixed languages
Challenges in code-mixing
In addition to mixing languages at the sentence level, code-mixing is also fairly common at the word level. This linguistic phenomenon poses a great challenge to conventional NLP systems. The major challenges in building models and applications for code-mixed languages are as follows:
Mixed words across languages: As code-mixing follows no predefined set of rules, mixed words are frequently observed. For example, Hindi roots are used with English inflections, as in darofy: ‘dar’ (fear) + the English suffix ‘-fy’.
Mixture of the grammars of the constituent languages: As the mixing of languages is informal in nature, users tend to mix the sentence structures of the member languages. For example:
a. Code-mix: Main khatam karunga job
b. English translation: I will finish the job
c. Correct form: In the example above, the sentence “I will finish the job” is written in code-mixed Hinglish but with the sentence structure of English. The correct form following Hindi grammar would have been: “Main job khatam karunga”
Multiple word forms: When languages with native scripts are code-mixed with English in particular, the transliteration of native-language words into the Roman script varies in form, because no standard romanized spelling exists for such languages. This is observed especially when Indian languages like Hindi and Bengali are mixed with English. For example, the Hindi word ‘है’ (English: ‘is’) may appear in romanized form as hain, hai, hei, hein, or he.
Switching points: Switching points are the tokens in a text where the language switches. They occur rarely in the corpus, and such sparse occurrences make it difficult for any language model to learn their probabilities and context. As a result, code-mixed language models fail precisely at switching points, which is the primary bottleneck for code-mixed models.
Dataset: As research in the code-mixed domain is very limited, datasets, especially large ones, that can be used to train language models for code-mixed languages are not available.
Our Work on Code-mixing
In this section, we lay out the initial path we charted for ourselves, the novel models and applications we created, the accuracies we achieved, and the research we have published as well as the work currently under submission.
Our Roadmap
When we set out to build models and applications for code-mixing, we decided to work on Spanglish and Hinglish, since Spanish, English, and Hindi are three of the top five most spoken languages globally. We were impeded by the lack of quality code-mixed data for these languages, hence we first planned to extract a reasonably large corpus for our research and then build models and applications on top of it.
Datasets
As code-mixing is an almost unexplored area of research, not a lot of data is available. A Mandarin-English dataset called SEAME was created in 2010 with around 110 hours of speech transcripts. For Spanglish, the LinCE dataset provides a small but reasonably good quality resource. No similar dataset was available for Hinglish.
Several research works on code-mixed languages have been built on synthetic data. While accumulating data for our models, our focus was on extracting naturally occurring and naturally distributed data, which ensures that the resulting models capture the real-world features of code-mixed languages. We used Twitter as our data extraction platform.
For Hinglish, we extracted code-mixed tweets for over six months, obtaining a total of around 6 million code-mixed sentences. As far as we know, this is the largest code-mixed dataset available. The table on the left shows details of the Hinglish and Spanglish datasets across different CMI ranges (the degree of mixing of languages in a code-mixed context, as discussed previously).
Statistical Language Models
After the data extraction phase, the obvious next step was to build language models. Since code-mixed language modeling is a relatively new domain where not much exploration has been done, we decided to start with statistical language models, the traditional form of modeling languages using probabilistic or count-based measures. For details of the techniques discussed in this section, please refer to our accepted work at LREC 2020.
We experimented with several smoothing techniques for our n-gram language models, viz. Kneser-Ney (KN), Witten-Bell (WB), Absolute Discounting (ABS), and Good-Turing (GT), on our code-mixed Hinglish dataset, as exhibited in the table below. The best perplexity (655) was obtained using Good-Turing, which is a 15% improvement over the existing state of the art for Hinglish language models.
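For readers who want a feel for these experiments, the sketch below trains Kneser-Ney and Witten-Bell smoothed trigram models with NLTK; the toy sentences stand in for the full code-mixed Twitter corpus, and the exact toolkit and hyper-parameters behind our reported numbers may differ.

```python
# A minimal sketch of smoothed n-gram language modeling with NLTK.
from nltk.lm import KneserNeyInterpolated, WittenBellInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3
train_sents = [
    ["relax", "karo", "yaar", "sab", "theek", "ho", "jayega"],
    ["kal", "office", "mein", "meeting", "hai"],
]
test_sents = [["sab", "theek", "hai"]]

for Model in (KneserNeyInterpolated, WittenBellInterpolated):
    # Build padded n-grams and a vocabulary, then fit the smoothed model.
    train_ngrams, vocab = padded_everygram_pipeline(ORDER, train_sents)
    lm = Model(ORDER)
    lm.fit(train_ngrams, vocab)
    # Evaluate perplexity on the padded trigrams of the held-out sentences.
    test_ngrams = [
        ng for sent in test_sents
        for ng in ngrams(pad_both_ends(sent, n=ORDER), ORDER)
    ]
    print(Model.__name__, lm.perplexity(test_ngrams))
```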
In order to further improve the perplexity of our statistical language models, we experimented with two novel techniques:
Minority Positive Sampling: We have already established that switching points are the primary bottleneck for code-mixed language models. The challenge is that these switching points occur rarely in the dataset, making it difficult for the models to learn their distributions or patterns. To circumvent this drawback, we sample the low-frequency switching points from the corpus, which we call minority positive sampling for switching points. The graph displayed below illustrates the performance of the sampling-based statistical language models.
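One simple way to realize such sampling is to boost the counts of n-grams that span a switching point before estimating the model, as in the sketch below; the boost factor and the count-based formulation are illustrative assumptions, not the exact scheme from our paper.

```python
# Illustrative count boosting for switching-point bigrams (an assumption
# made for exposition; the paper's sampling scheme may differ).
from collections import Counter

def boosted_bigram_counts(tagged_sents, boost=3):
    """Count bigrams, weighting switching-point bigrams more heavily.

    `tagged_sents`: list of sentences, each a list of (token, language)
    pairs. Bigrams whose two tokens carry different language tags
    (i.e. switching points) are counted `boost` times instead of once,
    so the resulting model assigns them more probability mass.
    """
    counts = Counter()
    for sent in tagged_sents:
        for (w1, l1), (w2, l2) in zip(sent, sent[1:]):
            weight = boost if l1 != l2 else 1
            counts[(w1, w2)] += weight
    return counts

sents = [[("relax", "ENG"), ("karo", "HIN"), ("na", "HIN")]]
print(boosted_bigram_counts(sents))
# Counter({('relax', 'karo'): 3, ('karo', 'na'): 1})
```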
Bidirectional Statistical Model: Bidirectional approaches to language modeling are predominantly used in neural language models, where the text is scanned both left to right and right to left during training. We used the same strategy for our statistical language model, obtaining a perplexity of 439. This is a 43% improvement over the existing state of the art for Hinglish language models.
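As an illustration, the sketch below scores a sentence with a forward n-gram model and a second model trained on reversed sentences, averaging the two directions' log-scores; the averaging is an assumption made here for exposition, and the snippet reuses the NLTK interface shown earlier.

```python
# Illustrative bidirectional scoring with two NLTK n-gram models:
# `lm_fwd` is trained on sentences as-is, `lm_bwd` on reversed sentences.
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

def sentence_logscore(lm, tokens, order=3):
    """Sum of per-n-gram log-scores of one sentence under an NLTK LM."""
    return sum(
        lm.logscore(ng[-1], ng[:-1])
        for ng in ngrams(pad_both_ends(tokens, n=order), order)
    )

def bidirectional_logscore(tokens, lm_fwd, lm_bwd, order=3):
    """Average of forward and backward sentence log-scores."""
    forward = sentence_logscore(lm_fwd, tokens, order)
    backward = sentence_logscore(lm_bwd, list(reversed(tokens)), order)
    return 0.5 * (forward + backward)
```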
Neural Language Models
Language modeling (LM) using neural networks has come a long way from (Bengio et al., 2003) to recent large transformer-based (Vaswani et al., 2017) pre-trained language models such as the Generative Pre-trained Transformer-2 (GPT-2) (Radford et al., 2019) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). Neural network-based approaches perform better than classical statistical models, both on standard language modeling tasks and on other challenging NLP applications. In this section we discuss our work on modeling code-mixed languages using deep learning frameworks:
Transformer-based Neural Language Models: The Transformer architecture, and the models based on it, are the state of the art in neural language modeling. However, neural language model architectures like BERT and GPT-2 are extremely data hungry. This is evident from the table below, where we obtain perplexity scores of 556.69 for GPT-2, 1398.28 for BERT, and 350.37 for RoBERTa (Liu et al., 2019) on code-mixed data. This is in contrast to monolingual neural language models using the transformer architecture, where perplexity scores as low as 16.4 (Krause et al., 2019) have been reported for English.
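As a reference point, the snippet below shows how the perplexity of a pre-trained transformer can be measured on a code-mixed sentence with the Hugging Face transformers library; the off-the-shelf gpt2 checkpoint and the toy sentence are illustrative, and our reported numbers come from models trained or fine-tuned on the corpora described above.

```python
# A minimal sketch of perplexity evaluation for a pre-trained causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "relax karo yaar, sab theek ho jayega"
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Supplying `labels` makes the model return the mean cross-entropy
    # loss over the predicted tokens.
    out = model(**enc, labels=enc["input_ids"])

print("perplexity:", torch.exp(out.loss).item())
```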
Recurrent Neural Network based Language Models: Over the last decade, Recurrent Neural Networks (RNNs) have yielded highly accurate language models. With the advantage of abundant data on our side, we resorted to RNN-based approaches to language modeling (Mikolov et al., 2010). The size of our corpus (6M sentences) was large enough to train an RNN-based neural language model. Moreover, our corpus consists of code-mixed tweets, which have a maximum length of 140 characters, and the average number of words in a sentence is 11 for Hinglish and 7 for Spanglish. Such a setting does not invoke the shortcomings of RNN-based language models that inspired the transformer architecture. Using a BiLSTM architecture, we achieved improvements of 66% and 20% over the baselines for Hinglish and Spanglish respectively, as shown in the table below.
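A minimal PyTorch sketch of a recurrent language model of this kind is given below; the hyper-parameters are illustrative, and the model is shown unidirectional for brevity, whereas our reported results use a BiLSTM trained on the real corpora.

```python
# A toy recurrent language model in PyTorch (illustrative dimensions).
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> logits over the next token
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.proj(hidden)

vocab_size = 10_000
model = RNNLanguageModel(vocab_size)
tokens = torch.randint(0, vocab_size, (8, 12))           # a random toy batch
logits = model(tokens[:, :-1])                            # predict next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
print("perplexity on the toy batch:", torch.exp(loss).item())
```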
Our Novel Fusion Language Model: As code-mixed languages exhibit behavior that is unique and divergent from other languages, building code-mixed language models using existing strategies and architectures did not suffice. In order to devise a novel strategy for code-mixed languages, we needed to understand the root cause behind the failure of state-of-the-art language modeling techniques on code-mixed data. Analyzing code-mixed languages in further detail, we found that switching points are the major bottleneck when modeling them. We therefore devised a novel strategy to counter the effect of switching points on code-mixed language models: we model switching points and the text without switching points independently, and subsequently fuse these language models to create our final code-mixed neural language model, called the Fusion Language Model. The perplexity scores obtained for Hinglish and Spanglish using the Fusion Language Model are displayed in the table below. We achieve perplexity scores of 140 and 23 for Hinglish and Spanglish respectively, an improvement of 82% and 43% over the existing state-of-the-art neural language models for these languages.
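A deliberately simplified view of the fusion step is sketched below: two component models, one specialized for switching points and one for the remaining text, are combined by linear interpolation of their next-word probabilities. The interpolation scheme, the lm_switch and lm_base objects, and the is_switching_context helper are hypothetical placeholders; the exact fusion strategy is described in our paper.

```python
# A simplified, hypothetical sketch of fusing two component language models.
def fused_next_word_prob(word, context, lm_switch, lm_base,
                         is_switching_context, alpha=0.7):
    """P(word | context) from a weighted mix of the two component models.

    lm_switch: model trained around switching points (placeholder object)
    lm_base:   model trained on the remaining text   (placeholder object)
    is_switching_context: predicate saying whether the current context
    ends at a language boundary (placeholder helper).
    """
    p_switch = lm_switch.prob(word, context)
    p_base = lm_base.prob(word, context)
    # Lean on the switching-point model when the context is a boundary.
    weight = alpha if is_switching_context(context) else 1.0 - alpha
    return weight * p_switch + (1.0 - weight) * p_base
```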
Switching Point Prediction Model
As part of our novel Fusion Language Model for code-mixed languages, we built a prediction model for switching points. The state-of-the-art f-measure for switching point prediction for Spanglish, reported by Solorio and Liu (2008), is 72%. We used an attention-based bidirectional LSTM to obtain an f-measure of 84%, an improvement of 17%. The detailed results are shared in the table below.
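For intuition, the sketch below frames switching-point prediction as per-token binary tagging with an attention-augmented BiLSTM; the dimensions and the way attention is wired in are illustrative assumptions and do not necessarily match the configuration behind the numbers above.

```python
# A toy attention-based BiLSTM tagger: each token is classified as
# "switch" (1) or "no switch" (0). Dimensions are illustrative.
import torch
import torch.nn as nn

class SwitchingPointTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(4 * hidden_dim, 2)   # [hidden; context] -> {0, 1}

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))          # (B, T, 2H)
        scores = torch.softmax(self.attn(h), dim=1)        # attention over time
        context = (scores * h).sum(dim=1, keepdim=True)    # (B, 1, 2H)
        context = context.expand(-1, h.size(1), -1)        # broadcast per token
        return self.out(torch.cat([h, context], dim=-1))   # per-token logits

tagger = SwitchingPointTagger(vocab_size=10_000)
logits = tagger(torch.randint(0, 10_000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 2])
```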
Applications
As discussed previously, once language models for code-mixed languages are built, a plethora of applications can emanate from these models. The applications we have built from our code-mixed language models are outlined below:
The MixTa Chatbot
We have built a chatbot that is capable of both understanding and generating code-mixed languages, powered by our novel code-mixed language models. The chatbot is called MixTa (codeMix Tahc, read as "codemix chat"). MixTa has an online interface and is available across several channels such as WhatsApp and Facebook Messenger. Apart from code-mixed languages, it can also converse in English, Hindi (native and romanized script), Bengali (native and romanized script), and Telugu (native and romanized script). The prominent features of the chatbot are as follows:
- Customizable for a wide variety of domains
- Context understanding from conversation flow
- Automatic spell check/correction feature
- Support for multiple Indian languages
- Support for code-mixed Indian languages
- First deep learning-powered multilingual chatbot with multiple modalities
- Novel code-mixed language models used in the backend of the chatbot
MixTa uses our code-mixed language models to generate code-mixed word representations (embeddings), and uses these to detect the user's intent in conversations. The architecture and flow diagram for the MixTa chatbot are given below:
The Antaryami Smart keyboard
Smart keyboards with features like auto-completion, auto-correction, and next-word suggestion are non-existent for users of mixed languages. We have built a novel code-mixed smart keyboard for Hinglish called Antaryami. The name Antaryami means "the omniscient one", inspired by the idea that the smart keyboard knows, or predicts, what one is thinking. We have captured the working of the Antaryami smart keyboard in this video.
We used both our novel statistical and neural Hinglish language models for Antaryami, generating two separate smart keyboards: the Antaryami Statistical Keyboard (ASK) and the Antaryami Neural Keyboard (ANK). We then compared the performance of Antaryami against the Google Hinglish Keyboard (GHK). We observed that GHK was better for English contexts within Hinglish, whereas Antaryami performed better when the context was predominantly Hinglish. For Hinglish, Hindi is the matrix language in almost all dataset samples, hence a better Hinglish keyboard should successfully predict Hindi context. We also designed a few context settings (categories) based on our dataset to compare the keyboards, as shown in the table below. The Antaryami Neural Keyboard outperforms the Google Hinglish Keyboard by 59% overall, across all language and context settings.
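At its core, the next-word suggestion feature ranks candidate words by the language model's probability given the words typed so far, roughly as in the sketch below; the lm object with a prob(word, context) method and the fixed trigram context are hypothetical stand-ins for our Hinglish language models.

```python
# A minimal sketch of next-word suggestion for a smart keyboard.
def suggest_next_words(typed_words, lm, vocabulary, k=3):
    """Return the top-k next-word suggestions for the typed prefix.

    `lm` is a hypothetical language model exposing prob(word, context);
    `vocabulary` is the set of candidate words to rank.
    """
    context = tuple(typed_words[-2:])          # e.g. a trigram-style context
    scored = ((lm.prob(word, context), word) for word in vocabulary)
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Usage (with a hypothetical lm and vocabulary):
# suggest_next_words(["relax", "karo"], lm, vocabulary)
```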
Publications
Published: Minority Positive Sampling for Switching Points — an Anecdote for the Code-Mixing Language Modeling (LREC, 2020)
Submitted: A Tale of Two Languages: Switching-Point Prediction based Fusion Language Model for Code-mixed Languages (CoNLL 2021)