Automatic Generation of Meeting Minutes from Online Meeting Transcripts

Published by Harihara Vinayakaram Natarajan, Kiran P V N N, Kunal Kasodekar, Ayushri Arora and Jaanam Haleem

Wipro Tech Blogs
Oct 26, 2021

Online meetings have made communication easier, more effective, and more efficient. The COVID-19 pandemic and the resulting “Work from Home” (WFH) orders have led to a significant increase in this mode of communication. For example, as of November 2019, there were 20 million active users on Microsoft Teams each day. By April 2021, this number had increased to 145 million, an increase of 625% in roughly 17 months. Most modern web-conferencing tools can record meetings and generate meeting transcripts. These transcripts can be used to generate a summary (minutes) of the meeting, along with the action items (a “to-do list”). The latest Natural Language Processing (NLP) models make this achievable, and applications that generate meeting summaries and action items automatically would be very useful.

Photo by LinkedIn Sales Solutions on Unsplash

Neural-network-based language models have shown a lot of promise in Natural Language Generation (NLG) tasks, and breakthroughs are occurring at an unprecedented rate. This has been made possible by increased resources in the form of large text datasets and cloud platforms on which large models can be trained. The most important factors, however, are the invention of the transformer architecture and the use of transfer learning in NLP. In transfer learning, a model is first pre-trained on large datasets and the pre-trained model is then applied to the task at hand. Within deep learning, pre-training is the de facto approach for transfer learning. The pre-trained model is fine-tuned for tasks such as text classification, part-of-speech tagging, named entity recognition, text summarization, and question answering. This technology can be used to prepare the meeting summary and action items from the meeting transcript generated by the web-conferencing software.

About Language Models

Recently, BART, T5, and GPT-3 have demonstrated the efficacy of transformer models, pre-trained on large-scale datasets, on various NLP tasks.

BART: BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder and a left-to-right decoder. Pre-training corrupts the input text and trains the model to reconstruct the original: the order of the original sentences is randomly shuffled, and a novel in-filling scheme replaces spans of text with a single mask token. For the summarization task, the pre-trained BART model is fine-tuned on CNN/Daily Mail data.
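
The two corruption steps described above can be illustrated with a minimal sketch. This is not BART's actual preprocessing code: the function names, the `<mask>` string, and the fixed span position are assumptions made for illustration, and a real implementation would choose span positions and lengths at random.

```python
import random

MASK = "<mask>"  # illustrative mask token string

def shuffle_sentences(sentences, rng):
    """Sentence permutation: return the sentences in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def infill_span(tokens, start, length):
    """Text in-filling: replace tokens[start:start+length] with ONE mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

rng = random.Random(0)
sentences = [
    "The meeting started at ten.",
    "Budgets were reviewed.",
    "Action items were assigned.",
]
noisy_order = shuffle_sentences(sentences, rng)

tokens = "Budgets were reviewed by the finance team".split()
corrupted = infill_span(tokens, start=1, length=3)

print(noisy_order)
print(" ".join(corrupted))  # Budgets <mask> the finance team
```

Because a span of any length collapses into a single mask token, the model also has to work out how many tokens are missing, not just which ones.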

T5 (Text-to-Text Transfer Transformer Model): T5 is an encoder-decoder model that converts all NLP problems into a text-to-text format. It is trained using teacher forcing, which requires an input sequence and a target sequence. The input sequence is fed to the model using input_ids. The target sequence is shifted to the right, i.e., prepended with a start-sequence token, and fed to the decoder using decoder_input_ids. In teacher-forcing style, the target sequence with the EOS token appended serves as the labels. The PAD token is used as the start-sequence token.
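
The relationship between the labels and decoder_input_ids can be sketched in a few lines of plain Python. The token ids below are made up for illustration, and the helper names `build_labels` and `shift_right` are hypothetical; the ids 0 for PAD and 1 for EOS are assumed values for this sketch.

```python
PAD_ID = 0  # assumed id; the PAD token doubles as the decoder start token
EOS_ID = 1  # assumed id for the end-of-sequence token

def build_labels(target_ids):
    """Labels are the target sequence with the EOS token appended."""
    return list(target_ids) + [EOS_ID]

def shift_right(labels):
    """Decoder input is the labels shifted right, starting with the PAD token."""
    return [PAD_ID] + labels[:-1]

target_ids = [37, 1023, 19]  # hypothetical token ids for the target text
labels = build_labels(target_ids)
decoder_input_ids = shift_right(labels)

print(labels)             # [37, 1023, 19, 1]
print(decoder_input_ids)  # [0, 37, 1023, 19]
```

At each position the decoder sees the correct previous tokens and is trained to predict the label at that same position, which is what teacher forcing means here.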

GPT-3 (Generative Pre-Trained Transformer 3): GPT-3, unlike BART and T5, uses a decoder-only transformer rather than a full encoder-decoder model. It is trained with about 45 terabytes of text data from multiple sources, including Common Crawl, WebText2, Wikipedia, and book corpora. GPT-3 is not one single model, but a family of models, each with a different number of trainable parameters. The largest version, GPT-3 175B (commonly just “GPT-3”), has 175 billion parameters and 96 attention layers, and was trained with a batch size of 3.2 million tokens. While language models like BERT use the encoder half of the transformer to generate embeddings from raw text, which can be used in other machine learning applications, the GPT family uses the decoder half, which generates text one token at a time from the text that precedes it.
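
The practical difference between the encoder half and the decoder half shows up in which positions each token may attend to. The following sketch builds the two attention masks as plain nested lists; the function names are illustrative and are not taken from any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Render a mask as rows of 1s and 0s."""
    return "\n".join(" ".join("1" if allowed else "0" for allowed in row) for row in mask)

print(show(bidirectional_mask(3)))
print()
print(show(causal_mask(3)))
```

With the causal mask, the representation of a token never depends on later tokens, which is what allows a decoder-only model to generate text left to right.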

Experiment Details

We conducted an experiment to evaluate how effectively these models generate meeting summaries and action items. Using the transcript generation feature of Microsoft Teams, transcripts were generated for various internal meetings. The transcripts were manually corrected to remove errors introduced by the built-in speech-to-text engine of Microsoft Teams. The corrected transcripts were then used for automatic generation of the meeting summary and “to-do list”. To achieve this, a solution was conceptualized, designed, and developed, in which the APIs of these language models were used to generate the meeting minutes and to-do lists.
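
The overall flow can be pictured with a small sketch. It is not the solution built for the experiment: the instruction strings, the function names, and the `echo_model` stand-in are all assumptions, and a real model API call would take the place of the `model` callable.

```python
from typing import Callable, Dict

# Hypothetical instructions; the wording used in the experiment is not known.
SUMMARY_INSTRUCTION = "Summarize the following meeting transcript as meeting minutes:"
ACTION_INSTRUCTION = "List the action items from the following meeting transcript:"

def build_prompt(instruction: str, transcript: str) -> str:
    """Combine an instruction with the corrected transcript text."""
    return f"{instruction}\n\n{transcript.strip()}\n"

def generate_minutes(transcript: str, model: Callable[[str], str]) -> Dict[str, str]:
    """Call the model once for the summary and once for the to-do list."""
    return {
        "summary": model(build_prompt(SUMMARY_INSTRUCTION, transcript)),
        "action_items": model(build_prompt(ACTION_INSTRUCTION, transcript)),
    }

def echo_model(prompt: str) -> str:
    """Stand-in for a real model call: returns the first line of the prompt."""
    return prompt.splitlines()[0]

result = generate_minutes("Asha: I will send the report by Friday.", echo_model)
print(result["summary"])
print(result["action_items"])
```

Keeping the model behind a plain callable makes it easy to run the same transcript through different models and compare their outputs.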


The summarization output given by each of these models was different, as each model has its own architecture and weights. The output was evaluated against the following parameters:

  • Abstraction for Summarization: This parameter evaluates whether the model condenses the transcript, leaving out words without sacrificing the sense.
  • Extraction for To-Do List: This parameter evaluates whether the model picks out the important points (action items) from the transcript.
  • Style: This parameter evaluates the discourse structure, narrative flow, and factuality.
  • Adequacy: This parameter evaluates whether the essence of the transcript is adequately represented in the output.
  • Coherence: This parameter evaluates whether the generated text is logically organized. It should not just reproduce the input text.

The performance of each model was compared against the above parameters and the results are documented with an explanation below.

Let’s dive deeper into each parameter.

Abstraction: The generated minutes are expected to provide an abstractive summary of the meeting proceedings. The model should generate the summary in its own words and sentences, compressing and reformulating the input transcript while preserving the meaning. BART and T5 fared poorly; they often reproduced the input text. GPT-3, however, performed well in this area. It displayed an ability to paraphrase information and to include external knowledge, thanks to its scale (175 billion parameters) and the large volume of data with which it was pre-trained.
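
One simple way to quantify how much of a summary is copied rather than reformulated is the share of its n-grams that do not occur in the source. The sketch below is only an illustration of that idea, not the evaluation method used in this experiment; the function names and example sentences are made up.

```python
def ngrams(text, n):
    """Return the set of word n-grams in the text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def novel_ngram_ratio(source, summary, n=2):
    """Fraction of the summary's n-grams that never appear in the source."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(source, n)
    return len(novel) / len(summary_ngrams)

source = "we will ship the release on friday after testing is complete"
copied = "we will ship the release on friday"
paraphrased = "the team plans a friday launch once tests pass"

print(novel_ngram_ratio(source, copied))       # 0.0
print(novel_ngram_ratio(source, paraphrased))  # 1.0
```

A ratio near zero indicates that the output mostly reproduces the input, while a higher ratio indicates paraphrasing. It says nothing about whether the paraphrase is faithful, so it complements rather than replaces a human review.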

Extraction: For to-do list generation, the model is expected to select a number of segments (action items) from the meeting transcript. BART and T5 could not identify the action items, whereas GPT-3 was able to pick out some of them and generated a decent summary, although it did miss a few of the action items.

Style: This parameter evaluates whether the generated text has good discourse structure and narrative flow, whether it is factual, and whether its tone is appropriate. BART and T5 often simply reproduced the input transcript. GPT-3 performed very well on this parameter.

Adequacy: This parameter evaluates how much of the meaning expressed in the transcript was also expressed in the generated summary. Because of the token limits of the models, the generated text was inadequate. While the useful information was spread across multiple places in the input transcript, T5 and BART repeatedly reproduced the sentences at the beginning of the transcript. GPT-3 fared better but generated a very abridged summary, often of just 3 to 4 sentences. Neither behavior is adequate for a summarization task.
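
A common way to work around a token limit is to split the transcript into chunks that fit, summarize each chunk, and join the partial summaries. The sketch below illustrates that idea only; it was not part of the experiment described here. Word counts stand in for token counts, and `first_words` is a hypothetical placeholder for a real summarization call.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text, summarize, max_words):
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))

def first_words(chunk, count=2):
    """Placeholder summarizer: keeps only the first few words of a chunk."""
    return " ".join(chunk.split()[:count])

transcript = "alpha beta gamma delta epsilon zeta eta"
print(chunk_words(transcript, 3))                  # ['alpha beta gamma', 'delta epsilon zeta', 'eta']
print(summarize_long(transcript, first_words, 3))  # alpha beta delta epsilon eta
```

Chunking ensures that later parts of the meeting reach the model at all, at the cost of losing context that spans chunk boundaries.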

Coherence: This parameter evaluates how well the text fits the provided context, that is, whether the sequences make sense. Unlike the other models, GPT-3 does not merely reproduce key sentences but generates entirely new text. However, GPT-3 output is not consistent: two summaries generated one after the other with the same prompt and the same hyperparameters will seldom be identical.
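
This variability is what one would expect when each output token is sampled from a probability distribution instead of always taking the most likely token. The toy sketch below illustrates the effect with made-up tokens and scores; it is not GPT-3's decoding code, and treating a temperature of 0 as "always pick the top token" is a convention assumed for this example.

```python
import math
import random

def softmax(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(tokens, scores, temperature, rng):
    """Greedy choice at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return tokens[scores.index(max(scores))]
    return rng.choices(tokens, weights=softmax(scores, temperature), k=1)[0]

tokens = ["approved", "discussed", "postponed"]  # made-up candidate tokens
scores = [2.0, 1.5, 0.5]                         # made-up model scores
rng = random.Random()                            # unseeded, so runs differ

greedy = {pick_token(tokens, scores, 0, rng) for _ in range(20)}
sampled = {pick_token(tokens, scores, 1.0, rng) for _ in range(20)}

print(greedy)   # always {'approved'}
print(sampled)  # typically contains more than one distinct token
```

Since a summary is built from many such choices in sequence, small differences early on compound, which is consistent with two runs on the same prompt rarely matching word for word.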

Next Steps

As for next steps, this solution can be advanced if the quality of transcript generation in Microsoft Teams improves. It can also be tested with other Automatic Speech Recognition (ASR) engines in place of the native engine available in Microsoft Teams. The most effective ASR engines for this task would be trained on large datasets that cover linguistic diversity, as there are thousands of languages in the world, and even a single language varies from speaker to speaker. Building models that robustly and equitably represent language, with both its major and subtle variations, remains an open challenge.

Further, it is important to train language models like GPT-3 with domain-specific knowledge, depending on the area in which they are deployed. While GPT-3 is remarkably large and powerful, it does not keep learning after training, so a continuous learning loop on top of the pre-trained model would be useful. Transformer architectures are also limited by input text length and slow inference.

In response to the barriers to developing such models, there are many large, ongoing open-source efforts that seek to build powerful models that are free to use.

Final Thoughts

Though the Microsoft Teams ‘Transcript Generation’ option is surprisingly versatile in its linguistic knowledge, obtained from pre-training, it is limited in adapting to language variations, mostly because of differences in pronunciation. The conversational and grammatical constructions that people reach for in meetings also manifest differently from those of written language. Therefore, it becomes necessary to manually review and correct the transcript, which is a major irritant that undermines the value proposition.

There are quite a few limitations to overcome before any of these language models can be used in production, but GPT-3 is a huge leap forward in the right direction. Sam Altman, who co-founded OpenAI with Elon Musk, summarized it well: “AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out”.

This experiment shows that models pre-trained on vast, internet-scale datasets are good at synthesizing text. For these next-generation language models to successfully address downstream tasks, they must capture and assimilate real-world information from different sources and domains and adeptly handle high volumes of data.