Davinci vs Curie: A comparison between GPT-3 engines for extractive summarization
By Jaanam Haleem, Kiran Venkata Narasimha Naga
“This blog post was not processed by a GPT-3 engine. It was written by a human.”
The science of extracting information from textual data has changed drastically over the past decade. According to a study by McKinsey Digital, CEOs spend almost 20% of their time analyzing operational data and reviewing status reports, work that could be automated. The application of language modeling in NLP (Natural Language Processing) has made such automation possible. The introduction of pretrained language models pushed forward the limits of language understanding and text generation. Progress in NLP language models is driven not only by enormous boosts in computing capacity but also by innovative techniques that improve performance.
In the R&D team for the Office of the CTO at Wipro, we experiment with various language models to develop points of view, explore whitespaces, and build assets that address common industry needs. Previously, we published our solution for generating minutes of meetings and to-do lists from Microsoft Teams meeting transcripts.
This article discusses the results of an experiment we conducted to compare GPT-3 models and shares our view on the best available model for extractive summarization.
About GPT-3 and Its Models
GPT-3 (Generative Pre-trained Transformer 3) is one of the most popular language models. It is an autoregressive model that uses deep learning to produce human-like text. With over 175 billion parameters, it was trained on roughly 45 TB of text sourced from across the internet.
GPT-3's capabilities include creating articles, poetry, and stories from just a small amount of input text. It can also summarize text and write code in languages such as Python, CSS, and JSX. The quality of the text GPT-3 generates is so high that it can be difficult to tell whether it was written by a human; news articles produced by GPT-3 models can be hard to distinguish from real ones. Newer versions of GPT-3 can even edit or insert content into existing text, making it practical to use GPT-3 for content revision, such as rewriting a paragraph of text or refactoring code.
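To make the editing capability concrete, here is a minimal sketch of how the edits endpoint could be called, assuming the beta-era openai Python package (0.x); the API key placeholder and the example input and instruction are our own:

```python
import openai  # legacy (0.x) openai package from the GPT-3 beta era

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

# Ask the model to revise existing text in place rather than
# generate a completion from scratch.
response = openai.Edit.create(
    model="text-davinci-edit-001",
    input="The quick brown fox jmps over the lazy dog.",
    instruction="Fix the spelling mistakes.",
)
print(response["choices"][0]["text"])
```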
Fine-tuning, an option available through the API, provides higher-quality results. Since GPT-3 has been pre-trained on a vast amount of data, when given a prompt with just a few examples, it can often comprehend what task you are trying to perform and generate a likely completion. This is often called “few-shot learning.” Fine-tuning improves on few-shot learning by training on many more examples, achieving better results across a wide range of tasks.
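As a rough illustration of the difference, here is a hedged sketch of both approaches, again assuming the beta-era openai Python package; the example texts, the training file name, and the placeholders are our own:

```python
import openai  # legacy (0.x) openai package

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot learning: a couple of worked examples in the prompt let the
# model infer the task without any training.
few_shot_prompt = (
    "Text: The meeting covered budget overruns in Q3.\n"
    "Summary: Q3 budget overruns were discussed.\n\n"
    "Text: The committee approved the new hiring plan.\n"
    "Summary: The hiring plan was approved.\n\n"
    "Text: <your passage here>\n"
    "Summary:"
)
completion = openai.Completion.create(
    engine="curie", prompt=few_shot_prompt, max_tokens=64
)
print(completion["choices"][0]["text"])

# Fine-tuning: upload many prompt/completion pairs as JSONL and train.
# Each line of training.jsonl looks like:
#   {"prompt": "Text: ...\nSummary:", "completion": " ..."}
upload = openai.File.create(file=open("training.jsonl", "rb"), purpose="fine-tune")
openai.FineTune.create(training_file=upload["id"], model="curie")
```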
GPT-3 offers four sets of models that can understand and generate natural language. Each model has a different level of power suitable for different tasks. These models are Davinci, Curie, Babbage, and Ada. Davinci is the most capable model, whereas Ada is the fastest. The core capabilities of these models have inspired a slew of startups across a range of sectors, making GPT-3 the preferred language model for their workloads.
Our Experiments with GPT-3
We evaluated two major engines (Davinci and Curie) for extractive summarization, and the results are impressive.
Each of the base series models has different capabilities in terms of speed, output quality, and suitability for specific tasks.
1) Davinci Engine
Davinci is the most capable engine and can often perform tasks with less instruction than the other models require. It is best suited for applications that demand creative content generation and a deep understanding of the content.
Another area where Davinci shines is in understanding the intent of text. Davinci is excellent at solving logic problems and explaining the motives of characters. Davinci has solved some of the most challenging AI problems involving cause and effect.
2) Curie Engine
This dynamic, high-speed engine is extremely powerful. Curie excels at nuanced tasks like sentiment classification and summarization. It is also good at answering questions and serving as a general-purpose chatbot.
Test scenario
Davinci and Curie generated summaries for the prompt text below.
Note:
The prompt below contains intentional misspellings of “activities,” “slowing,” and “framework” to test how the engines respond to these mistakes.
Prompt fed to GPT-3:
Davinci result 1
Note:
The summary uses the pronoun “he” for the author.
Davinci result 2
Davinci result 3
Note:
The spelling error “framwork” from the prompt is corrected to “framework” in this summary.
Observations on summaries by Davinci
In its first attempt, Davinci summarizes by indicating the dilemmas faced by people using social media platforms in organizations and the author’s recommendation to use the Navigation Wheel framework to see what options are available. It also assigns the pronoun “he” to the author.
The second summary discusses the ethical dilemmas faced by decision-makers while using social media and the author’s suggestion to use the Navigation Wheel. It is pretty much identical to the first one.
The third summary states the dilemmas faced by organizations while using social media and the author’s recommendation to use the Navigation Wheel. It is not very different from the second summary, as the same two lines are repeated. Davinci was able to detect and correct the spelling error “framwork” from the prompt in this summary.
Curie result 1
Curie result 2
Curie result 3
Note:
The spelling error “sloing” from the prompt is not corrected in Curie’s third summary.
Observations on summaries by Curie
Curie’s first summary starts with the dilemmas that organizations face while using social media and rapid publication in organizations, followed by an introduction of the author and the author’s recommendation to use the Navigation Wheel framework before posting something on social media. It also provides a brief description of the framework.
In its second summary, Curie includes two key points: the author’s experiment of collecting 250 memos from students and the types of dilemmas people faced while using social media in organizations. Even though there are some similarities between the first and second summaries, they are framed quite differently.
The third summary by Curie mentions the author’s experiment, the author’s recommendation to slow down, and the framework of six questions that helps make better decisions. Curie did not correct the spelling error “sloing” from the prompt text in its summary. While there is some similarity to the second summary, both summaries are composed creatively.
Parameters that were used
- Temperature: 0.7
- Response Length: 265
- Top P: 1
- Frequency Penalty: 0
- Presence Penalty: 0
- Best Of: 1
- Show Probabilities: Off
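For reference, these settings map onto a completion request roughly as follows. This is a hedged sketch assuming the beta-era openai Python package; `prompt_text` stands in for the passage from the test scenario above:

```python
import openai  # legacy (0.x) openai package

openai.api_key = "YOUR_API_KEY"  # placeholder
prompt_text = "..."  # the test passage, ending with a summarization cue

# Generate one summary per engine using the parameters listed above.
for engine in ("davinci", "curie"):
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt_text,
        temperature=0.7,      # Temperature
        max_tokens=265,       # Response Length
        top_p=1,              # Top P
        frequency_penalty=0,  # Frequency Penalty
        presence_penalty=0,   # Presence Penalty
        best_of=1,            # Best Of
        # Show Probabilities: Off -> we simply do not request logprobs
    )
    print(engine, response["choices"][0]["text"].strip())
```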
Overall observations
The Davinci model provides a decent summary that is very short and nearly identical across attempts. The Curie model provides a good summary that covers the key elements: the author’s introduction, the research conducted by the author, the kinds of dilemmas organizations face when posting on social media, and a description of the framework. From the examples above, it is fair to say that the Curie model’s summaries cover all the key points, offering not only a better summary of the passage but also good variation across the three attempts.
While Curie did not detect the spelling errors, Davinci corrected them.
Although Curie is good at extractive summarization, it is not as good as Davinci at solving logical problems. Davinci is better at inference and logic; Curie is better at summarization, language translation, and sentiment classification.
We also tested these models for logical reasoning. Here are some samples.
Davinci — Example 1
Curie — Example 1
Davinci — Example 2
Curie — Example 2
The Davinci model is clearly better at reasoning when compared to the Curie model.
Although GPT-3 is considered one of the best language models, it has its downsides. Because it was trained on publicly available content, GPT-3 is subject to inaccuracies. The depth and accuracy of its responses on any subject depend on how well that subject is represented on the internet.
Comparison between the two models for summarization
While both Davinci and Curie are good at what they do, we found that Curie gives better results for text summarization: its output is more precise and captures more detail than Davinci’s. These are some of our observations:
a. Pronouns — Davinci assigns gendered pronouns (he/she) without any direct indication of gender; Curie mostly uses them only when gender is indicated. This can be seen in Davinci’s first summary in the test scenario above, where the pronoun “he” is assigned to the author.
b. Creativity — Davinci mostly provides a similar summary even after 4–5 attempts. Curie comes up with various creative summaries, allowing the user to pick the best ones. In the test passage, Curie’s results appear to be more creative.
c. Time — Curie is the faster model, averaging 2.4 seconds per summary against Davinci’s 3.5 seconds. These averages are for the prompt used in the example above, which is about 268 tokens (tokens are chunks of characters; 1,000 tokens correspond to about 750 words). A rough token-and-cost sketch follows this list.
d. Output quality — Davinci sees through spelling errors, typos, and bad examples and still gets the task right. Based on our testing, Curie overlooks such errors about 30% of the time; in our test scenario, Davinci corrected the spelling errors while Curie failed to recognize them.
e. Cost — Curie costs 1/10 the price of Davinci per API call.
f. Summarization — Curie is better at summarizing text accurately.
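To make the time and cost comparison concrete, here is a back-of-the-envelope sketch; the per-1,000-token prices are illustrative rates from the period and should be treated as assumptions, not current pricing:

```python
# Rough per-summary cost comparison for the test prompt above.
# Prices are illustrative beta-era rates per 1,000 tokens (assumption).
PRICE_PER_1K = {"davinci": 0.0600, "curie": 0.0060}

prompt_tokens = 268    # the test prompt above
response_tokens = 265  # the Response Length setting

for engine, price in PRICE_PER_1K.items():
    total_tokens = prompt_tokens + response_tokens
    cost = total_tokens / 1000 * price
    print(f"{engine}: ~{total_tokens} tokens -> ${cost:.4f} per summary")
```

At roughly 750 words per 1,000 tokens, the tenfold price gap compounds quickly over large volumes of summaries, which is part of what makes Curie attractive for this workload.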
Conclusion
When intelligently applied, GPT-3 is a powerful tool that can help create beneficial and innovative new services. Its goal is to reduce the complexity of applying machine learning.
Between the two models, Davinci is more sophisticated and is more likely to produce usable output that is complex and explores higher-level domain thinking and analysis.
However, Curie is computationally less demanding and is more than adequate for simpler prompts, reducing latency and API request costs.
We conducted multiple experiments and ran tests across a considerable variety of data, including passages, short stories, meeting transcripts, long passages, and podcasts. Our evaluation led us to pick Curie as the clear winner for extractive summarization. Not only were Curie’s summaries more creative and accurate, they were also faster and less expensive to generate.
References
https://beta.openai.com/docs/engines/gpt-3
https://www.scalr.ai/post/business-applications-for-gpt-3
https://docparser.com/blog/what-is-data-extraction/
https://towardsdatascience.com/the-beginners-guide-to-language-models-aa47165b57f9
https://www.techtarget.com/searchenterpriseai/definition/language-modeling
https://www.topbots.com/leading-nlp-language-models-2020/
https://wiprotechblogs.medium.com/generating-minutes-of-the-meeting-automatically-989ddf238e4e