Uncovering the Truth: Comparing ChatGPT’s Accuracy with Bard, Claude, and Copilot

To err is human, and apparently so is AI.

Generative artificial intelligence (AI) is known to be susceptible to factual errors. So if you’ve asked ChatGPT to generate 150 presumptive facts and don’t want to spend the entire weekend verifying each one by hand, what should you do?

I sought help from other AI models to get out of this predicament. Below, I’ll walk through the task, evaluate how each AI performed in a fact-checking role, and share some final reflections and cautions.

The Project

Recently, I tasked DALL-E 3, working within ChatGPT, with creating a unique image representing each of the 50 US states, along with three intriguing facts about each one. The results can only be described as “gloriously odd,” capturing a sense of abstract romanticism.
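
As an aside, the generation step can be reproduced outside the ChatGPT interface. Here is a minimal sketch using the OpenAI Python SDK; the model names, prompts, and output handling are my own illustrative choices, not what the original ChatGPT session did under the hood.

```python
# Sketch: generate one image and three facts per state via the OpenAI API.
# Assumes OPENAI_API_KEY is set in the environment. The model names and
# prompts are illustrative, not what the original ChatGPT session used.
from openai import OpenAI

client = OpenAI()
states = ["Alabama", "Alaska", "Arizona"]  # ...and the other 47

for state in states:
    image = client.images.generate(
        model="dall-e-3",
        prompt=f"A single evocative image representing the US state of {state}",
        size="1024x1024",
        n=1,
    )
    facts = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"List three intriguing facts about {state}, one sentence each.",
        }],
    )
    print(state, image.data[0].url)
    print(facts.choices[0].message.content)
```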

ChatGPT placed the Golden Gate Bridge in Canada, put Lady Liberty in both Manhattan and the Midwest, and generated two Empire State Buildings. Despite those abstract outcomes, the overall results were quite impressive.

The portraits DALL-E 3 produced for each state were, in short, incredibly bizarre.

On the other hand, the factual details provided were generally accurate. As someone well-versed in US geography and history, I noticed only a few glaring inaccuracies in the facts generated by ChatGPT. However, I did not conduct independent fact-checking and deemed the results satisfactory based on a cursory review.

But what if we want a more in-depth check of the accuracy of those 150 generated facts? That seems like a job tailor-made for an AI.

Methodology

However, I had doubts about asking GPT-4, the OpenAI large language model (LLM) behind ChatGPT Plus, to vouch for its own fact statements. It felt like asking high school students to write a research paper without proper citations and then trusting them to correct their own inaccuracies after the fact. Starting from potentially erroneous information and then relying on the same source for corrections seemed inherently flawed to me.

To check the facts further, I enlisted AI models built on different LLMs. Google’s Bard and Anthropic’s Claude each run on their own models. Bing’s Copilot uses GPT-4, the same LLM as ChatGPT, but I examined its performance anyway for thoroughness.

As you’ll see, Bard provided insightful feedback, and I then ran its responses back through ChatGPT in a round-robin challenge, which made for an intriguing experiment.
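
Bard had no public API at the time, so here is what that round-robin pattern looks like as a minimal sketch using two vendors that do offer SDKs, OpenAI and Anthropic; the model names, prompts, and sample fact are illustrative assumptions.

```python
# Sketch: round-robin fact-checking across two vendors' LLMs. One model
# checks the facts; a different model reviews the first model's verdict.
# Requires OPENAI_API_KEY and ANTHROPIC_API_KEY; model names, prompts,
# and the sample fact are illustrative assumptions.
import anthropic
from openai import OpenAI

openai_client = OpenAI()
claude_client = anthropic.Anthropic()

facts = "1. Alabama: The Saturn V Moon rocket was developed in Huntsville."

def check_facts(text: str) -> str:
    """Ask GPT-4 to flag anything inaccurate in a numbered fact list."""
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Fact-check this list and flag any errors:\n{text}"}],
    )
    return resp.choices[0].message.content

def review_verdict(text: str, verdict: str) -> str:
    """Ask Claude to critique the other model's fact-check verdict."""
    resp = claude_client.messages.create(
        model="claude-2.1",
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": (f"Facts:\n{text}\n\nAnother AI's verdict:\n{verdict}\n\n"
                               "Do you agree? Note any mistakes in the verdict itself.")}],
    )
    return resp.content[0].text

verdict = check_facts(facts)
print(review_verdict(facts, verdict))
```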

Claude the Anthropologist

Anthropic’s Claude runs on the Claude 2 LLM, which also powers Notion’s AI features. I shared a PDF containing the complete set of facts (minus the images) with Claude. Its feedback endorsed the overall accuracy of the list, though it criticized the lack of nuance forced by the fixed maximum length of each fact. In short, Claude offered constructive criticism while acknowledging that the facts were generally appropriate.
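
I did this step through the claude.ai web interface, but for anyone who prefers scripting it, here is a minimal API-based equivalent; the file name, model identifier, and prompt are illustrative assumptions.

```python
# Sketch: extract the fact list from a local PDF and ask Claude to check it.
# I used the claude.ai web interface; this is an API-based equivalent. The
# file name, model identifier, and prompt are illustrative assumptions.
import anthropic
from pypdf import PdfReader

# Pull the plain text out of the PDF (the images are ignored, as in my test).
reader = PdfReader("state_facts.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-2.1",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Fact-check each statement below and flag any errors:\n\n{text}",
    }],
)
print(response.content[0].text)
```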

Nopilot or Navigator?

Next, we turn to Microsoft’s Copilot, formerly known as Bing Chat AI. Pasting in the text of all 50 states’ facts proved challenging because Copilot limits prompts to 2,000 characters. Even after I tried to walk it through the fact-checking process, Copilot simply parroted back the information it was supposed to validate, a peculiar outcome given that it runs on the same LLM as ChatGPT. That divergence in performance between the two prompted me to move on to Bard.
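
One workaround for a limit like that is to split the fact list into prompt-sized pieces on line boundaries, so no fact gets cut in half. Here is a minimal sketch; the 2,000-character figure matches the cap I ran into, and the file name is illustrative.

```python
# Sketch: split a long fact list into pieces that fit a 2,000-character
# prompt cap, breaking only on line boundaries so no fact is cut in half.
# (A single line longer than the cap still comes through as its own chunk.)
def chunk_lines(text: str, limit: int = 2000) -> list[str]:
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

facts_text = open("state_facts.txt").read()  # illustrative file name
for i, chunk in enumerate(chunk_lines(facts_text), start=1):
    print(f"--- paste #{i} ({len(chunk)} chars) ---")
    print(chunk)
```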

Bard

Google recently introduced its Gemini LLM, but since I didn’t yet have access to Gemini, I ran my tests against the Google PaLM 2 model. Even so, Bard’s performance towered over Claude and Copilot.

Bard demonstrated impressive fact-checking capabilities, though it occasionally faltered in its assessments; like every AI system, it is fallible. Despite some inaccuracies of its own, its thorough evaluation was commendable, with room for improvement.

Recommendations and Warnings

Verify facts meticulously before you finalize any document or submission. As these evaluations show, results that look promising can be partially or entirely wrong, which underscores the importance of fact-checking.

Cross-checking the AI models against one another yielded intriguing insights into both the capabilities and the limitations of these systems. Bard excelled in many respects, yet errors remained, reaffirming that fallibility is not exclusive to humans; it extends to AI as well.

In conclusion, fact-checking with AI models produced a blend of accuracy and error, emphasizing the need for critical evaluation and verification in any informational task.
