Large Language Models (LLMs) represent the latest advance in Artificial Intelligence (AI). They employ deep learning techniques to model human writing and handle a wide range of Natural Language Processing (NLP) and Natural Language Generation (NLG) tasks. Trained on vast amounts of textual data, these models can answer questions, summarize text, translate between languages, transform text, and complete code.
A recent study homes in on the detection of hallucinations in grounded generation tasks, focusing on decoder-only transformer language models. Hallucination detection asks whether the generated text remains faithful to the input prompt or contains erroneous information.
Researchers from Microsoft and Columbia University have developed probes that anticipate hallucinatory behavior in transformer language models during in-context generation tasks. Trained on the model’s internal states, these probes aim to predict when the model will produce inaccurate information while generating contextually grounded content. Training and evaluating such probes requires a dataset annotated for both synthetic and naturally occurring hallucinations.
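To make the probing idea concrete, here is a minimal sketch of a linear probe trained on pre-extracted hidden states. The layer choice, feature extraction, label source, and hyperparameters are illustrative assumptions, not the study’s setup.

```python
# Minimal sketch of a hallucination-detection probe: a linear classifier over
# pre-extracted decoder hidden states. The layer choice, feature extraction,
# labels, and hyperparameters are assumptions, not the study's configuration.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps one hidden-state vector to a hallucination logit."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, hidden_dim), e.g. one vector per generated span
        return self.classifier(hidden_states).squeeze(-1)

def train_probe(features: torch.Tensor, labels: torch.Tensor,
                epochs: int = 20, lr: float = 1e-3) -> LinearProbe:
    """features: (N, hidden_dim) hidden states; labels: (N,) with 1 = hallucinated."""
    probe = LinearProbe(features.shape[-1])
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(features), labels.float())
        loss.backward()
        optimizer.step()
    return probe
```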
The study found that probes trained to detect artificially induced hallucinations struggle to identify naturally occurring ones, indicating that probes trained on synthetic data may not generalize to real-world outputs. The team attributes this gap to distribution properties and task-specific factors that shape how the hallucination signal is encoded in the model’s hidden states.
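This generalization gap can be illustrated with a simple cross-distribution evaluation: fit one probe on synthetically perturbed examples and another on naturally sampled ones, then score both on held-out natural hallucinations. The sketch below assumes hidden-state features and binary labels are already available; the protocol is illustrative, not the study’s.

```python
# Illustrative synthetic-to-natural transfer check; data arrays and splits are
# placeholders, not the study's evaluation protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def generalization_gap(synth_X: np.ndarray, synth_y: np.ndarray,
                       nat_X: np.ndarray, nat_y: np.ndarray,
                       test_frac: float = 0.3) -> tuple[float, float]:
    """X arrays: (N, hidden_dim) hidden-state features; y arrays: 1 = hallucinated."""
    split = int(len(nat_y) * (1 - test_frac))
    test_X, test_y = nat_X[split:], nat_y[split:]

    probe_synth = LogisticRegression(max_iter=1000).fit(synth_X, synth_y)
    probe_nat = LogisticRegression(max_iter=1000).fit(nat_X[:split], nat_y[:split])

    auc_synth = roc_auc_score(test_y, probe_synth.predict_proba(test_X)[:, 1])
    auc_nat = roc_auc_score(test_y, probe_nat.predict_proba(test_X)[:, 1])
    return auc_synth, auc_nat  # a large gap means the synthetic probe transfers poorly
```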
The research also examines how salient intrinsic hallucinations (which contradict the provided source) and extrinsic hallucinations (which add unsupported information) are across tasks, types of hidden states, and layers of the transformer model. Two techniques are used to collect hallucinations: sampling responses from an LLM conditioned on the inputs, and deliberately introducing inconsistencies into reference inputs or outputs.
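As a toy illustration of the second technique, a reference output can be corrupted by swapping a grounded entity for one the source does not support. The heuristic below is a generic perturbation example, not the authors’ annotation pipeline.

```python
# Toy synthetic-hallucination generator: swap a grounded entity in a reference
# output for an unsupported one. A generic heuristic, not the authors' pipeline.
import random

def perturb_reference(reference: str, entities: list[str]) -> str:
    """Replace one entity mentioned in the reference with a different entity,
    producing a synthetic (intrinsic-style) hallucination."""
    mentioned = [e for e in entities if e in reference]
    if not mentioned or len(entities) < 2:
        return reference  # nothing safe to corrupt
    original = random.choice(mentioned)
    replacement = random.choice([e for e in entities if e != original])
    return reference.replace(original, replacement, 1)

# The corrupted output now contradicts the grounding document:
print(perturb_reference("The company opened its first office in Paris.",
                        ["Paris", "Berlin", "Madrid"]))
```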
Although the perturbation-based method reportedly yields a higher rate of hallucination annotations from human evaluators, synthetic examples are considered less valuable because they diverge from the test distribution. The team’s primary contributions include a dataset of more than 15,000 utterances annotated for hallucinations in both natural and synthetic outputs, covering three grounded generation tasks.
Moreover, the study introduces three probe architectures intended to improve the efficiency and accuracy of hallucination detection over existing methods. It also investigates factors that influence probe accuracy, including the type of hallucination (intrinsic vs. extrinsic), model size, and which hidden-state components are probed.
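The article does not describe the three architectures in detail. As one hypothetical example of how a probe can go beyond a single hidden state, the sketch below attention-pools the hidden states of every token in a generated span before classifying it; this design is an illustrative assumption, not the study’s method.

```python
# Hypothetical probe variant: attention-pool a span's token hidden states and
# classify the pooled vector. Shown only to make the idea concrete.
import torch
import torch.nn as nn

class AttentionPoolingProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)      # per-token attention score
        self.classifier = nn.Linear(hidden_dim, 1)  # span-level hallucination logit

    def forward(self, span_states: torch.Tensor) -> torch.Tensor:
        # span_states: (batch, span_len, hidden_dim) hidden states of one span
        weights = torch.softmax(self.scorer(span_states), dim=1)  # (batch, span_len, 1)
        pooled = (weights * span_states).sum(dim=1)               # (batch, hidden_dim)
        return self.classifier(pooled).squeeze(-1)                # (batch,) logits
```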