Authors:
(1) Goran Muric, InferLink Corporation, Los Angeles, California ([email protected]);
(2) Ben Delay, InferLink Corporation, Los Angeles, California ([email protected]);
(3) Steven Minton, InferLink Corporation, Los Angeles, California ([email protected]).
2.3 Model interpretability
The challenge of interpreting the complex decision processes of LLMs has hindered their application in critical areas like medicine, where there are significant concerns about regulation (Goodman and Flaxman, 2017) and safety (Amodei et al., 2016). Furthermore, this difficulty in understanding the inner workings of LLMs and similar neural network models has restricted their use in domains like science and data analysis (Kasneci et al., 2023). In such fields, the primary objective is often to derive a reliable interpretation rather than merely to deploy an LLM (Singh et al., 2024).
The expression of uncertainty in language models is crucial for reliable LLM utilization, yet it remains a challenging area due to inherent overconfidence in model responses. Xiong et al. (2023) and Zhou et al. (2024) both highlight the overconfidence issue in LLMs. Xiong et al. question whether LLMs can express their uncertainty, observing a tendency in LLMs to mimic human patterns of expressing confidence (Xiong et al., 2023). Similarly, Zhou et al. note that while LLMs can be prompted to express confidence levels, they remain generally overconfident and unable to convey uncertainty effectively, even when providing incorrect responses (Zhou et al., 2024). Ye and Durrett (2022) add that even when LLMs generate explanations, these may not accurately reflect the model’s predictions nor be factually grounded in the input, particularly in tasks requiring extractive explanations (Ye and Durrett, 2022). However, all of the studies mentioned above note that these flawed explanations can still serve a purpose, offering a means to verify LLM predictions post hoc.
Feature attribution methods, used beyond the LLM realm in many deep-learning applications, are also worth mentioning. Feature attributions in machine learning assign a relevance score to each input feature, reflecting its impact on the model’s output, and thereby help explain how and why a model arrives at certain decisions or predictions.
The approaches developed by Lundberg and Lee (2017) and Sundararajan et al. (2017) both address this topic but offer distinct methodologies and theoretical foundations. Lundberg and Lee (2017) introduced SHAP (SHapley Additive exPlanations), a unified framework for interpreting predictions that assigns an importance value to each feature for a specific prediction, leveraging the concept of Shapley values from cooperative game theory. In contrast, Sundararajan et al. (2017) developed Integrated Gradients, which attributes the predictions of deep networks to their input features. Unlike SHAP, which uses Shapley values, Integrated Gradients relies on integrating gradients along the path from a chosen baseline to the actual input. Complementing these approaches, Ribeiro et al. (2016) proposed LIME (Local Interpretable Model-agnostic Explanations), which aims to make the predictions of any classifier understandable and reliable by learning an interpretable model localized around the prediction (Ribeiro et al., 2016).
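To make the mechanism concrete, the following is a minimal NumPy sketch of the Integrated Gradients idea described above, applied to a toy logistic model rather than a deep network; the model `f`, its weights, the zero baseline, and the step count are illustrative assumptions and not part of any of the cited works.

```python
import numpy as np

# A toy differentiable "model": f(x) = sigmoid(w . x + b).
# In practice this would be a deep network, with gradients supplied by an
# autodiff framework rather than the closed form below (illustrative only).
w = np.array([0.8, -1.5, 2.0])
b = 0.1

def f(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def grad_f(x):
    # Analytic gradient of the toy sigmoid model w.r.t. its inputs.
    s = f(x)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, steps=50):
    """Approximate the path integral of gradients from `baseline` to `x`
    with a Riemann sum, in the spirit of Sundararajan et al. (2017)."""
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    avg_grad = grads.mean(axis=0)
    return (x - baseline) * avg_grad  # one attribution score per input feature

x = np.array([1.0, 0.5, -0.3])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline)

# Completeness: the attributions approximately sum to f(x) - f(baseline).
print(attributions, attributions.sum(), f(x) - f(baseline))
```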
Another popular method for understanding neural-network representations is probing. Conneau et al. (2018) initially introduced multiple probing tasks designed to capture simple linguistic features of sentences, setting a foundation for understanding how neural networks encode linguistic properties (Conneau et al., 2018).
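A minimal sketch of such a probing setup follows: a deliberately simple linear classifier is trained on frozen representations to predict a linguistic property, so that high accuracy can be credited to information already encoded in the representations rather than to the probe itself. The random "embeddings" and binary labels below are placeholders standing in for real encoder outputs and annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in an actual probing study these rows would be frozen
# sentence embeddings from the model under analysis, and `labels` a linguistic
# property of each sentence (e.g., tense or sentence-length bucket).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))   # hypothetical 768-d representations
labels = rng.integers(0, 2, size=1000)      # hypothetical binary property

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)

# The probe is a plain linear classifier trained on top of the frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```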
Clark et al. (2019) focused primarily on the behavior of attention heads within transformers. They observed that these heads often attend broadly across entire sentences and that attention patterns within the same layer tend to exhibit similar behaviors. Crucially, their research links specific attention heads to traditional linguistic concepts such as syntax and coreference, suggesting a direct relationship between the model’s attention mechanisms and linguistic structures (Clark et al., 2019), although there is an ongoing debate about the explanatory power of attention in neural networks (Bibal et al., 2022). Unlike Clark et al., who examine what the model attends to, Morris et al. (2023) explore how information is preserved in and can be retrieved from embeddings, offering insights into the reversibility and fidelity of the encoding process. Their method involves a multi-step process that iteratively corrects and re-embeds text, demonstrating the ability to recover most of the original text inputs exactly. Belrose et al. (2023) introduced a technique called causal basis extraction, which aims to identify influential features within neural networks (Belrose et al., 2023). This method stands out by focusing on the causality of network decisions.
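The kind of attention-head inspection performed by Clark et al. can be sketched with the Hugging Face Transformers library, which exposes per-layer, per-head attention weights; the choice of `bert-base-uncased`, the example sentence, and the layer/head indices below are arbitrary illustrations, and this is not the analysis code released by the cited authors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("The lawyer questioned the witness because she was late.",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer,
# each shaped (batch, heads, seq_len, seq_len).
attn = torch.stack(out.attentions)  # (layers, batch, heads, seq, seq)
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])

# For one arbitrarily chosen layer and head, show the token each position
# attends to most strongly.
layer, head = 5, 3
top = attn[layer, 0, head].argmax(dim=-1)
for i, t in enumerate(tokens):
    print(f"{t:>12} -> {tokens[top[i]]}")
```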
In summary, while chain-of-thought prompting can generate errors during inference, requiring complex corrective approaches, in-context learning techniques also face challenges in prompt optimization and efficient retrieval. Furthermore, interpreting large language models remains problematic, exacerbated by models’ tendency to exhibit overconfidence and provide unreliable or unverifiable explanations.