
[INTERSPEECH 2024 Series #2] NL-ITI: Enhancing LLM Truthfulness Through Internal Modifications

By Jakub Hoscilowicz, Samsung R&D Institute Poland
By Adam Wiacek, Samsung R&D Institute Poland
By Jan Chojnacki, Samsung R&D Institute Poland
By Adam Cieslak, Samsung R&D Institute Poland
By Artur Janicki, Samsung R&D Institute Poland

Interspeech is the world’s leading conference on the science and technology of speech recognition, speech synthesis, speaker recognition and speech and language processing.

The conference plays a crucial role in setting new technology trends and standards, as well as providing direction for future research.

In this blog series, we are introducing some of our research papers presented at INTERSPEECH 2024. Here is the list:

#1. Relational Proxy Loss for Audio-Text based Keyword Spotting (Samsung Research)

#2. NL-ITI: Probing optimization for improvement of LLM intervention method (Samsung R&D Institute Poland)

#3. High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (Samsung Research)

#4. Speaker personalization for automatic speech recognition using Weight-Decomposed Low-Rank Adaptation (Samsung R&D Institute India-Bangalore)

#5. Speech Boosting: developing an efficient on-device live speech enhancement (Samsung Research)

#6. SummaryMixing makes speech technologies faster and cheaper (Samsung AI Center - Cambridge)

#7. A Unified Approach to Multilingual Automatic Speech Recognition with Improved Language Identification for Indic Languages (Samsung R&D Institute India-Bangalore)

Introduction



As AI becomes increasingly embedded in our daily lives, ensuring that these systems behave safely and align with our values and needs is crucial. This concept, known as AI Alignment, goes beyond technical reliability—it's about controlling how AI models like Large Language Models (LLMs) behave, ensuring they act in ways that are consistent with our expectations. For brands like Samsung, this means making sure that AI-driven products provide accurate information and uphold the brand's reputation by handling sensitive issues with care. AI Safety is essential in maintaining trust between users and technology, whether it's through customer service chatbots, virtual assistants like Bixby, or other AI applications.

AI Alignment isn’t just about truthfulness. There are broader applications where we might want to influence or "bias" LLMs to behave in specific ways. For example, an LLM could be guided to adopt a kinder tone in customer interactions or even be customized to exhibit a particular personality type that suits a brand's identity. Techniques that allow for the control of AI models are becoming increasingly important as we seek to deploy AI systems that are not only effective but also aligned with human values and corporate goals.

Traditional methods like fine-tuning—where models are adjusted with additional training data—often fall short in achieving these nuanced objectives. They may not fully address how the model internally processes information, leading to inconsistent or undesirable outputs. This is where representation engineering, including techniques like the Non-Linear Inference Time Intervention (NL-ITI), comes into play. NL-ITI offers a way to modify the internal workings of LLMs, allowing us to guide their behavior to make them more truthful.

Method



The NL-ITI method enhances the Inference-Time Intervention (ITI) paradigm, which was designed to improve LLM outputs by modifying internal model activations during inference. The ITI method unfolds in two phases. Initially, linear probing models are trained on the representations returned by the attention heads for a given probing training set. The probing operation, as mathematically described in the equation below, uses labeled data (e.g., question-answer pairs) to identify attention heads that store truthful information. The probing model is defined as:
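In the ITI formulation, this probe is a logistic model applied to the head activation:

p_θ(x) = sigmoid(⟨θ, x⟩)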

where θ represents the set of trainable parameters of the linear probe and x is the vector representation corresponding to a token at a specific attention head and layer. The accuracy of this probing model is assumed to correlate with the amount of desired knowledge, such as truthfulness, encoded in the attention head.
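As an illustrative sketch of this probing phase (the array layouts and names below are assumptions, not the authors' implementation), one probe per attention head can be fit on activations collected from labeled question-answer pairs, and heads can then be ranked by held-out accuracy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_heads_by_probe_accuracy(head_activations, labels):
    """Fit one linear probe per attention head and rank heads by accuracy.

    head_activations: array (n_examples, n_layers, n_heads, head_dim) with
        activations extracted from the LLM for each labeled example
        (illustrative layout).
    labels: array (n_examples,) with 1 = truthful answer, 0 = untruthful.
    Returns an (n_layers, n_heads) array of held-out probe accuracies.
    """
    n_examples, n_layers, n_heads, _ = head_activations.shape
    split = int(0.8 * n_examples)  # assumes examples are already shuffled
    accuracies = np.zeros((n_layers, n_heads))
    for layer in range(n_layers):
        for head in range(n_heads):
            X = head_activations[:, layer, head, :]
            probe = LogisticRegression(max_iter=1000)
            probe.fit(X[:split], labels[:split])
            accuracies[layer, head] = probe.score(X[split:], labels[split:])
    return accuracies
```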

During the second phase, the ITI framework intervenes at inference time by shifting the activations of the selected attention heads in the truthful direction. The intervention is defined as:
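Written compactly, for each selected attention head the shift adds the scaled biasing direction to the activation:

x ← x + α · θ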

where x is the embedding corresponding to the l-th token, θ represents the intervention biasing vector, and α is the intervention strength.
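A minimal sketch of this inference-time shift, assuming the per-head activations of one layer are available as a tensor of shape (batch, sequence, heads, head_dim); the function and argument names are illustrative, not the released implementation:

```python
import torch

def apply_intervention(head_outputs, bias_directions, alpha, selected_heads):
    """Shift selected attention-head activations in the truthful direction.

    head_outputs: tensor (batch, seq_len, n_heads, head_dim) from one layer.
    bias_directions: dict mapping head index -> direction tensor (head_dim,).
    alpha: intervention strength.
    selected_heads: head indices chosen by probing accuracy.
    """
    shifted = head_outputs.clone()
    for head in selected_heads:
        theta = bias_directions[head].to(head_outputs.dtype)
        # x <- x + alpha * theta for every token position of this head.
        shifted[:, :, head, :] += alpha * theta
    return shifted
```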

The ITI approach identifies specific attention heads that are hypothesized to store relevant information, such as truthfulness, and intervenes by shifting their activations. However, the original ITI method relies on a linear probing model, which may not fully capture the complexity of the concept representations within the model. NL-ITI introduces two key improvements to address these limitations. First, it replaces the linear probing model with a non-linear probe, which has a higher capacity for capturing complex information. This change allows for a more appropriate choice of attention heads (Figure 1), which is crucial for directing the model towards generating more truthful and accurate responses.

Figure 1. Probing accuracy for each attention head of the LLM on TruthfulQA dataset for linear probing (ITI) – bottom, and non-linear probing (NL-ITI) – top. Accuracy results are ‘smoothed’ between neighboring attention heads (lower standard deviation).
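One way to realize such a non-linear probe is a small multilayer perceptron in place of the linear classifier; the single-hidden-layer architecture below is an illustrative assumption rather than the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class NonLinearProbe(nn.Module):
    """Small MLP probe over a single attention head's activation."""

    def __init__(self, head_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(head_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        # Probability that the activation encodes a truthful answer.
        return torch.sigmoid(self.net(x)).squeeze(-1)
```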

Second, NL-ITI extends the intervention biasing vector to consider an average of the vector representations of multiple tokens, rather than focusing solely on the last token of the question-answer pair. This multi-token approach captures a more comprehensive context, enabling more effective interventions that enhance the truthfulness of the model’s outputs. Among the many experiments we performed, these two modifications resulted in the most significant quality improvements.
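As a rough sketch of the multi-token idea, the biasing direction for one head can be estimated from activations averaged over the last few tokens of each example instead of the final token alone; the mass-mean-style estimate and the variable names below are assumptions for illustration:

```python
import numpy as np

def multi_token_bias_direction(token_activations, labels, num_tokens=4):
    """Estimate a biasing direction for one attention head.

    token_activations: array (n_examples, seq_len, head_dim) with per-token
        activations of a single head (illustrative layout).
    labels: array (n_examples,) with 1 = truthful, 0 = untruthful.
    num_tokens: how many trailing tokens to average per example.
    """
    # Average the representations of the last `num_tokens` tokens per example.
    pooled = token_activations[:, -num_tokens:, :].mean(axis=1)
    # Direction pointing from untruthful towards truthful examples.
    direction = pooled[labels == 1].mean(axis=0) - pooled[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)
```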

Results



The results of implementing NL-ITI are compelling and demonstrate its potential for improving LLM behavior in practical, brand-sensitive applications. When tested against popular benchmarks, including TruthfulQA, OpenBookQA, and MMLU, NL-ITI consistently delivered better performance than the original ITI method. For example, on the TruthfulQA benchmark, NL-ITI achieved a significant 14% improvement in the MC1 score (a multiple-choice accuracy metric) over the baseline ITI approach (Table 1, Figure 2).

Table 1. Comparison between the baseline (LLaMA-2-chat-7B model, TruthfulQA dataset), ITI, and NL-ITI. All compared values were obtained with few-shot prompting.

Figure 2. Correlation between MC1 and KL divergence. The results were collected for ITI and NL-ITI using different hyperparameter sets (α, number of heads to intervene on) for the TruthfulQA and OpenBookQA datasets. On each benchmark, the baseline LLaMA-2-7B performance is shown.

Moreover, NL-ITI’s ability to generalize to out-of-distribution benchmarks was tested. The method outperformed ITI, demonstrating better adaptability and robustness (Table 2). These results underscore NL-ITI’s effectiveness not only in improving truthfulness but also in enhancing the overall reliability of LLMs across diverse scenarios.

Table 2. Comparison of the generalization of ITI and NL-ITI on out-of-distribution benchmarks: AI2’s Reasoning Challenge (ARC), Massive Multitask Language Understanding (MMLU), and OpenBookQA.

For a brand like Samsung, these improvements mean that AI-driven customer interactions can be both more accurate and brand-safe, whether through a virtual assistant like Bixby or in other sensitive applications. By ensuring that AI systems provide consistent, truthful, and aligned responses, NL-ITI helps protect the brand’s reputation and enhances user trust, making it a valuable tool in the ongoing development of AI technologies.

Link to the paper



https://arxiv.org/pdf/2403.18680