Breaking the Jargons #Issue 9
Your weekly Digest of what's happening in the field of Responsible AI.
This week was marred by the discovery of a critical vulnerability (CVE-2024-3094) in XZ Utils, the XZ format compression utilities included in most Linux distributions. Red Hat warns that this vulnerability may "enable a malicious actor to break sshd authentication and gain unauthorized access to the entire system remotely." This security flaw serves as a stark reminder that no software system is impervious to vulnerabilities, and machine learning applications, especially in this era of powerful Gen AI (Generative AI) models, are no exception. Therefore it is important that for mission-critical or otherwise high-stakes deployments of AI, systems should be thoroughly tested and audited. Here is an excerpt from my book that discusses ML security in detail:
As technology and ML play larger roles in human lives, lack of rigor and responsibility in their design, implementation, and deployment will have ever-increasing consequences. Those designing technologies for security or other high-stakes applications have an especially serious obligation to be realistic about today’s ML capabilities and apply process and technology controls to ensure adequate real-world performance.
Machine Learning for High-Risk Applications - Chapter 5, Machine Learning Security
This week's edition of "Breaking the Jargons" contains research papers highlighting the emergence of zero-click worms targeting GenAI-powered applications to innovative frameworks for surfacing health equity harms in large language models. There is also a discussion about a novel browser extension for combating misinformation, exploring human-centered design in explainable AI interfaces, and comparing defensive strategies against attacks on large language models. Additionally, you’ll also enjoy the articles on AI innovation and ethics, the red team philosophy at NVIDIA, and the imperative of openness in AI technology. So let’s get to it …
🔬Research Rundown
Favorite Research Papers I read last week:1
1. Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications
Are your GenAI applications ComPromptMized ☠️?
A new malicious worm called "Morris II" has been designed to infiltrate and spread through interconnected systems powered by GenAI AI models. This worm exploits the AI models by using specially crafted "adversarial self-replicating prompts" that trick the models into replicating the malicious prompts. The study shows how attackers can craft and encode adversarial self-replicating prompts into inputs, such as images and text. These prompts, when processed by the model, cause the model to replicate the prompt as output , perform malicious activities, and propagate the prompt to other agents in the ecosystem. in short, the prompt tricks the model into:
a) Reproducing the malicious prompt (replication)
b) Executing the malicious intended actions (payload)
c) Sending that malicious output to other parts of the ecosystem (propagation)
The researchers demonstrated Morris II against GenAI-powered email assistants in different scenarios - spam attacks, data exfiltration, black-box access, and white-box access. They tested it using text and image inputs across three models - GeminiPro, ChatGPT 4.0, and LLaVA.
Two main classes of GenAI applications are highlighted as potentially vulnerable:
Those whose execution flow depends on the GenAI output
Those using retrieval-augmented generation (RAG) to enrich GenAI queries
Countermeasures
• For replication, the strategy involves rephrasing outputs to ensure they don't mimic the input, potentially implemented directly in the GenAI model or server. Additionally, safeguarding against techniques that attempt to "jailbreak" or bypass restrictions can further protect against replication by preventing direct copying of input to output.
• For threats exploiting retrieval-augmented generation (RAG), using a non-active RAG that doesn't update with new data can prevent worm spread, though it may compromise the system's adaptability and effectiveness.
2. A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
Assessing large language models (LLMs), particularly for health equity is hard. To reliably evaluate potential biases and equity-related harms in long-form answers generated by large language models (LLMs) for medical questions, researchers have a developed a comprehensive three-part human assessment framework that incorporates a variety of methods:
A multifactorial framework for human assessment of LLM-generated medical answers to identify biases that could lead to equity-related harms. This framework is intended to identify biases across six different dimensions.
EquityMedQA - A collection of seven newly released datasets comprising manually-curated and LLM-generated questions enriched with adversarial queries to test for biases.
An empirical case study applying this assessment framework to Med-PaLM 2 (a large language model from Google Research, designed for the medical domain), one of the largest human evaluation studies in this area
The researchers use these datasets and assessment rubrics to evaluate Med-PaLM 2, involving 806 raters from three distinct groups: physicians, health equity experts, and general consumers. Together, these raters provided over 17,000 ratings.
Interesting observations from the study include the discovery of potential biases in datasets generated by LLMs, which differed from those in manually curated sets. Additionally, the responses generated by Med-PaLM 2 were often preferred over those from its predecessor, Med-PaLM, and actual physicians.
3. A Browser Extension for in-place Signaling and Assessment of Misinformation
The paper explores a concept aimed at shifting power from platforms back to individual users, particularly in the context of misinformation. It proposes a novel approach similar to X/Twitter's Community Notes but with a broader application: a browser extension named Trustnet. This extension allows users to assess the accuracy of content across the web and see assessments from their trusted sources directly in their browsers. Unlike Community Notes, which is confined to a single platform, Trustnet operates in a platform-agnostic manner, working without support from social platforms or websites. Through a two-week user study, the paper sheds light on user engagement with Trustnet, including the types of content they evaluated and their assessment rationales.
4. How Human-Centered Explainable AI Interfaces Are Designed and Evaluated: A Systematic Survey
Do the explanations offered by XAI tools meet user needs?
This study attempts to look into this prevailing issue by focusing on Explainable Interfaces (EIs), that aim to enhance the usability, interpretability, and efficacy of XAI systems through targeted improvements in user interface (UI) and user experience (UX) design. The paper conducts a systematic review of 53 studies to outline the current trends in how humans interact with XAI and suggest future directions for designing and developing more user-centered, explainable interfaces.
5. Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
This paper comprehensively surveys the various attacks targeting LLMs, such as adversarial attacks, data poisoning, or misusing private data. The paper discusses how these attacks work, their possible effects, and ways to defend against them. It also reviews which defense methods are effective, how resilient LLMs are to attacks, and what this means for the safety and trustworthiness of these models. Additionally, the paper introduces a new way of categorizing attacks on LLMs to help researchers assess the field and suggests areas for future study to improve security.
🔖 Curated Blogs and Articles
1. AI Innovation and Ethics with AI Safety and Alignment
This is a nice blog post by Fiddler AI on making AI safer and more aligned with human values and safety protocols. It discusses five Key Areas: Scalable Oversight, ensuring AI works well for everyone, robustness, interpretability, and governance.
2. NVIDIA AI Red Team: An Introduction
The blog introduces NVIDIA's red team philosophy and framework for assessing and mitigating risks associated with machine learning (ML) systems. It nicely outlines their cross-functional approach, combining security experts and data scientists, a methodology covering technical vulnerabilities, model-specific attacks, and potential abuse scenarios,
3. Introducing the Columbia Convening on Openness and AI
What does openness mean for AI, and how can it best enable trustworthy and beneficial AI? Mozilla and the Columbia Institute of Global Politics recently gathered experts from startups, non-profit labs, and groups focused on society's good to discuss making AI technology more open and accessible.
4. Models all the way down
Investigating training sets is an essential avenue to understanding how generative AI models work; the ways they see and re-create the world.
This visual story offers an in-depth examination of LAION-5B, the sole open-source foundational dataset available, shedding light on how it was built and where its images originate. This investigation not only reveals the biases embedded in models but also the systemic biases in the larger AI ecosystem.
📺 Webinar Watch
Differential privacy: from theory to Practice and back by Professor Salil Vadhan of Computer Science and Applied Mathematics, Harvard University
💻 Code Corner: Featured Repositories & Resources
Yet Another Applied LLM Benchmark
Nicholas Carlini recently released a new benchmark for large language models via GitHub. It's a collection of nearly 100 tests that he has extracted from his actual conversation history with various LLMs.
What motivated him when there are already a plethora of benchmarks? In his own words:
“I think it's kind of crazy that we're still mostly evaluating language models by whether or not they can correctly answer some high school science questions. Now that people actually use models to do real things, we probably should have at least a few benchmarks that actually test for real uses.”
💡Tweet Spotlight
Reactions on AI Replacing Human Research Participants?
I hope this edition of “Breaking the Jargons” provides valuable insights and stimulating reads. Until next time, keep reading.
Image Credits: All images featured in this newsletter have been sourced directly from the research papers mentioned in this edition.