In this presentation, the core AI Red Team at Meta will take you on a journey through the story of Red Teaming the Llama 3 Large Language Model. This talk is perfect for anyone eager to delve into the complexity of advanced model Red Teaming and safety, and to learn how to perform their own research to find new attacks. We’ll begin by exploring what AI Red Teaming is truly about, before turning to Meta’s process and approaches to the topic. The team will detail our methodology for discovering new risks within complex AI capabilities, how emergent capabilities may breed emergent risks, what types of attacks we look to perform across different model capabilities, and how and why those attacks work. Moreover, we’ll explore which lessons from decades of security expertise can – and cannot – be applied as we venture into a new era of AI trust and safety. The team will then move on to how we used automation to scale attacks, our novel approach to multi-turn adversarial AI agents, and the systems we built to benchmark safety across a set of different high-risk areas. We also plan to discuss advanced cyber-attacks (both human and automated), Meta’s open benchmark CyberSecEvals, and Red Teaming for the national security threats presented by state-of-the-art models. For each of these areas we’ll touch on assessment and measurement challenges, ending with where we see gaps in the AI Red Teaming industry and where AI Safety is heading at a rapid pace.