Interpretability Meets Fairness: Where we at, where we going


Interpretability, accountability, and control are among the biggest challenges posed by modern large-scale models. Although recent technological advances (e.g., scale, multiple training sources, generation via prompting, and in-context learning) work against transparency, a large body of research has explored how, when, and which specific (undesired) behaviors can be attributed to models. This talk explores recent breakthroughs in interpretability techniques designed to ensure fairness and safety in language models. We will discuss methods for identifying biases and vulnerabilities, enabling us to create more equitable and trustworthy AI systems. The presentation will also address the challenges and trade-offs associated with interpretability, paving the way for safer and more inclusive models.

Aug 29, 2023 12:00 AM
Data Science @UHH Retreat
Malente, Germany