Advances in information extraction have enabled the automatic construction of large knowledge graphs (KGs) such as DBpedia, YAGO, Wikidata, and the Google Knowledge Graph. Learning rules from KGs is a crucial task for KG completion, cleaning, and curation. This tutorial presents state-of-the-art rule induction methods, recent advances, research opportunities, and open challenges in this area. We put a particular emphasis on the problems of learning exception-enriched and numerical rules from highly biased and incomplete data. Finally, we discuss possible extensions of classical rule induction techniques to account for unstructured resources (e.g., text) alongside the structured ones.
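The core idea behind rule induction over a KG can be sketched in a few lines: mine a candidate rule and score it by how often its body co-occurs with its head among the stored facts. The toy facts, the rule `bornIn(x, y) => livesIn(x, y)`, and the names below are invented for illustration, assuming a simple standard-confidence scoring scheme.

```python
# Toy knowledge graph as a set of (subject, relation, object) triples.
facts = {
    ("alice", "bornIn", "paris"),
    ("alice", "livesIn", "paris"),
    ("bob", "bornIn", "rome"),
    ("bob", "livesIn", "berlin"),   # exception: born in Rome, lives in Berlin
    ("carol", "bornIn", "oslo"),
    ("carol", "livesIn", "oslo"),
}

def confidence(body_rel, head_rel, kb):
    """Standard confidence: fraction of body matches that also satisfy the head."""
    body = {(s, o) for (s, r, o) in kb if r == body_rel}
    head = {(s, o) for (s, r, o) in kb if r == head_rel}
    support = len(body & head)  # rule predictions that actually hold in the KG
    return support / len(body) if body else 0.0

print(confidence("bornIn", "livesIn", facts))  # 2 of 3 body matches hold -> ~0.67
```

Exception-enriched rule learning, one focus of the tutorial, refines exactly such rules by finding conditions (here, something distinguishing "bob") under which the rule should not fire.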
One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. We’ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We’ll present a schematic of which layers data scientists need to be thinking about and working with, and then introduce attendees to the tooling and workflow landscape. In doing so, we’ll present a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.
The operation and maintenance of large-scale production machine learning systems has uncovered new challenges which have required fundamentally different approaches from those used for traditional software. The area of security in MLOps has seen a rise in attention as machine learning infrastructure expands to further critical use cases across industry. In this talk we introduce the conceptual and practical topics around MLSecOps that data science practitioners will be able to adopt or advocate for. We will also provide an intuition on key security challenges that arise in production machine learning systems, as well as best practices and frameworks that can be adopted to help mitigate security risks in ML models, ML pipelines, and ML services. We will walk through a practical example of securing a machine learning model, highlighting the security risks and best practices that can be adopted during the feature engineering, model training, model deployment, and model monitoring stages of the machine learning lifecycle.
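One widely applicable MLSecOps practice of the kind the talk covers is supply-chain integrity checking: verifying a checksum of a serialized model artifact before loading it into a serving pipeline. The sketch below is a minimal illustration using Python's standard library; the file paths, digests, and function names are assumptions, not any specific framework's API.

```python
import hashlib

def sha256_of(path):
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path, expected_digest):
    """Refuse to load a model artifact whose hash differs from the trusted digest."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise RuntimeError(f"model artifact tampered with or corrupted: {actual}")
    return True
```

In practice the trusted digest would be recorded at training time (e.g., in a model registry) and checked at deployment time, so a tampered artifact fails closed rather than being served.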
Natural language processing (NLP) applications such as chat bots, machine translation systems, text summarization systems, and information extraction systems have seen significant performance boosts over the last decade, thanks to accurate methods for representing texts such as large-scale language models (e.g., BERT, GPT-3, RoBERTa). However, social biases such as gender, racial, and ethnic biases have also been identified in text representations produced by these large-scale masked language models. It is problematic to use such biased language models in real-world NLP systems used by millions of users worldwide every day, because social biases encoded in the text representations propagate into those systems and can lead to unfair, discriminatory decisions and responses. In this talk, I will first describe methods developed in the NLP community to detect the types and levels of social biases learnt by large-scale language models. Next, I will present techniques that can be used to mitigate such biases.
Finding metrics that describe performance can unlock valuable insights in the field of Data Science. It can be helpful to visualize the distribution of these metrics and to understand how segments of metrics vary with each other. Segmented analytics, however, requires predefined categories such as age or gender to divide the data, which is a limitation. Vector Search uses semantics to analyze data, and does not require such symbolic tags. In this talk, you will learn how to use Vector Search as a Data Scientist. By means of real YouTube and Twitter data, you’ll see how easy it is to utilize this yourself with the Vector Search engine Weaviate.
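The core mechanic behind vector search can be sketched in a few lines: represent each item as a vector and rank items by similarity to a query vector, with no categories or tags involved. The vectors and document names below are toy values; an engine like Weaviate does the same thing at scale with approximate nearest-neighbor indexes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy "embeddings" of content items, invented for illustration.
docs = {
    "video about cats":   (0.9, 0.1, 0.0),
    "video about dogs":   (0.8, 0.3, 0.1),
    "tweet about stocks": (0.0, 0.1, 0.9),
}

def search(query_vec, k=2):
    """Return the k items most similar to the query - no symbolic tags needed."""
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:k]

print(search((1.0, 0.2, 0.0)))  # ['video about cats', 'video about dogs']
```

Because ranking happens in the embedding space, "segments" emerge from semantic closeness rather than from predefined labels, which is exactly the limitation of segmented analytics the talk contrasts against.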
There is a pressing need for tools and workflows that meet data scientists where they are. This is also a serious business need: how to enable an organization of data scientists, who are not software engineers by training, to build and deploy end-to-end machine learning workflows and applications independently. In this talk, we discuss the problem space and the approach we took to solving it with Metaflow, the open-source framework we developed at Netflix, which now powers hundreds of business-critical ML projects at Netflix and other companies, from bioinformatics and drones to real estate. We wanted to provide the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning. In this talk, you will learn about:

* What to expect from a modern ML infrastructure stack.
* Using tools such as Metaflow to boost the productivity of your data science organization, based on lessons learned from Netflix and many other companies.
* Deployment strategies for a full stack of ML infrastructure that plays nicely with your existing systems and policies.
WSL 2 (Windows Subsystem for Linux 2) is a layer for running Linux binary executables natively on Windows. What is WSL 2? How does it fit within your workflow? What is its value for data science? How do you set up your machine? How do you run your first code? This introductory session aims to answer these questions, introduce you to WSL 2, and get you started by configuring your machine and running your first code.
The term “Digital Twin”, like “Artificial Intelligence”, is used to mean very different things. We define the spectrum of uses of the term, from a digital data twin to a more sophisticated ecosystem of cognitive adaptive twins. We trace the history of digital twins, their roots in agent-based simulation, and how the field is merging with advances in machine learning (ML) and other areas of AI to morph into ‘simulation intelligence’. We describe several examples of digital twins in the transportation, banking, and healthcare sectors. Like ML models, simulation models are also being deployed across the enterprise and co-exist with other software and AI/ML models. A ten-step approach to building and deploying such models will be discussed. The talk will help business leaders understand the benefits of digital twins while providing technology leaders the key skills, capabilities, and tools required to design, build, and deploy agent-based simulations.
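The agent-based simulation style that digital twins grew out of can be illustrated minimally: each agent follows a simple local rule, and system-level behavior emerges from the interactions. The "adoption spreads to neighbors on a line" rule below is invented purely for illustration; real twins model far richer agents (vehicles, customers, patients) and environments.

```python
def step(states):
    """Local rule: an agent adopts (state 1) if it or either neighbor has adopted."""
    n = len(states)
    return [1 if states[i] == 1
            or (i > 0 and states[i - 1] == 1)
            or (i < n - 1 and states[i + 1] == 1)
            else 0
            for i in range(n)]

def simulate(states, steps):
    """Run the agent population forward a number of time steps."""
    for _ in range(steps):
        states = step(states)
    return states

# One "seed" agent in the middle; adoption spreads one neighbor per step.
print(simulate([0, 0, 1, 0, 0], 1))  # [0, 1, 1, 1, 0]
print(simulate([0, 0, 1, 0, 0], 2))  # [1, 1, 1, 1, 1]
```

Even this tiny example shows the defining trait of the approach: the global outcome (how fast adoption saturates) is not programmed anywhere, it emerges from local rules, which is what makes agent-based twins useful for what-if analysis.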
In the field of healthcare, AI can provide solutions as well as be a source of bias and therefore inequity. Bias can creep in via algorithmic processes or be inherent in the underlying data. This talk will introduce the audience to challenges in AI for health equity, with a particular focus on race and ethnicity data. We will explore real-world ethnicity data collected routinely in healthcare settings in the form of electronic health records. We will examine issues with completeness, correctness, and granularity of these data, implications for healthcare AI, and finally highlight opportunities towards “better data, better models, better healthcare”.
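A basic version of the completeness checks the talk discusses can be sketched as follows: count how many records hold an informative value for a field, treating both missing entries and catch-all codes as uninformative. The records and category labels below are invented for illustration; real EHR coding schemes are far more granular.

```python
# Toy electronic health records; values invented for illustration.
records = [
    {"id": 1, "ethnicity": "Asian"},
    {"id": 2, "ethnicity": None},       # missing entirely
    {"id": 3, "ethnicity": "Unknown"},  # recorded but uninformative
    {"id": 4, "ethnicity": "Black"},
]

def completeness(rows, field, missing_values=(None, "", "Unknown")):
    """Fraction of rows where the field holds an informative value."""
    informative = sum(1 for r in rows if r.get(field) not in missing_values)
    return informative / len(rows)

print(completeness(records, "ethnicity"))  # 0.5
```

Note the design choice: a naive non-null check would report 75% completeness here, while treating "Unknown" as uninformative drops it to 50%, a gap that directly affects how representative any downstream model's training data really is.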
In this talk, Nuria will describe the work that she did between March 2020 and April 2022, leading a multi-disciplinary team of 20+ volunteer scientists working very closely with the Presidency of the Valencian Government in Spain in four large areas: human mobility modeling; computational epidemiological models (metapopulation, individual-based, and LSTM-based models); predictive models; and a large-scale online citizen survey, the COVID19impactsurvey (https://covid19impactsurvey.org), with over 720,000 answers worldwide. This survey has enabled us to shed light on the impact that the pandemic has had on people's lives. She will present the results obtained in each of these four areas, including winning the 500K XPRIZE Pandemic Response Challenge and obtaining a best paper award at ECML-PKDD 2021. She will share the lessons learned in this very special initiative of collaboration between civil society at large (through the survey), the scientific community (through the Expert Group), and a public administration (through the Commissioner at the Presidency level).