Keywords: Deploying machine learning systems; Monitoring machine learning systems; Continual delivery
Until recently, Machine Learning has been applied in industry mostly by consulting academics, by data scientists within larger companies, and by a number of dedicated Machine Learning research labs within a few of the world’s most innovative tech companies. Over the last few years we have seen the dramatic rise of companies dedicated to providing Machine Learning software-as-a-service tools, with the aim of democratizing access to the benefits of Machine Learning. All of these efforts have revealed major hurdles to ensuring the continual delivery of good performance from deployed Machine Learning systems. These hurdles range from challenges in MLOps, to fundamental problems with deploying certain algorithms, to the legal and ethical issues that arise when algorithms make decisions for a business.
This workshop invites papers on the challenges of deploying and monitoring ML systems. It encourages submissions on: MLOps for deployed ML systems (such as testing, debugging, and monitoring ML systems and models, and deploying ML at scale); the ethics of deploying ML systems (such as ensuring fairness, trust, and transparency, and providing privacy and security); useful tools and programming languages for deploying ML systems; specific challenges of deploying reinforcement learning, performing continual learning, and providing continual delivery in ML systems; and data challenges for deployed ML systems.
Fri 5:00 a.m. - 5:10 a.m. | Opening remarks (Talk)
Alessandra Tosi · Nathan Korda
Fri 5:10 a.m. - 5:55 a.m. | Deploying Machine Learning Models in a Developing Country (Invited talk)
Successful deployment of ML models tends to result from a good fit between the technology and the context. In this talk I will focus on the African context, which is often treated as synonymous with the developing-world context, though I want to argue there is a difference. I will expound on the opportunities and challenges that this unique context provides, the assumptions made when deploying in such a context, and how well those assumptions fit. Another angle of the talk will be deployment aimed at societal good, which may differ from deployment in a production system. I will also draw insights from some projects I have been engaged in towards this end.
Ernest Mwebaze
Fri 5:55 a.m. - 6:40 a.m. | System-wide Monitoring Architectures with Explanations (Invited talk)
I present a new architecture for detecting and explaining complex system failures. My contribution is a system-wide monitoring architecture, which is composed of introspective, overlapping committees of subsystems. Each subsystem is encapsulated in a "reasonableness" monitor, an adaptable framework that supplements local decisions with commonsense data and reasonableness rules. This framework is dynamic and introspective: it allows each subsystem to defend its decisions in different contexts, both to the committees it participates in and to itself. For reconciling system-wide errors, I developed a comprehensive architecture that I call "Anomaly Detection through Explanations" (ADE). The ADE architecture contributes an explanation synthesizer that produces an argument tree, which in turn can be traced and queried to determine the support for a decision and to construct counterfactual explanations. I have applied this methodology to detect incorrect labels in semi-autonomous vehicle data and to reconcile inconsistencies in simulated anomalous driving scenarios. In conclusion, I discuss the difficulties in evaluating these types of monitoring systems. I argue that meaningful evaluation tasks should be dynamic: designing collaborative tasks (between a human and a machine) that require explanations for success.
Leilani Gilpin
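To make the "reasonableness monitor" idea in the talk above concrete, here is a toy sketch of a monitor that checks a subsystem's decision against commonsense rules and records the reasons. The rules, subsystem names, and data structures are invented for illustration; this is a sketch of the concept, not the ADE implementation.

    # Toy sketch of a reasonableness monitor (illustrative only): wrap a
    # subsystem's decision, check it against commonsense rules, and keep the
    # reasons so a synthesizer could later combine them across committees.
    from dataclasses import dataclass, field

    @dataclass
    class Explanation:
        decision: str
        passed: bool
        reasons: list = field(default_factory=list)

    class ReasonablenessMonitor:
        def __init__(self, name, rules):
            self.name = name
            self.rules = rules  # list of (description, predicate) pairs

        def check(self, decision, context):
            explanation = Explanation(decision, True)
            for description, predicate in self.rules:
                ok = predicate(decision, context)
                explanation.reasons.append(f"{description}: {'ok' if ok else 'violated'}")
                explanation.passed = explanation.passed and ok
            return explanation

    # Example: monitoring a perception subsystem's label in a driving scene.
    rules = [("pedestrians are implausible on a highway at 120 km/h",
              lambda label, ctx: not (label == "pedestrian" and ctx["speed_kmh"] > 100))]
    monitor = ReasonablenessMonitor("perception", rules)
    print(monitor.check("pedestrian", {"speed_kmh": 120}))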
Fri 6:40 a.m. - 6:50 a.m. | First Break
Fri 6:50 a.m. - 7:20 a.m. | Bridging the gap between research and production in machine learning (Invited talk)
Machine learning has found increasing use in the real world, yet a framework for productionizing machine learning algorithms is lacking. This talk discusses how companies can bridge the gap between research and production in machine learning. It starts with the key differences between the research and production environments: data, goals, compute requirements, and evaluation metrics. It then breaks down the phases of a machine learning production cycle, the infrastructure currently available for the process, and industry best practices. (Live presentation)
Huyen Nguyen
Fri 7:20 a.m. - 7:30 a.m. | Monitoring and explainability of models in production (Contributed talk)
The machine learning lifecycle extends beyond the deployment stage. Monitoring deployed models is crucial for the continued provision of high-quality machine-learning-enabled services. Key areas include model performance and data monitoring, detecting outliers and data drift using statistical techniques, and providing explanations of historic predictions. We discuss the challenges to successful implementation of solutions in each of these areas, with some recent examples of production-ready solutions using open source tools.
Janis Klaise
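As one illustration of the statistical drift detection mentioned in the talk above, the following minimal sketch compares a recent production batch to a reference sample with a per-feature Kolmogorov-Smirnov test. The data, number of features, and significance threshold are placeholder assumptions, not the speaker's production setup.

    # Minimal per-feature drift check: compare recent production data against
    # reference data with a two-sample Kolmogorov-Smirnov test.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    x_ref = rng.normal(0.0, 1.0, size=(1000, 3))   # data seen at training time
    x_prod = rng.normal(0.3, 1.0, size=(500, 3))   # recent production batch

    p_threshold = 0.05 / x_ref.shape[1]  # Bonferroni correction over features
    for j in range(x_ref.shape[1]):
        stat, p_value = ks_2samp(x_ref[:, j], x_prod[:, j])
        print(f"feature {j}: KS={stat:.3f}, p={p_value:.4f}, "
              f"drift={p_value < p_threshold}")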
Fri 7:30 a.m. - 7:40 a.m. | Gradient-Based Monitoring of Learning Machines (Contributed talk)
The widespread use of machine learning algorithms calls for automatic change detection algorithms to monitor their behaviour over time. As a machine learning algorithm learns from a continuous, possibly evolving, stream of data, it is desirable and often critical to supplement it with a companion change detection algorithm to facilitate its monitoring and control. We present a generic score-based change detection method that can detect a change in any number of (hidden) components of a machine learning model trained via empirical risk minimization. The proposed statistical hypothesis test can be readily implemented for such models designed within a differentiable programming framework. We establish the consistency of the hypothesis test and show how to calibrate it based on our theoretical results. We illustrate the versatility of the approach on synthetic and real data.
Lang Liu
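The sketch below conveys the flavour of gradient-based monitoring in a heavily simplified form: it compares the distribution of per-sample loss-gradient norms of a trained model on a reference window and a recent window using a two-sample test. The model, data, and choice of test are assumptions for illustration; the paper's calibrated score-based hypothesis test is different and more principled.

    # Simplified gradient-based monitoring: a shift in the data changes the
    # distribution of per-sample loss gradients of a trained model.
    import torch
    import torch.nn as nn
    from scipy.stats import ks_2samp

    torch.manual_seed(0)
    model = nn.Linear(5, 1)          # stand-in for an already-trained model
    loss_fn = nn.MSELoss()

    def gradient_norms(x, y):
        norms = []
        for xi, yi in zip(x, y):
            model.zero_grad()
            loss_fn(model(xi), yi).backward()
            grad = torch.cat([p.grad.flatten() for p in model.parameters()])
            norms.append(grad.norm().item())
        return norms

    x_ref, y_ref = torch.randn(200, 5), torch.randn(200, 1)
    x_new, y_new = torch.randn(200, 5), torch.randn(200, 1) + 1.0  # shifted targets

    stat, p_value = ks_2samp(gradient_norms(x_ref, y_ref),
                             gradient_norms(x_new, y_new))
    print(f"KS={stat:.3f}, p={p_value:.4f}, change detected: {p_value < 0.05}")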
Fri 7:40 a.m. - 7:50 a.m. | Not Your Grandfather's Test Set: Reducing Labeling Effort for Testing (Contributed talk)
Building and maintaining high-quality test sets remains a laborious and expensive task. As a result, test sets in the real world are often not properly kept up to date and drift from the production traffic they are supposed to represent. The frequency and severity of this drift raise serious concerns over the value of manually labelled test sets in the QA process. This paper proposes a simple but effective technique that drastically reduces the effort needed to construct and maintain a high-quality test set (reducing labelling effort by 80-100% across a range of practical scenarios). This result encourages a fundamental rethinking of the testing process by both practitioners, who can use these techniques immediately to improve their testing, and researchers, who can help address many of the open questions raised by this new approach.
Begum Taskazan
Fri 7:50 a.m. - 8:00 a.m. | Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models (Contributed talk)
Deep learning (DL) can achieve impressive results across a wide variety of tasks, but this often comes at the cost of training models for extensive periods on specialized hardware accelerators. This energy-intensive workload has seen immense growth in recent years. Machine learning (ML) may become a significant contributor to climate change if this exponential trend continues. If practitioners are aware of their energy and carbon footprint, they may actively take steps to reduce it whenever possible. In this work, we present carbontracker, a tool for tracking and predicting the energy and carbon footprint of training DL models. We propose that the energy and carbon footprint of model development and training be reported alongside performance metrics using tools like carbontracker. We hope this will promote responsible computing in ML and encourage research into energy-efficient deep neural networks.
Lasse F. Wolff Anthony
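For readers who want to try the tool, a sketch of the usage pattern from the carbontracker project follows. The epoch count and the training loop body are placeholders, and the exact API should be checked against the package's documentation.

    # Sketch of wrapping a training loop with carbontracker to measure and
    # predict the energy and carbon footprint of training.
    from carbontracker.tracker import CarbonTracker

    max_epochs = 10
    tracker = CarbonTracker(epochs=max_epochs)

    for epoch in range(max_epochs):
        tracker.epoch_start()
        # ... one epoch of training goes here ...
        tracker.epoch_end()

    tracker.stop()  # finalize the measured/predicted energy and CO2eq report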
Fri 8:00 a.m. - 8:10 a.m. | Serverless inferencing on Kubernetes (Contributed talk)
Organisations are increasingly putting machine learning models into production at scale. The increasing popularity of serverless scale-to-zero paradigms presents an opportunity for deploying machine learning models to help mitigate infrastructure costs when many models may not be in continuous use. We will discuss the KFServing project, which builds on the KNative serverless paradigm to provide a serverless machine learning inference solution that allows a consistent and simple interface for data scientists to deploy their models. We will show how it solves the challenges of autoscaling GPU-based inference and discuss some of the lessons learnt from using it in production.
Clive Cox
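Once a model is deployed this way, clients typically call it over the standard V1 prediction protocol. The sketch below shows such a call; the hostname, model name, and payload are assumptions for illustration, and deploying the inference service itself is done with Kubernetes manifests not shown here.

    # Sketch of querying a deployed model over the V1 "predict" REST endpoint.
    import requests

    url = "http://sklearn-iris.default.example.com/v1/models/sklearn-iris:predict"
    payload = {"instances": [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]]}

    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()
    print(response.json())  # e.g. {"predictions": [1, 1]}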
Fri 8:10 a.m. - 8:20 a.m. | Do You Sign Your Model? (Contributed talk)
Engineering a top-notch deep neural network (DNN) is an expensive procedure that involves collecting data, hiring human resources with expertise in machine learning, and providing high computational resources. For that reason, DNNs are considered valuable intellectual properties (IPs) of the model vendors. To ensure a reliable commercialization of these products, it is crucial to develop techniques to protect model vendors against IP infringements. One such technique that has recently shown great promise is digital watermarking. In this paper, we present GradSigns, a novel watermarking framework for DNNs. GradSigns embeds the owner's signature into the gradient of the cross-entropy cost function with respect to the model's inputs. Our approach has negligible impact on the performance of the protected model and can verify ownership of remotely deployed models through prediction APIs. We evaluate GradSigns on DNNs trained for different image classification tasks using the CIFAR-10, SVHN, and YTF datasets, and experimentally show that, unlike existing methods, GradSigns is robust against counter-watermark attacks and can embed a large amount of information into DNNs.
Omid Aramoon
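The following is a conceptual sketch only, intended to show what a signature carried by the input gradient could look like: it reads one bit from the sign of the projection of the input gradient onto a secret carrier vector. The model, carrier, and bit encoding are invented for illustration; GradSigns itself embeds the signature during training with a dedicated regularizer and uses a verification protocol not shown here.

    # Conceptual sketch: extract a "signature" bit from the gradient of the
    # cross-entropy loss with respect to the model inputs.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(8, 16, requires_grad=True)   # probe inputs
    y = torch.randint(0, 10, (8,))
    carrier = torch.randn(16)                    # owner's secret key

    loss = loss_fn(model(x), y)
    (input_grad,) = torch.autograd.grad(loss, x)  # d(loss)/d(inputs)

    projection = (input_grad.mean(dim=0) * carrier).sum()
    print("extracted bit:", int(projection > 0))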
Fri 8:20 a.m. - 8:30 a.m. | PareCO: Pareto-aware Channel Optimization for Slimmable Neural Networks (Contributed talk)
Slimmable neural networks have been proposed recently for resource-constrained settings such as mobile devices, as they provide a flexible trade-off front between prediction error and computational cost (such as the number of floating-point operations, or FLOPs) with the same storage cost as a single model. However, current slimmable neural networks use a single width-multiplier for all the layers to arrive at sub-networks with different performance profiles, which neglects that different layers affect the network's prediction accuracy differently and have different FLOP requirements. We formulate the problem of optimizing slimmable networks from a multi-objective optimization lens, which leads to a novel algorithm for optimizing both the shared weights and the width-multipliers for the sub-networks. While slimmable neural networks introduce the possibility of only maintaining a single model instead of many, our results make it more realistic to do so by improving their performance.
Ting-wu Chin
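For readers unfamiliar with slimmable networks, the toy sketch below shows the baseline setting the abstract refers to: a single width multiplier, applied here by masking hidden channels, selects sub-networks that share one set of weights. PareCO replaces the single multiplier with per-layer multipliers chosen via multi-objective optimization; everything in this sketch (model, widths, masking scheme) is an illustrative assumption.

    # Toy slimmable training step: train shared weights under several width
    # multipliers by masking hidden channels (single-multiplier baseline).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class SlimmableMLP(nn.Module):
        def __init__(self, d_in=32, d_hidden=64, d_out=10):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, d_out)

        def forward(self, x, width):
            h = torch.relu(self.fc1(x))
            k = max(1, int(width * h.shape[1]))   # keep the first k channels
            mask = torch.zeros_like(h)
            mask[:, :k] = 1.0
            return self.fc2(h * mask)

    model = SlimmableMLP()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

    # One step: accumulate gradients from the smallest and largest widths.
    opt.zero_grad()
    for width in (0.25, 1.0):
        loss_fn(model(x, width), y).backward()
    opt.step()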
Fri 8:30 a.m. - 8:40 a.m. | Technology Readiness Levels for Machine Learning Systems (Contributed talk)
The development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and treated as a means to an end. This lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, where mission-critical measures and robustness are ingrained in the development process. Drawing on experience in both spacecraft engineering and AI/ML (from research through product), we propose a proven systems engineering approach for machine learning development and deployment. Our "Technology Readiness Levels for ML" (TRL4ML) framework defines a principled process to ensure robust systems while being streamlined for ML research and product, including key distinctions from traditional software engineering. Moreover, TRL4ML defines a common language for people across the organization to work collaboratively on ML technologies.
Alexander Lavin
Fri 8:40 a.m. - 9:20 a.m. | Poster session
Q&A live session for the contributed talks played in the previous session. Each poster presenter is in a separate Zoom meeting.
Janis Klaise · Lang Liu · Begum Taskazan · Lasse F. Wolff Anthony · Clive Cox · Omid Aramoon · Ting-wu Chin · Alexander Lavin
Fri 9:20 a.m. - 9:30 a.m. | Second Break
Fri 9:30 a.m. - 10:30 a.m. | Open Problems Panel (Panel)
In this panel, we will present and discuss six open problems. The abstracts are available at the workshop's webpage: https://sites.google.com/view/deploymonitormlsystems/open-problems
Alessandra Tosi · Nathan Korda · Yuzhui Liu · Zhenwen Dai · Alexander Lavin · Erick Galinkin · Camylle Lanteigne
Fri 10:30 a.m. - 10:40 a.m. | Third Break
Fri 10:40 a.m. - 11:50 a.m. | Conservative Exploration in Bandits and Reinforcement Learning (Invited talk)
A major challenge in deploying machine learning algorithms for decision-making problems is the lack of guarantees on the performance of their resulting policies, especially those generated during the initial exploratory phase of these algorithms. Online decision-making algorithms, such as those in bandits and reinforcement learning (RL), learn a policy while interacting with the real system. Although these algorithms will eventually learn a good or optimal policy, there is no guarantee on the performance of their intermediate policies, especially at the very beginning, when they perform a large amount of exploration. Thus, in order to increase their applicability, it is important to control their exploration and make it more conservative. To address this issue, we define a notion of safety that we refer to as safety w.r.t. a baseline: a policy is considered safe if it performs at least as well as a baseline, which is usually the current strategy of the company. We formulate this notion of safety in bandits and RL and show how it can be integrated into these algorithms as a constraint that must be satisfied uniformly in time. We derive contextual linear bandit and RL algorithms that minimize their regret while ensuring that, at any given time, their expected sum of rewards remains above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager of the system is willing to take. We prove regret bounds for our algorithms and show that the cost of satisfying the constraint (conservative exploration) can be controlled. Finally, we report experimental results to validate our theoretical analysis. We conclude the talk by discussing a few other constrained bandit formulations.
Mohammad Ghavamzadeh
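In a common formulation of conservative exploration (the exact notation used in the talk may differ), the safety constraint described above requires, at every round t and for a manager-chosen risk level alpha in (0, 1),

    \sum_{i=1}^{t} \mathbb{E}\big[ r_{\pi_i} \big] \;\ge\; (1 - \alpha) \sum_{i=1}^{t} \mathbb{E}\big[ r_{b_i} \big],

where \pi_i is the action chosen by the learning algorithm at round i and b_i is the action the baseline policy would have taken. The algorithm must satisfy this inequality uniformly over time while still minimizing regret.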
Fri 11:50 a.m. - 12:30 p.m. | Successful Data Science in Production Systems: It’s All About Assumptions (Invited talk)
We explore the art of identifying and verifying assumptions as we build and deploy data science algorithms into production systems. These assumptions can take many forms, from the typical “have we properly specified the objective function?” to the much thornier “does my partner in engineering understand what data I need audited?”. Attendees from outside industry will get a glimpse of the complications that arise when we fail to tend to assumptions in deploying data science in production systems; those on the inside will walk away with some practical tools to increase the chances of successful deployment from day one.
Nevena Lalic
Fri 12:30 p.m. - 1:30 p.m. | Panel discussion (Panel)
All keynote speakers are invited to this panel to discuss the main challenges in deploying and monitoring machine learning systems. Chair: Neil D. Lawrence
Neil Lawrence · Mohammad Ghavamzadeh · Leilani Gilpin · Huyen Nguyen · Ernest Mwebaze · Nevena Lalic