Challenges in Deploying and Monitoring Machine Learning Systems
Alessandra Tosi · Nathan Korda · Neil Lawrence

Fri Jul 17 05:00 AM -- 01:35 PM (PDT)
Event URL: https://sites.google.com/view/deploymonitormlsystems

Until recently, Machine Learning has been mostly applied in industry by consulting academics, data scientists within larger companies, and a number of dedicated Machine Learning research labs within a few of the world’s most innovative tech companies. Over the last few years we have seen the dramatic rise of companies dedicated to providing Machine Learning software-as-a-service tools, with the aim of democratizing access to the benefits of Machine Learning. All these efforts have revealed major hurdles to ensuring the continual delivery of good performance from deployed Machine Learning systems. These hurdles range from challenges in MLOps, to fundamental problems with deploying certain algorithms, to solving the legal issues surrounding the ethics involved in letting algorithms make decisions for your business.

This workshop will invite papers related to the challenges in deploying and monitoring ML systems. It will encourage submissions on: subjects related to MLOps for deployed ML systems (such as testing ML systems, debugging ML systems, monitoring ML systems, debugging ML models, deploying ML at scale); subjects related to the ethics around deploying ML systems (such as ensuring fairness, trust and transparency of ML systems, providing privacy and security on ML systems); useful tools and programming languages for deploying ML systems; specific challenges relating to deploying reinforcement learning in ML systems, performing continual learning and providing continual delivery in ML systems; and finally data challenges for deployed ML systems.

Fri 5:00 a.m. - 5:10 a.m.
Opening remarks (Talk)
Alessandra Tosi, Nathan Korda
Fri 5:10 a.m. - 5:55 a.m.

Successful deployment of ML models tends to result from a good fit between the technology and the context. In this talk I will focus on the African context, which is commonly taken as synonymous with a developing context, although I want to argue that there is a difference. I will expound on the opportunities and challenges that this unique context provides, the assumptions made when deploying in such a context, and how well they fit. Another angle of the talk will be deployment with a view to influencing societal good, which may differ from deployment in a production system. I will also draw insights from some projects I have been engaged in towards this end.

Live presentation

Ernest Mwebaze
Fri 5:55 a.m. - 6:40 a.m.

I present a new architecture for detecting and explaining complex system failures. My contribution is a system-wide monitoring architecture, which is composed of introspective, overlapping committees of subsystems. Each subsystem is encapsulated in a "reasonableness" monitor, an adaptable framework that supplements local decisions with commonsense data and reasonableness rules. This framework is dynamic and introspective: it allows each subsystem to defend its decisions in different contexts--to the committees it participates in and to itself.

For reconciling system-wide errors, I developed a comprehensive architecture that I call "Anomaly Detection through Explanations" (ADE). The ADE architecture contributes an explanation synthesizer that produces an argument tree, which in turn can be traced and queried to determine the support of a decision, and to construct counterfactual explanations. I have applied this methodology to detect incorrect labels in semi-autonomous vehicle data, and to reconcile inconsistencies in simulated anomalous driving scenarios.

In conclusion, I discuss the difficulties in /evaluating/ these types of monitoring systems. I argue that meaningful evaluation tasks should be dynamic: designing collaborative tasks (between a human and machine) that require /explanations/ for success.

Leilani Gilpin
Fri 6:40 a.m. - 6:50 a.m.
First Break (Break)
Fri 6:50 a.m. - 7:20 a.m.

Machine learning has found increasing use in the real world, and yet a framework for productionizing machine learning algorithms is lacking. This talk discusses how companies can bridge the gap between research and production in machine learning. It starts with the key differences between the research and production environments: data, goals, compute requirements, and evaluation metrics. It also breaks down the different phases of a machine learning production cycle, the infrastructure currently available for the process, and the industry best practices.

Live presentation

Chip Nguyen
Fri 7:20 a.m. - 7:30 a.m.

The machine learning lifecycle extends beyond the deployment stage. Monitoring deployed models is crucial for the continued provision of high-quality machine-learning-enabled services. Key areas include model performance and data monitoring, detecting outliers and data drift using statistical techniques, and providing explanations of historic predictions. We discuss the challenges to successful implementation of solutions in each of these areas, with some recent examples of production-ready solutions using open source tools.
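As an illustration of detecting data drift with a statistical technique, the sketch below compares a reference sample of one feature against a recent production sample using the two-sample Kolmogorov-Smirnov statistic. It is a minimal sketch under stated assumptions, not the method or tooling presented in the talk: the function names `ks_statistic` and `drift_detected`, the choice of the KS test, and the default significance level are all assumptions made for this example, and the threshold uses the standard large-sample approximation.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions can only change at sample points,
    # so it is enough to evaluate both empirical CDFs there.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the asymptotic critical
    value c(alpha) * sqrt((n + m) / (n * m)). alpha is a hypothetical
    default chosen for illustration."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold
```

In a monitoring setting, a check like this would typically be run per feature on a sliding window of production inputs, with the reference sample drawn from the training data.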

Janis Klaise
Fri 7:30 a.m. - 7:40 a.m.

The widespread use of machine learning algorithms calls for automatic change detection algorithms to monitor their behaviour over time. As a machine learning algorithm learns from a continuous, possibly evolving, stream of data, it is desirable and often critical to supplement it with a companion change detection algorithm to facilitate its monitoring and control. We present a generic score-based change detection method that can detect a change in any number of (hidden) components of a machine learning model trained via empirical risk minimization. This proposed statistical hypothesis test can be readily implemented for such models designed within a differentiable programming framework. We establish the consistency of the hypothesis test and show how to calibrate it based on our theoretical results. We illustrate the versatility of the approach on synthetic and real data.

Lang Liu
Fri 7:40 a.m. - 7:50 a.m.

Building and maintaining high-quality test sets remains a laborious and expensive task. As a result, test sets in the real world are often not properly kept up to date and drift from the production traffic they are supposed to represent. The frequency and severity of this drift raise serious concerns over the value of manually labelled test sets in the QA process. This paper proposes a simple but effective technique that drastically reduces the effort needed to construct and maintain a high-quality test set (reducing labelling effort by 80-100% across a range of practical scenarios). This result encourages a fundamental rethinking of the testing process by both practitioners, who can use these techniques immediately to improve their testing, and researchers, who can help address many of the open questions raised by this new approach.

Begum Taskazan
Fri 7:50 a.m. - 8:00 a.m.

Deep learning (DL) can achieve impressive results across a wide variety of tasks, but this often comes at the cost of training models for extensive periods on specialized hardware accelerators. This energy-intensive workload has seen immense growth in recent years. Machine learning (ML) may become a significant contributor to climate change if this exponential trend continues. If practitioners are aware of their energy and carbon footprint, then they may actively take steps to reduce it whenever possible. In this work, we present carbontracker, a tool for tracking and predicting the energy and carbon footprint of training DL models. We propose that the energy and carbon footprint of model development and training be reported alongside performance metrics using tools like carbontracker. We hope this will promote responsible computing in ML and encourage research into energy-efficient deep neural networks.
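The arithmetic behind an energy and carbon estimate can be sketched as follows. This is not carbontracker's implementation or API; it is a simplified illustration that assumes energy is average power times duration scaled by a data-centre overhead factor (PUE), that emissions are energy times a grid carbon intensity, and that the total for a training run can be extrapolated linearly from a few measured epochs. All function names and all numbers are hypothetical.

```python
def energy_kwh(avg_power_watts, duration_hours, pue=1.0):
    """Energy in kWh from average power draw and duration.
    pue is the power usage effectiveness of the data centre
    (1.0 means no overhead beyond the hardware itself)."""
    return avg_power_watts * duration_hours / 1000.0 * pue


def carbon_grams(energy, intensity_g_per_kwh):
    """Emissions in grams of CO2-equivalent for a given energy in kWh."""
    return energy * intensity_g_per_kwh


def predict_total(per_epoch_values, total_epochs):
    """Linear extrapolation: mean of the measured epochs times the
    planned number of epochs."""
    mean_per_epoch = sum(per_epoch_values) / len(per_epoch_values)
    return mean_per_epoch * total_epochs


# Hypothetical measurements: two epochs of 30 minutes each at 250 W,
# in a data centre with PUE 1.5 and a grid intensity of 300 gCO2eq/kWh.
measured = [energy_kwh(250.0, 0.5, pue=1.5) for _ in range(2)]
predicted_energy = predict_total(measured, total_epochs=100)
predicted_carbon = carbon_grams(predicted_energy, intensity_g_per_kwh=300.0)
```

Reporting both the predicted energy and the predicted emissions matters because the same training run can have very different footprints depending on where and when it is executed.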

Lasse F. Wolff Anthony
Fri 8:00 a.m. - 8:10 a.m.

Organisations are increasingly putting machine learning models into production at scale. The increasing popularity of serverless scale-to-zero paradigms presents an opportunity for deploying machine learning models to help mitigate infrastructure costs when many models may not be in continuous use. We will discuss the KFServing project which builds on the KNative serverless paradigm to provide a serverless machine learning inference solution that allows a consistent and simple interface for data scientists to deploy their models. We will show how it solves the challenges of autoscaling GPU based inference and discuss some of the lessons learnt from using it in production.

Clive Cox
Fri 8:10 a.m. - 8:20 a.m.

Engineering a top-notch deep neural network (DNN) is an expensive procedure which involves collecting data, hiring human resources with expertise in machine learning, and providing high computational resources. For that reason, DNNs are considered valuable intellectual property (IP) of the model vendors. To ensure a reliable commercialization of these products, it is crucial to develop techniques to protect model vendors against IP infringements. One such technique that has recently shown great promise is digital watermarking. In this paper, we present GradSigns, a novel watermarking framework for DNNs. GradSigns embeds the owner's signature into the gradient of the cross-entropy cost function with respect to the inputs to the model. Our approach has negligible impact on the performance of the protected model, and can verify ownership of remotely deployed models through prediction APIs. We evaluate GradSigns on DNNs trained for different image classification tasks using the CIFAR-10, SVHN and YTF datasets, and experimentally show that, unlike existing methods, GradSigns is robust against counter-watermark attacks and can embed a large amount of information into DNNs.

Omid Aramoon
Fri 8:20 a.m. - 8:30 a.m.

Slimmable neural networks have been proposed recently for resource-constrained settings such as mobile devices as they provide a flexible trade-off front between prediction error and computational cost (such as the number of floating-point operations or FLOPs) with the same storage cost as a single model. However, current slimmable neural networks use a single width-multiplier for all the layers to arrive at sub-networks with different performance profiles, which neglects that different layers affect the network's prediction accuracy differently and have different FLOP requirements. We formulate the problem of optimizing slimmable networks from a multi-objective optimization lens, which leads to a novel algorithm for optimizing both the shared weights and the width-multipliers for the sub-networks. While slimmable neural networks introduce the possibility of only maintaining a single model instead of many, our results make it more realistic to do so by improving their performance.
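The notion of a trade-off front between prediction error and FLOPs can be made concrete with a small sketch. The code below is not the optimization algorithm proposed in the paper; it only shows, under the assumption that each candidate sub-network is summarized by a hypothetical `(error, flops)` pair, how the non-dominated candidates (the Pareto front) can be picked out. The function name and the sample numbers are invented for illustration.

```python
def pareto_front(points):
    """Return the (error, flops) pairs that no other pair dominates.
    A pair q dominates p when q is no worse on both objectives and
    strictly better on at least one (lower is better for both)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical sub-networks: (prediction error, FLOPs in millions).
candidates = [(0.30, 100), (0.25, 200), (0.28, 250), (0.20, 400), (0.30, 150)]
front = pareto_front(candidates)
```

Here `(0.28, 250)` is dropped because `(0.25, 200)` is better on both axes, and `(0.30, 150)` is dropped because `(0.30, 100)` matches its error at lower cost.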

Ting-wu Chin
Fri 8:30 a.m. - 8:40 a.m.

The development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can lead to technical debt, scope creep and misaligned objectives, model misuse and failures, and expensive consequences. Engineering systems, on the other hand, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. The extreme is spacecraft systems, where mission critical measures and robustness are ingrained in the development process. Drawing on experience in both spacecraft engineering and AI/ML (from research through product), we propose a proven systems engineering approach for machine learning development and deployment. Our "Technology Readiness Levels for ML" (TRL4ML) framework defines a principled process to ensure robust systems while being streamlined for ML research and product, including key distinctions from traditional software engineering. Even more, TRL4ML defines a common language for people across the organization to work collaboratively on ML technologies.

Alexander Lavin
Fri 8:40 a.m. - 9:20 a.m.

Q/A live session for the contributed talks that have been played in the previous session. Each poster presenter is in a separate Zoom Meeting.

  • "Monitoring and explainability of models in production", Klaise
  • "Gradient-Based Monitoring of Learning Machines", Liu
  • "Not Your Grandfather's Test Set: Reducing Labeling Effort for Testing", Taskazan
  • "Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models", Anthony
  • "Serverless inferencing on Kubernetes", Cox
  • "Do You Sign Your Model?", Aramoon
  • "PareCO: Pareto-aware Channel Optimization for Slimmable Neural Networks", Chin
  • "Technology Readiness Levels for Machine Learning Systems", Lavin

Janis Klaise, Lang Liu, Begum Taskazan, Lasse F. Wolff Anthony, Clive Cox, Omid Aramoon, Ting-wu Chin, Alexander Lavin
Fri 9:20 a.m. - 9:30 a.m.
Second Break (Break)
Fri 9:30 a.m. - 10:30 a.m.

In this panel, we will present and discuss six open problems. The abstracts are available at the workshop's webpage: https://sites.google.com/view/deploymonitormlsystems/open-problems

  • Yuzhui Liu (Bloomberg): Deploy machine learning models serverlessly at scale
  • Zhenwen Dai (Spotify): Model Selection for Production Systems
  • Alexander Lavin (Augustus Intelligence): ML lacks the formal processes and industry standards of other engineering disciplines.
  • Erick Galinkin (Montreal AI Ethics Institute and Rapid7): Green Lighting ML
  • Alexander Lavin (Augustus Intelligence): Approaches to AI ethics must consider second-order effects and downstream uses, but how?
  • Camylle Lanteigne (MAIEI and McGill University): SECure: A Social and Environmental Certificate for AI Systems
Alessandra Tosi, Nathan Korda, Yuzhui Liu, Zhenwen Dai, Alexander Lavin, Erick Galinkin, Camylle Lanteigne
Fri 10:30 a.m. - 10:40 a.m.
Third Break (Break)
Fri 10:40 a.m. - 11:50 a.m.

A major challenge in deploying machine learning algorithms for decision-making problems is the lack of guarantee for the performance of their resulting policies, especially those generated during the initial exploratory phase of these algorithms. Online decision-making algorithms, such as those in bandits and reinforcement learning (RL), learn a policy while interacting with the real system. Although these algorithms will eventually learn a good or an optimal policy, there is no guarantee for the performance of their intermediate policies, especially at the very beginning, when they perform a large amount of exploration. Thus, in order to increase their applicability, it is important to control their exploration and to make it more conservative.

To address this issue, we define a notion of safety that we refer to as safety w.r.t. a baseline. In this definition, a policy is considered to be safe if it performs at least as well as a baseline, which is usually the current strategy of the company. We formulate this notion of safety in bandits and RL and show how it can be integrated into these algorithms as a constraint that must be satisfied uniformly in time. We derive contextual linear bandit and RL algorithms that minimize their regret while ensuring that, at any given time, their expected sum of rewards remains above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager of the system is willing to take. We prove regret bounds for our algorithms and show that the cost of satisfying the constraint (conservative exploration) can be controlled. Finally, we report experimental results to validate our theoretical analysis. We conclude the talk by discussing a few other constrained bandit formulations.
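The baseline constraint described above can be sketched as a simple check performed before each exploratory action. This is a simplified illustration, not the algorithms derived in the talk: it assumes the baseline's expected reward per round is a known constant, that a pessimistic (lower-confidence) estimate of the reward is available for every round played so far and for the candidate action, and that `alpha` is the fraction of baseline performance the system manager is willing to risk. All names are hypothetical.

```python
def is_safe_to_explore(lower_bounds_so_far, candidate_lower_bound,
                       baseline_reward_per_round, alpha):
    """Return True if playing the candidate action keeps the pessimistic
    cumulative reward at or above (1 - alpha) times what the baseline
    would have earned over the same number of rounds."""
    rounds = len(lower_bounds_so_far) + 1
    pessimistic_total = sum(lower_bounds_so_far) + candidate_lower_bound
    required = (1.0 - alpha) * baseline_reward_per_round * rounds
    return pessimistic_total >= required


def choose_action(exploratory_action, baseline_action, lower_bounds_so_far,
                  candidate_lower_bound, baseline_reward_per_round, alpha):
    """Play the exploratory action only when the constraint holds;
    otherwise fall back to the baseline action."""
    if is_safe_to_explore(lower_bounds_so_far, candidate_lower_bound,
                          baseline_reward_per_round, alpha):
        return exploratory_action
    return baseline_action
```

A smaller `alpha` makes the learner more conservative: it has to accumulate more reward from baseline plays before it can afford an exploratory action whose lower bound is poor.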

Mohammad Ghavamzadeh
Fri 11:50 a.m. - 12:30 p.m.

We explore the art of identifying and verifying assumptions as we build and deploy data science algorithms into production systems. These assumptions can take many forms, from the typical “have we properly specified the objective function?” to the much thornier “does my partner in engineering understand what data I need audited?”. Attendees from outside industry will get a glimpse of the complications that arise when we fail to tend to assumptions in deploying data science in production systems; those on the inside will walk away with some practical tools to increase the chances of successful deployment from day one.

Nevena Lalic
Fri 12:30 p.m. - 1:30 p.m.

All keynote speakers are invited to this panel to discuss the main challenges in deploying and monitoring machine learning systems.

Chair: Neil D. Lawrence

Neil Lawrence, Mohammad Ghavamzadeh, Leilani Gilpin, Chip Nguyen, Ernest Mwebaze, Nevena Lalic

Author Information

Alessandra Tosi (Mind Foundry)
Nathan Korda (Mind Foundry)
Neil Lawrence (University of Cambridge)

Neil Lawrence is the DeepMind Professor of Machine Learning at the University of Cambridge and a Senior AI Fellow at the Alan Turing Institute.
