

Innovation Leaders: A Conversation with Behnaz Arzani: Shaping the Future of Network Management

Ubiquity, Volume 2024 Issue September, September 2024 | BY Bushra Anjum




Volume 2024, Number September (2024), Pages 1-5
DOI: 10.1145/3696408

In this interview, Ubiquity's senior editor, Bushra Anjum, chats with Behnaz Arzani, a principal researcher at Microsoft Research, about her pioneering work in automated network management. Arzani discusses the challenges of managing large-scale networks, particularly during incidents, and explores the role of AI and heuristics in addressing these challenges. The conversation then moves to her recently open-sourced project, MetaOpt, which helps network operators analyze and improve the impact of various network algorithms in use.

Behnaz Arzani is a principal researcher at Microsoft Research. She researches automated network management, focusing on practical solutions that safely introduce automation into the network management process. Her most recent work includes Soroush (a faster max-min fair resource allocation algorithm that is currently deployed in Azure's wide-area network controller) and MetaOpt (an open-source heuristic analyzer). Arzani joined Microsoft Research as a senior researcher in 2019 after a two-year post-doctoral position. She graduated from the University of Pennsylvania in 2017. She can be reached at bearzaniATmicrosoftDOTcom.

What are the current challenges and future directions in network incident management, especially with the potential integration of AI technologies like LLMs?

Networks are the backbone of nearly all of our technologies today. Users expect these networks to be functional 24/7 with (hopefully) no disruption. There was a funny incident in 2014 when Facebook had an outage, and people started to call 911! A tweet from retired Sergeant Brink read: "Facebook is not a law enforcement issue, please do not call us about it being down, we don't know when FB will be back up!"

What users do not see is the amount of work that goes into ensuring the network remains functional at all times. We are moving toward an era where our networks are growing in scale and complexity. The incident management process is time-consuming, exhausting, and stressful for the on-call engineer because the set of possible root causes and monitoring data is large. Additionally, network managers must constantly monitor traffic, optimize routing, allocate resources, and ensure the security of the network. These tasks are often performed under pressure, especially during incidents or outages. What makes this even more challenging is that we often cannot deploy optimal solutions in a production environment: They are too slow and inefficient. Because of this, operators resort to heuristics (approximate algorithms), but these heuristics may underperform in certain cases and cause the network to become unreliable.

One may think, wouldn't the recent advances in AI, especially large language models (LLMs) like GPT, allow us to solve this problem? And the answer is yes, they can help, but as we discuss in our paper "A Holistic View of AI-driven Network Incident Management," we need to think carefully about where and how to use them, how to ensure the risk they introduce is minimal, and how to recover when they fail or become unavailable. We should build a solution that applies the LLM in a "chain-of-thought" process so that it can find the root cause of complex incidents, i.e., incidents with no similar symptoms in the LLM's training data. We need a well-defined set of primitives for the LLM to operate over as it mitigates an incident, and we need algorithms that quantify the risk of those actions to ensure the network is safe at all times. Perhaps our most important observation is that a human operator should be able to interact with the LLM and influence the mitigation process to avoid unsafe or incorrect actions.

So, for me, there is a long-term and a short-term plan. In the short term, my goal is to invent solutions that allow human operators to effectively and efficiently monitor their networks and to evaluate the risk, safety, and performance properties of their network to minimize outages and resolve them more quickly when they happen. For example, we have open-sourced a tool, MetaOpt (Github), that enables operators to analyze when and how heuristics (approximate algorithms that sacrifice optimality for speed and efficiency) they deploy in production may underperform so that they can appropriately mitigate the impact in such cases.

In the longer term, I plan to devise a process where we introduce automation (including automation that leverages AI) to gradually replace the human components in the network management workflow—by that point, the tools we build in the short term allow us to ensure the system continues to function reliably and with little additional risk.

How have your personal experiences, passions, and academic background shaped your interest in network management and influenced the solutions you've developed?

The urgency of this problem became clear during my internship at Azure Networking in 2015. I saw the amount of effort that went into the day-to-day management of large-scale networks first-hand and, more importantly, the toll it took on the engineers—long hours, high turnover, and constant stress. I have always wanted to make a meaningful difference in people's lives, and this felt like a perfect opportunity to use my research in a way that does that to some degree. It planted the seed for what would become my long-term vision: A gradual shift toward fully automated network management.

As a kid, I loved (more like, was obsessed with) Anne of Green Gables, and it is in part because the character was kind of like me. She was imaginative [and] creative, and she didn't put any boundaries on what she could or couldn't do. I think this type of personality is really what enabled me to imagine how we may be able to gradually move towards fully automated network management and maybe not worry about all the challenges along the way (and not get paralyzed by the enormity of those challenges).

While my Ph.D. is in computer science, my foundational training lies in electrical engineering. Rooted in communication theory, probability, and mathematics, this background significantly influences my approach to network management problems. For example, we have modeled the MetaOpt problem as a leader-follower game, a concept I first encountered in a game theory class during my electrical engineering studies. The leader controls the inputs to the heuristic and the optimal algorithm and tries to maximize the gap between the performance of the two, while the followers (the heuristic and the optimal algorithm) each try to maximize their own objective given this input. This model gives us a structure for solving the problem, which is very useful because it goes beyond the system we built and lets us devise ways to extend and improve it based on other work in this space. But we needed a way to solve this game scalably, and this is again where my electrical engineering background helped: We modeled the problem as a bi-level optimization and automated and scaled the solution using techniques from optimization theory.
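To make the leader-follower idea concrete, here is a minimal toy sketch, not MetaOpt itself: the follower heuristic is first-fit bin packing (a stand-in for a VM placement heuristic), the optimal follower is an exact branch-and-bound, and the leader simply brute-forces over arrival orders of a small, illustrative workload to maximize the heuristic's cost. The names `first_fit`, `optimal`, and the workload sizes are all assumptions made for illustration; MetaOpt replaces this brute-force search with a scalable bi-level optimization encoding.

```python
from itertools import permutations

CAPACITY = 1.0  # normalized server (bin) capacity
EPS = 1e-9      # tolerance for floating-point sums

def first_fit(items):
    """Heuristic follower: place each item in the first bin with room."""
    bins = []
    for x in items:
        for b in bins:
            if sum(b) + x <= CAPACITY + EPS:
                b.append(x)
                break
        else:
            bins.append([x])
    return len(bins)

def optimal(items):
    """Optimal follower: exact minimum bin count via branch and bound."""
    best = len(items)  # trivially feasible: one bin per item
    def rec(i, bins):
        nonlocal best
        if len(bins) >= best:   # prune: cannot beat the best found so far
            return
        if i == len(items):
            best = len(bins)
            return
        x = items[i]
        for b in bins:          # try every existing bin that has room
            if sum(b) + x <= CAPACITY + EPS:
                b.append(x)
                rec(i + 1, bins)
                b.pop()
        bins.append([x])        # or open a fresh bin
        rec(i + 1, bins)
        bins.pop()
    rec(0, [])
    return best

# Leader: search over arrival orders of a fixed workload for the order that
# maximizes the heuristic's cost. (The optimum is order-independent.)
workload = [0.48] * 4 + [0.52] * 4
opt = optimal(workload)
worst = max(first_fit(order) for order in permutations(workload))
print(f"optimal: {opt} bins, worst-case first-fit: {worst} bins "
      f"(gap {worst / opt:.2f}x)")
# → optimal: 4 bins, worst-case first-fit: 6 bins (gap 1.50x)
```

On this workload the leader finds the adversarial order "all small items first, then all large items": first-fit pairs up the small items, leaving no room for the large ones, while the optimum pairs one small with one large per bin.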

Could you elaborate on your current research focus, particularly in relation to MetaOpt and its potential to improve network management?

As we talk with operators and refine the architecture of our long-term vision (which includes trials with hypothetical outcomes where operators provide feedback on the outcomes automation may choose), our current focus is mostly on the tools we need to enable this vision.

As I mentioned earlier, it is important to enable human operators to safely manage their networks and to reason about the actions they take and the algorithms they deploy in their networks. Such solutions also allow us, in the long term, to ensure the automation that we introduce does not increase the operational risk of the network.

This is what my current research focuses on, i.e., ensuring the safety and reliability of networks, both in the present with human operators and in the future with increased automation. For example, we recently open-sourced MetaOpt, which allows operators to analyze the heuristic algorithms they deploy in production. I mentioned earlier how MetaOpt allows operators to better understand why the approximate algorithms (heuristics) they use in practice may cause performance problems. For example, we show that a heuristic many operators use to place virtual machines (VMs) may use up to two times as many servers as the optimal solution. Through MetaOpt, operators can find out when such performance gaps may occur. One of our current research directions focuses on algorithms that also show the operator why these performance problems happen.

The way I think of MetaOpt is through an analogy: Network verification tools based on satisfiability modulo theories (SMT) solvers allow operators to reason about the "correctness" of the configurations they push to the network; MetaOpt allows them to analyze performance and ensure the algorithms they deploy do not cause catastrophic performance-related failures.

MetaOpt, in its current form, does not apply to every heuristic; it is limited to those we can efficiently solve through our bi-level optimization model (which already covers a broad range of heuristics in packet scheduling, traffic engineering, bin packing, and more). Another project we are working on expands MetaOpt's applicability to algorithms where this is not the case. This is where the foundation we built MetaOpt on proves useful, i.e., there are many ways to analyze and solve a leader-follower game, of which bi-level optimization is just one example. We are also devising tools that make it easier for operators to use MetaOpt.

To further help incident mitigation, I have also built AI-based systems (Scouts and NetPoirot), which help operators quickly identify which system may be at fault when an incident happens. You can learn more about my approach to research through the Microsoft Research podcasts here and here.

If you are interested in network management and have ideas on how we can enable operators to automate the network management process; if you have a networking or systems heuristic and want to understand its performance gap; if you are interested in helping us improve MetaOpt and have ideas on how to enable it to apply to a broader range of heuristics; or if you are just passionate about the future of network management and want to brainstorm ideas, please get in touch.

Author

Bushra Anjum, Ph.D., serves as the Head of Data Science and AI/LLM subject matter expert at the EdTech startup NoRedInk. In this role, she leads a team of analysts, scientists, and engineers to develop adaptive online curriculum tools designed to enhance writing and critical thinking skills for students in grades 3–12. Dr. Anjum's expertise lies in statistical analysis, predictive modeling, GenAI tooling, and distributed systems engineering. She holds a Ph.D. in computer science from North Carolina State University.

This work is licensed under a Creative Commons Attribution International 4.0 License. Copyright 2024 is held by owner/author.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2024 ACM, Inc.
