The Pitfalls of Machine Learning when Striving for AIOps

On the roadmap to Artificial Intelligence for IT operations (AIOps), organisations need to consider implementing a Machine Learning (ML) platform that can ingest and analyse large volumes of data from multiple sources.

ML platforms help detect patterns for predictive insights, thereby creating a self-healing system that enables better operational efficiencies.

After recently publishing a glossy one-pager intended as a roadmap for achieving AIOps, I thought it might be a good idea to shed some light on the ML stage and highlight some of the caveats that much of the market seems to have chosen to either ignore or underestimate.

As our roadmap implies, system automation can already be enabled in the earliest stages, while visibility systems are being implemented. The first hurdle that most organisations face before reaching ML is making sure that the data, gleaned from multiple sources, is collated into a central data warehouse.

From the data warehouse, the data needs to be normalised so that the ML module can ingest it.
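What "normalising" means will depend on the platform, but a common first step is scaling metrics from different sources onto a shared range. As a minimal sketch (min-max scaling is an assumption here, not something the article prescribes):

```python
def normalise(values):
    """Min-max scale a list of raw metric samples into the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series carries no variation; map it all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw link-utilisation samples (percent) from one source
utilisation_samples = [12, 30, 45, 90]
scaled = normalise(utilisation_samples)
```

Once every feed is scaled the same way, the ML module can compare metrics that were originally reported in different units and ranges.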

Types of ML

Today, ML algorithms are trained using three prominent methods, namely: supervised, unsupervised, and reinforcement learning.

In this article, we will focus on supervised learning. It is one of the most basic types of ML, yet, when used in the right circumstances, it is extremely powerful.

In this scenario, the ML algorithm is trained by a human, using past data (a training dataset), with the desired output in mind. The training dataset gives the algorithm a basic idea of the problem, the solution, and the data points it will deal with.

The algorithm is then trained to find a cause-and-effect relationship between the parameters. Once trained, the algorithm is deployed on the final dataset, where it continues to improve, discovering new patterns and relationships.

An example of this is weather prediction: you know what to predict (future weather) and can train an ML algorithm to do so using past data.
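The weather example can be sketched with one of the simplest supervised models, a least-squares linear fit; the toy dataset and the choice of model are assumptions for illustration only:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b, trained on past (x, y) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Training dataset: (yesterday's temperature, today's temperature)
past = [(18, 19), (20, 21), (22, 23), (24, 25)]
a, b = fit_linear([p[0] for p in past], [p[1] for p in past])

# Predict future weather from an unseen input
prediction = a * 26 + b
```

The human's role is visible even in this sketch: someone chose the inputs, the output, and the shape of the relationship before any "learning" happened.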

Now for the part that is largely overlooked:

The Human Role in ML

The name alone implies that some teaching needs to happen. Having algorithms in place to perform tasks, as illustrated above, implies a human factor: someone has to teach the system based on human experience.

There is no silver bullet for sidestepping the transfer of knowledge.  It’s like raising a child; there must be constant supervision and guidance so that the correct behaviours develop.

"Children should have enough freedom to be themselves - once they've learned the rules."

- Anna Quindlen.

Let’s illustrate using a few practical examples.

Baseline Deviation

Let’s use link utilisation as an example. Consider a link whose data shows low utilisation as the norm. The ML module automatically creates a critical event based on the size of the deviation and the rapid change.

Thereafter, it is up to us (humans) to acknowledge that this is a critical alarm and offer a resolution action.

If left unattended, the ML algorithm keeps reviewing the data. It downgrades the impact over time and then accepts the higher utilisation as the new normal, suppressing the event notification.

The inverse creates a potential false positive: once the high utilisation has become the baseline, a drop back to normal levels generates a notification that also needs human teaching.
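A minimal sketch of such a baseline-deviation check, assuming a rolling-window mean and a standard-deviation threshold (the window size and threshold are illustrative choices, not the article's):

```python
from statistics import mean, stdev

def deviation_events(samples, window=20, threshold=3.0):
    """Flag samples that deviate sharply from the rolling baseline.

    The baseline is the previous `window` samples; anything more than
    `threshold` standard deviations from their mean raises an event.
    """
    events = []
    for i in range(window, len(samples)):
        base = samples[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            events.append((i, samples[i]))
    return events

# Low utilisation is the norm; the final spike triggers a critical event
utilisation = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5,
               4, 6, 5, 5, 6, 4, 5, 6, 5, 5, 95]
events = deviation_events(utilisation)
```

Note that the rolling window also shows the "new normal" problem: if the spike persists, it eventually dominates the baseline and the events stop firing, exactly the unattended behaviour described above.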

Predictive Modelling

A predictive model requires even more human involvement: multiple data feeds must be ingested, metrics reviewed, and suggested corrective actions confirmed before a successful model can be built.

It usually also implies that there is some online collaboration or workflow involved with data synchronisation enabled.

To simplify that into a practical example, let’s consider an ML system user who has upgraded an incident to a critical event.

The raw events may look like this:

WAN Link utilisation for branch JHB at 80%, Host JHBRouter

Latency trigger for high latency (4500ms) between Webserver01 and DB07StorageArray

User, Stan Brainchild, event logged for slow responses, SystemHR (Upgraded to critical)

Redundant power supply failure, GautengDistributionSwitch01   

Additional enrichment information should link to the events above, specifically information relating to the identifiers they contain. For example:

Stan Brainchild, Asset: Dell LT, SN:123456789, OS:Win10, Location: PTA

GautengDistributionSwitch01, Asset: Cisco Switch, SN: 987654321, IOS: 15.4, Location: PTA

JHB, Co-ordinates 26.2041° S, 28.0473° E
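That enrichment step amounts to joining raw event text against an asset inventory. A minimal sketch, with a hypothetical inventory keyed on the identifiers from the events above:

```python
# Hypothetical asset inventory keyed on identifiers found in raw events
assets = {
    "GautengDistributionSwitch01": {"type": "Cisco Switch", "location": "PTA"},
    "Stan Brainchild": {"type": "Dell LT", "location": "PTA"},
}

def enrich(event, inventory):
    """Attach inventory details for any known identifier in the event text."""
    event = dict(event)  # copy, so the raw event is left untouched
    for name, details in inventory.items():
        if name in event["text"]:
            event.setdefault("enrichment", {})[name] = details
    return event

raw = {"text": "Redundant power supply failure, GautengDistributionSwitch01"}
enriched = enrich(raw, assets)
```

In a real platform this join would run against a CMDB or asset database rather than an in-memory dict, but the principle is the same: every recognised identifier pulls its context into the event.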

The collaboration and workflow should provide additional data from the human inputs. Again, ignoring events will not teach the system: fixing the problem without updating the system means no additional data feeds back into the scenario.

If an engineer picked up on the latency issue between the web server and database, they might have input a resolution note: “Increased load on the front end (Webserver01) caused backend (DB07StorageArray) queues to build up.  Increased WorkQueueLength to 5000, error cleared.”    

If human consensus is reached, acknowledging that the root cause of the SystemHR slow responses was, in fact, the WorkQueueLength adjustment, then the ML needs to be taught that these events are linked.  For example:

User, Stan Brainchild, event logged for slow responses, SystemHR

Increased load on the front end (Webserver01) caused backend (DB07StorageArray) queues to build up. Increased WorkQueueLength to 5000, error cleared.

It’s a step in the right direction, yet only the beginning of teaching: the logic linked above is still very broad and covers only one scenario out of many.

To illustrate again, this is typically the next logic spawned (from the next incident that Stan logs):

User, Stan Brainchild, event logged for password reset, SystemHR

ML recommended action:

Increased load on the front end (Webserver01) caused backend (DB07StorageArray) queues to build up. Increased WorkQueueLength to 5000, error cleared.
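The over-broad match above is easy to reproduce. As a sketch, assume a hypothetical knowledge base keyed only on coarse features (user and system); any new incident from the same user on the same system then surfaces the old fix, however unrelated:

```python
# Hypothetical knowledge base: human-confirmed resolutions, coarsely keyed
kb = {}

def teach(event, resolution):
    """Store a confirmed resolution keyed on (user, system) only."""
    kb[(event["user"], event["system"])] = resolution

def recommend(event):
    """Return the stored resolution for the matching (user, system) key."""
    return kb.get((event["user"], event["system"]))

teach({"user": "Stan Brainchild", "system": "SystemHR",
       "issue": "slow responses"},
      "Increased WorkQueueLength to 5000, error cleared.")

# A different issue from the same user and system still matches: too broad
rec = recommend({"user": "Stan Brainchild", "system": "SystemHR",
                 "issue": "password reset"})
```

Adding the issue description (and the correlated infrastructure events) to the key is exactly the kind of ongoing teaching the article is arguing for.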

Over time, accuracy will improve, but only if the teaching happens. Once it does, recommended actions turn into remedial actions, moving towards a self-healing system.

Remedial Actions

Remedial actions imply data synchronisation and tight integration, especially if it is a real-time requirement.

Using the above example, if the ML module has identified that the slow responses of SystemHR can be rectified by increasing a queue length variable on a specific server, then the integration must allow for a configuration change from the integration layer.

Special care must be taken not to have hardcoded logic in remedial actions. Increasing a queue length to a fixed 5000 is restrictive; increasing the current value by 30% might be the better long-term remedial action.
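The difference is small in code but large in practice. A sketch of the relative approach, with the 30% factor from the text and a hypothetical safety ceiling added as an assumption:

```python
def remedial_queue_length(current, factor=0.30, ceiling=20000):
    """Scale the queue length relatively rather than hardcoding a value.

    A percentage increase adapts as the system grows; the ceiling is a
    guard (assumed here) against runaway growth from repeated automated
    adjustments.
    """
    proposed = int(current * (1 + factor))
    return min(proposed, ceiling)

# Hardcoding 5000 works once; a 30% relative increase keeps working
new_length = remedial_queue_length(1000)
```

The ceiling matters because a self-healing loop with no bound will happily "fix" the same symptom forever; any automated remedial action needs a limit a human has signed off on.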

Do not underestimate the importance of the correct teaching and the human investment required.

By Emile Biagio (CTO Sintrex)

Last modified on Monday, 27 September 2021 12:17