Machine learning is cybersecurity’s latest pipe dream


A recurring claim at security conferences is that “security is a big data / machine learning (ML) / artificial intelligence (AI) problem”. Unfortunately, this is wildly optimistic and, in general, wrong. While certain security problems can be addressed by ML/AI algorithms, the problem of detecting a malicious actor amid the vast trove of information collected by most organizations is, in general, not one of them.

Our faith in AI is based on personal experience (“everything cloud is big data and good”) and the memes of the consumerization era. It is tempting to project this optimism into an enterprise context: the idea that it ought to be possible to sift through large amounts of data to find signs of an attack or breach is intuitively reasonable. Moreover, every IT pro managing systems at scale knows the value of sophisticated tools that help them pick through large volumes of data to find the information relevant to troubleshooting and even security investigations.

First-generation tools such as SIEM (Security Information and Event Management) systems gave security teams a new way of correlating events and triaging large volumes of noisy data. Solutions emerged that borrowed Google’s approach of indexing logs and alerts, adding powerful search and manipulation capabilities that let teams become ever more effective at finding faults and security violations. These tools have helped security teams enormously, but they still leave two challenges: vast amounts of data of unknown value that we don’t know when to discard, and the nagging worry that the security team may have missed a needle somewhere in the haystack – that is, a concern that the algorithms may be imperfect and miss the bad guy anyway.
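To make the idea of indexing and searching logs concrete, here is a minimal sketch of an inverted index over log lines. The sample events and query terms are invented, and real log-search and SIEM platforms add parsing, time ranges, ranking and distribution on top of this basic idea.

```python
from collections import defaultdict

class LogIndex:
    """Toy inverted index: maps each token to the set of log-line IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.lines = []

    def add(self, line):
        line_id = len(self.lines)
        self.lines.append(line)
        for token in line.lower().split():
            self.postings[token].add(line_id)

    def search(self, *terms):
        """Return log lines containing every query term (AND semantics)."""
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t.lower(), set()) for t in terms))
        return [self.lines[i] for i in sorted(hits)]

if __name__ == "__main__":
    index = LogIndex()
    index.add("2016-03-01T10:02:11 sshd failed login for root from 203.0.113.7")
    index.add("2016-03-01T10:02:15 sshd accepted login for alice from 10.0.0.5")
    index.add("2016-03-01T10:03:40 sshd failed login for admin from 203.0.113.7")
    print(index.search("failed", "203.0.113.7"))
```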

Is machine learning the answer to the security problem? Again, this is an imprecise statement of both the problem and the potential set of solutions. In this piece I want to focus on attack detection, and in that domain we need to ask two questions. First, can an algorithm reliably find the needle in the haystack – the tiny deviations from “normal” behavior that might be indicative of an attack? Second, can such an algorithm increase our confidence in the absence of an attack, effectively assuring us that nothing would be lost if we discarded the haystack of data representing the organization’s normal activity? AI and ML are broadly viewed as magical technologies that will transform human experience, and it’s an enormously seductive idea. We’ve all experienced the power of machine learning in Google search, in the recommendation engines of Amazon and Netflix, and in the spam filtering of webmail providers such as Gmail and Outlook. Former Symantec CTO Amit Mital once said that machine learning offers one of the “few beacons of hope in this mess.”

But it’s important not to succumb to hubris. Google’s fabled ability to identify flu epidemics turned out to be woefully inaccurate, and the domain of cyber security is harder still: the signals are weak, there are a huge number of variables to track, and intelligent adversaries have a large attack surface to exploit. There is no guarantee that using ML/AI will leave you much better off than before, when skilled experts did the hard work. Unfortunately, that has yet to stop the marketing spin.

But what is normal?

It’s important to remember there is no silver bullet in security, and there is no evidence at all that tools such as ML and AI can solve the problem. ML is good at finding similarities between things (such as spam emails), but it is not so good at locating anomalies. In fact, any discussion of anomalous behavior presumes that it is possible to describe normal behavior. Unfortunately, decades of research confirm that human activity, application behavior and network traffic are all heavily auto-correlated, making it hard to understand what activity can be categorised as ‘normal’. This gives malicious actors plenty of opportunity to “hide in plain sight” and can even give them the opportunity to train the system to believe that malicious activity is normal.
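To illustrate how fragile a learned notion of “normal” can be, here is a minimal sketch of a common baseline approach: flag any value that sits more than a few standard deviations from the recent mean. The data, thresholds and exfiltration scenario are all invented; the point is that a patient attacker who drifts slowly enough simply becomes part of the baseline.

```python
import statistics

def is_anomalous(history, value, window=30, z_threshold=3.0):
    """Naive baseline detector: flag values far from the recent mean."""
    recent = history[-window:]
    mean = statistics.mean(recent)
    stdev = statistics.pstdev(recent) or 1.0  # guard against zero variance
    return abs(value - mean) / stdev > z_threshold

# Hypothetical daily outbound-transfer volumes (MB) for one host.
history = [100, 102, 98, 101, 99, 103, 100]

print(is_anomalous(history, 500))    # True: a sudden 5x spike stands out

# A patient attacker exfiltrates slightly more each day. Every step looks
# "normal" relative to the recent past, so the detector quietly absorbs it.
for day in range(90):
    value = history[-1] + 2
    if not is_anomalous(history, value):
        history.append(value)

print(history[-1])    # the learned "normal" has drifted to nearly 3x the original
```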

The difference between Trained and Untrained Learning

Any ML system must attempt to separate and differentiate activity based either on pre-defined (i.e. trained learning) or self-learned classifications. Training an ML engine using human experts seems like a great idea, but assumes that the attackers won’t subtly vary their behaviour over time in response. Self-learned categories are often impossible for humans to understand. Unfortunately, ML systems are not good at describing why a particular activity is inconsistent with normal behaviour, and how it is related to others. So when the ML system delivers an alert, security teams still have to do the hard work of understanding whether or not it is a false positive, before trying to understand how the anomaly is related to other activity within the system.
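As a sketch of the self-learned case, the snippet below runs scikit-learn’s IsolationForest (one common unsupervised anomaly detector, used here purely for illustration) over invented login-session features. The model will happily flag the odd session out, but its output is just a label and a score, with nothing that tells an analyst why the session is suspicious or what it is connected to.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-session features: [login hour, MB downloaded, distinct hosts touched]
rng = np.random.default_rng(0)
normal_sessions = np.column_stack([
    rng.normal(10, 2, 500),    # logins clustered around mid-morning
    rng.normal(50, 10, 500),   # modest download volumes
    rng.normal(3, 1, 500),     # a handful of hosts per session
])
odd_session = np.array([[3.0, 400.0, 40.0]])   # 3 a.m., huge download, many hosts

model = IsolationForest(contamination=0.01, random_state=0).fit(normal_sessions)

# The model emits only a label (-1 = outlier) and an anomaly score. Explaining
# why the session is unusual, and whether it matters, is still left to a human.
print(model.predict(odd_session))             # e.g. [-1]
print(model.decision_function(odd_session))   # more negative = more anomalous
```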

Is It Real?

There is quite a big difference between being happy when Netflix recommends a movie you like, and expecting it never to recommend a movie that you don’t. So while applying ML to your security feeds might deliver some helpful insights, you cannot rely on such a system to deliver only valid results. In the cyber security industry, that difference is cost: time spent understanding why an alert was triggered and whether or not it is a false positive. Ponemon research estimates that an archetypal large enterprise spends up to 395 hours per week processing false alerts – a cost of approximately $1.27 million per year. Unfortunately, organisations also cannot rely on an ML system to find all anomalies, so there is no way to know whether an attacker is still lurking within the network, and therefore no way to know when it is safe to throw the data away.
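A back-of-the-envelope calculation shows why the false-positive burden is structural rather than a tuning problem. All of the numbers below are invented for illustration: when genuine attacks are rare, even a detector with an impressive-sounding error rate produces alerts that are almost all false, and the triage hours land in the same ballpark as the Ponemon figure above.

```python
# Base-rate arithmetic with made-up but plausible numbers.
events_per_day      = 1_000_000   # telemetry events logged by a large enterprise each day
malicious_fraction  = 1e-6        # genuine attack events are vanishingly rare
true_positive_rate  = 0.99        # the detector catches 99% of real attack events
false_positive_rate = 0.0001      # and wrongly flags 0.01% of benign events

malicious_events = events_per_day * malicious_fraction        # ~1 real event per day
benign_events    = events_per_day - malicious_events

true_alerts  = malicious_events * true_positive_rate          # ~1 useful alert per day
false_alerts = benign_events * false_positive_rate            # ~100 false alerts per day

print(f"Chance that a given alert is real: {true_alerts / (true_alerts + false_alerts):.2%}")

# At half an hour of analyst time per false alert, triage alone costs
# roughly 350 hours per week of expert effort.
triage_hours_per_week = false_alerts * 0.5 * 7
print(f"Triage hours per week: {triage_hours_per_week:,.0f}")
```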

Experts Are Still Better

Cybersecurity is a field where human expertise will always be needed to pick through the subtle differences between anomalies. Rather than waste money on the unproven promises of ML- and AI-based security technologies, it is wiser for companies to invest in experts, and in tools that enhance their ability to quickly search for and identify the components of a new attack. In the context of endpoint security, an emerging category of tools that Gartner calls “Endpoint Detection & Response” plays an important role in equipping security teams with real-time insight into indicators of compromise on the endpoint. Here, both continuous monitoring and real-time search are key.
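As a concrete (and deliberately simplified) example of the kind of expert-driven search such tools support, the sketch below matches recorded endpoint events against a small set of indicators of compromise. The event fields, hashes and addresses are all invented; real EDR products record far richer telemetry and support both continuous monitoring and ad-hoc, real-time queries.

```python
# Hypothetical endpoint telemetry and indicators of compromise (IOCs).
events = [
    {"host": "pc-0412", "process": "winword.exe",    "sha256": "aa11aa11", "dest": "update.example.com"},
    {"host": "pc-0412", "process": "powershell.exe", "sha256": "bb22bb22", "dest": "203.0.113.7"},
    {"host": "pc-0907", "process": "chrome.exe",     "sha256": "cc33cc33", "dest": "news.example.org"},
]

iocs = {
    "sha256": {"bb22bb22"},      # known-bad file hash
    "dest":   {"203.0.113.7"},   # known command-and-control address
}

def search_iocs(events, iocs):
    """Yield (matched field, event) for every event that hits an indicator."""
    for event in events:
        for field, bad_values in iocs.items():
            if event.get(field) in bad_values:
                yield field, event

for field, event in search_iocs(events, iocs):
    print(f"IOC match on {field}: {event['host']} / {event['process']}")
```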

ML Cannot Protect You

One final word of caution: as obvious as it may be, post-hoc analysis of monitoring data cannot prevent a vulnerable system from being compromised in the first place. Ultimately, we need to swiftly adopt technologies and infrastructure that are more secure by design. By way of example, segmenting the enterprise network, placing all PCs on a separate routed network segment, and making users authenticate in order to access privileged applications makes it much harder for malware to penetrate the organisation and move laterally within it. Virtualization and micro-segmentation take this a step further, restricting the flow of activity within networks and making applications more resilient to attack. Overall, good infrastructure architecture can make the biggest difference to an organisation’s security posture – reducing the size of the haystack and making the business of defending the enterprise much easier.
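To show what “reducing the size of the haystack” means in practice, here is a toy model of a default-deny segmentation policy: traffic between segments is permitted only if an explicit rule allows it. The segment and application names are invented, and real policies of course live in firewalls, SDN controllers or micro-segmentation platforms rather than in application code.

```python
# Toy default-deny segmentation policy: a flow is allowed only if explicitly listed.
ALLOWED_FLOWS = {
    ("user-pcs", "web-proxy"),
    ("user-pcs", "auth-gateway"),      # users must authenticate at the gateway first
    ("auth-gateway", "payroll-app"),   # only the gateway reaches privileged applications
    ("web-tier", "app-tier"),
    ("app-tier", "db-tier"),
}

def is_allowed(src_segment, dst_segment):
    """Default deny: return True only for explicitly permitted segment pairs."""
    return (src_segment, dst_segment) in ALLOWED_FLOWS

# A compromised PC can reach the authentication gateway, but not the database
# directly, which is what makes lateral movement so much harder.
print(is_allowed("user-pcs", "auth-gateway"))   # True
print(is_allowed("user-pcs", "db-tier"))        # False
```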

Simon Crosby: CTO and co-founder, Bromium

Comments (1)

  • Let’s wait until Cylance gains market share; then we will see how it does once they are on the hackers’ radar. Security through obscurity will only get them so far.

    Cylance works by reading the PE (portable executable) headers and other info pre-execution, and making a determination based on, well, reading a book by its cover. Established AV companies like Webroot, by contrast, actually tear apart the malware in a sandbox / malware-analysis machine and do a full examination of all of the code. All of the established security companies use machine learning and AI to a certain extent.
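    To make the pre-execution point concrete, here is a minimal sketch of the kind of static, header-level features such a classifier might read, assuming the open-source pefile library; it is only an illustration of the general approach and says nothing about how Cylance’s engine actually works.

    ```python
    import pefile  # third-party parser for Windows PE files

    def static_features(path):
        """Read a few header-level features without ever running the sample."""
        pe = pefile.PE(path)
        return {
            "num_sections": pe.FILE_HEADER.NumberOfSections,
            "timestamp": pe.FILE_HEADER.TimeDateStamp,
            "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
            "image_size": pe.OPTIONAL_HEADER.SizeOfImage,
            # High section entropy often hints at packing or encryption.
            "max_section_entropy": max(s.get_entropy() for s in pe.sections),
        }

    # Features like these can feed any off-the-shelf classifier, but nothing here
    # observes what the code actually does when it runs, which is what
    # sandbox-based dynamic analysis adds.
    print(static_features("sample.exe"))
    ```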

    It is quite funny how Cylance always talks about how they use math (like no one else does), and how it is the silver bullet, when Alan Turing discovered in the 1940s that distinguishing between good and bad software with any level of certainty is not possible. Google “P VS NP” and you will see what I mean. Malware detection is at least NP-Complete, and it is probably NP-Hard.

    Do their mathematicians not know this? (Seriously, google “P VS NP”, it is quite interesting.)
    Here is a video that explains P VS NP: https://www.youtube.com/watch?v=YX40hbAHx3s

    Also, they claim that they stop 99% of all viruses and malware. If there are 300,000 new viruses a day, does that mean that 3,000 viruses or malware bypass Cylance each day?

    Also, what about greyware? I will not even go into that for now.

    I truly appreciate their enthusiasm, and it would be nice if SOMEONE actually found the silver bullet for malware, but I am betting that they are being overly optimistic, and that it will be difficult to live up to the promises that they are making, and a lot of people will be disappointed. If they have solved the P VS NP problem, then that means that they have solved a lot of other problems as well, and that would be super cool. (Seriously, google “P VS NP”, it is quite interesting, you will see what I mean.)

    Even if what Cylance is claiming is true… is 99% detection sufficient? Would you board an airplane if you knew you had a 1% chance of it crashing? Also, if there is a 1% chance that a hacker can steal your data from your credit card company, is that acceptable? If there are 100,000 hackers trying to steal your info each day, that means 1,000 succeed, and we all know there are far more than that.

    BTW, what they are doing is nothing new. Just google “malware machine learning” and you will see tons of academic papers on the subject, and projects such as this https://github.com/adobe-security/Malware-classifier. And yet they are somehow trying to get patents on the “technology” when there is a massive amount of prior art that should prevent them from doing so.

    In the interest of full disclosure, I am a competitor of Cylance and Webroot. I believe the computer should be locked whenever it is at risk (application whitelisting). If you want to know more, I would be happy to talk about it, but I am not here for shameless self-promotion. BTW, I realize that Cylance Protect also has an application whitelisting component, but in my opinion, it should be their first layer of defense, and they should develop it further so that it is user-friendly enough to be adopted by the masses.

    I do not mean to be hard on Cylance, but it really is annoying that they believe that they are the only software developers who use math.
