A central nervous system for automatically detecting malware

It’s nearly impossible for anti-virus protectors to keep up with malware around the clock by producing descriptions of what each new sample looks or acts like, especially with forty thousand new and unique malware instances appearing every day. And things are only getting worse.

© Michal Mrozek | Dreamstime.com

Even though malware tries to hide itself, let’s assume that there are secure ways for anti-virus protectors to learn about all installations of software, good and bad, that any of their end-users perform. Let’s also assume that they could easily collect other data from these machines and users: geographic location, social networking information, type of operating system, installed programs and configurations.

If they could collect this information, it would enable them to quickly identify new malware strains without even looking at the code.

How is this possible?

We’ll argue that if you know the circumstances of software installations and executions, then you can often tell what kind of software it is without even looking at the code. This information can auto-inform anti-virus protectors. It can be used to provide immediate advice to a client machine, which turns to the “centralized [malware] nervous system” to ask whether a particular piece of code is safe to install or not. Let me provide a few examples of how this could work.

Geographic location. Consider a sequence of installations of some unknown program, performed over a short period of time within a small geographic area. Malware installation patterns, seen as a function of time, typically do not have a strong geographic component. But wait! Some malware does: for example, malware that spreads over Bluetooth or Wi-Fi channels, infecting machines that are physically close to ones already infected. Of course, if everybody in a large local company patched some software at once, this would also show up as geographically correlated installations. But the same patch is likely to be installed in many other places at the same time, and will not spread outward like ripples on water.
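To make the geographic signal concrete, here is a minimal sketch of one way a central system might test whether a burst of installations is tightly clustered. The function names, the (latitude, longitude) input format and the 5 km threshold are all assumptions made for illustration, not part of any existing product.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def median_pairwise_km(points):
    """Median distance over all pairs of install locations (0.0 if fewer than two)."""
    dists = [haversine_km(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    return median(dists) if dists else 0.0

def looks_locally_clustered(points, max_median_km=5.0, min_installs=3):
    """Hypothetical rule: enough installs, and they sit close to one another."""
    return (len(points) >= min_installs
            and median_pairwise_km(points) <= max_median_km)
```

Fed all installations of a program seen in a short time window, a company-wide patch that is also being installed elsewhere would produce a large median distance and would not trip this rule, while a Bluetooth worm confined to one neighborhood would.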

Social graph. Now imagine a graph representing all the computers in the world, where two nodes are connected to each other if and only if the owners of the corresponding two machines know each other or, more practically, if one of them lists the other in their address book. In plotting installations of new software, does it seem to spread along the edges (the connections between the nodes) of the graph? Several infamous types of malware (like the Melissa virus) did just that, since they spread using the address books of infected machines. We don’t need to know what the software looks like or what it does to determine if it is good or bad; we only need to look at the pattern of installations.

However, a legitimate application that is advertised on a social network and shared between friends will also spread along social connections. But consider the speed at which the Melissa virus spread: a moment after a machine was infected, 50 emails were sent out to people in the address list. No matter how much users love the app their friends told them about, not everyone is likely to act that fast. And while malware writers can artificially slow down the spread to avoid automated detection, doing so gives anti-virus companies more time to distribute patches.
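The two social-graph signals, whether installations follow address-book links and how quickly they do so, could be measured along the following lines. This is only a sketch: the input shapes (a map from machine to install time in seconds, and a map from machine to the set of machines in its address book) and the function name are assumptions chosen for illustration.

```python
from statistics import median

def social_spread_stats(install_times, address_books):
    """Return (fraction of installs that followed a contact's install, median delay).

    install_times: dict machine -> install time in seconds (hypothetical format)
    address_books: dict machine -> set of machines listed in its address book
    """
    delays = []
    for machine, t in install_times.items():
        # Time since each earlier install on a machine that lists this one.
        lags = [t - t_src
                for src, t_src in install_times.items()
                if t_src < t and machine in address_books.get(src, set())]
        if lags:
            delays.append(min(lags))  # delay after the most recent such install
    fraction = len(delays) / len(install_times) if install_times else 0.0
    return fraction, (median(delays) if delays else None)
```

A high fraction combined with a median delay of seconds would resemble address-book malware; a high fraction with delays of hours or days looks more like friends recommending an app to each other.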

Time. Automated patching occurs around the clock, and worms infect no matter what time of day. But a Trojan, for example, depends on its victim being awake – the user has to approve its installation. Roughly speaking, if the malware takes advantage of a machine vulnerability, it often will spread independently of the local time of the day (to the extent that people leave their machines on, of course), whereas malware that relies on human vulnerabilities will depend on the time of the day (as does most legitimate software).
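One simple way to express the time-of-day signal is the share of installations that happen while users are likely to be awake, compared with what a spread that ignores the clock would produce. The sketch below assumes we already know the local hour (0 to 23) of each installation; the 8:00 to 23:00 waking window is an arbitrary illustrative choice.

```python
def waking_hours_share(local_hours, start=8, end=23):
    """Fraction of installs whose local hour falls in [start, end)."""
    if not local_hours:
        return 0.0
    awake = sum(1 for h in local_hours if start <= h < end)
    return awake / len(local_hours)

# Share expected if installs were spread evenly around the clock.
UNIFORM_BASELINE = (23 - 8) / 24
```

A share close to `UNIFORM_BASELINE` suggests the software spreads without human involvement, as a worm or an automated patch would; a share well above it suggests a person approved each installation, which is true of Trojans and of most legitimate software alike.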

Behavior. Malware typically behaves in a very static manner – either it uses the address book, or it does not; either it spreads over Bluetooth, or it does not; and so on. But legitimate software is different. Think of a game: a small number of enthusiasts play it, and tell their friends about it. Then, a local newspaper picks up a story about the game, and lots of people in the city where the newspaper is published – whether they know each other or not – start playing it. Some of them are in the same neighborhood, others are miles away from the closest person who also installed the game. The patterns change for many kinds of legitimate software, but not for typical malware.

Yield. This is the term used to measure the chances that a machine that could become infected actually does become infected. Or, for legitimate software, the chances that a person who is given the opportunity to install the software decides to do so. Well-crafted malware has remarkably higher yields than the most exciting legitimate programs (except programs that everybody needs). If you know what machines receive an “installation opportunity” for some software and what machines take it, you know the yield. Now, if you consider the yield in the context of what software a user has already installed, it gets very interesting. Some malware can only infect if it can take advantage of a vulnerability in some particular kind of legitimate software. Don’t have that software? Then you can’t be infected. There are fewer such dependencies for most legitimate software.
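As a small worked example, yield can be computed as installs divided by opportunities, and then split by whether a machine already runs some particular piece of software. The function names and the use of sets of machine identifiers are assumptions made for this sketch.

```python
def yield_rate(opportunities, installed):
    """Fraction of machines offered the software that actually installed it."""
    if not opportunities:
        return 0.0
    return len(opportunities & installed) / len(opportunities)

def yield_by_prerequisite(opportunities, installed, has_software):
    """Yield among machines with and without a given program already installed."""
    with_sw = opportunities & has_software
    without_sw = opportunities - has_software
    return yield_rate(with_sw, installed), yield_rate(without_sw, installed)
```

If ten machines receive an installation opportunity, four install, and all four are among the five machines running one specific program, the overall yield is 0.4 but the split is 0.8 versus 0.0. That kind of sharp dependency is what we would expect from malware exploiting a vulnerability in that program.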

Change and speed. The best thing malware authors may do to hide from anti-virus software is to use polymorphism – code that changes itself as it moves from machine to machine. Each machine that receives it will see something new. But that shouldn’t mess up the automated detection if the system is immediately suspicious of all new code – which makes sense, since legitimate software has no reason to be polymorphic. Popular software that spreads at a reasonable pace is likely to be legitimate, but we can be suspicious of new software that spreads rapidly.
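The rule of thumb in this paragraph can be written down in a few lines. In this sketch the central system is assumed to track, for each code fingerprint, when it was first reported and how many installations have been seen since; the function names and the threshold of 1,000 installs per hour are hypothetical.

```python
def installs_per_hour(first_seen, now, install_count):
    """Average install rate since first sighting (times in seconds)."""
    age_hours = max(now - first_seen, 1.0) / 3600.0  # avoid division by zero
    return install_count / age_hours

def is_suspicious(first_seen, now, install_count, max_installs_per_hour=1000.0):
    """Flag code that has never been seen before, or that is spreading very fast."""
    if first_seen is None:
        return True  # brand-new code is treated as suspect by default
    return installs_per_hour(first_seen, now, install_count) > max_installs_per_hour
```

Polymorphic malware presents a fingerprint that has never been seen on every machine it reaches, so it always takes the first branch and stays under suspicion.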

The insight is: let’s ignore what the malware does on a machine, and instead look at how it moves between machines. That is much easier to assess. And the moment malware gives up what allows us to detect it, it also stops being a threat.

Can this be done?

Of course, everything above rests on the assumption that this type of installation information can be harvested from millions of client machines, infected or not. I believe this is possible, and will share some thoughts here soon.


 Editor: Sonal Chokshi
