Haven't We Met Before?
To help identify suspected malware, a CA research project aims to create a virtual shakedown for viruses, Trojans and other malicious software.

By Jason Compton
Winter 2007

When it comes to attacks by malware, the stakes for enterprise IT executives are high – and getting higher. “It’s no longer kids in high school showing off what they can do,” says Tim Ebringer, a researcher at CA Labs. “Now it’s organized crime.” The accelerating pace of malware development and deployment has heightened fears that the “bad guys” might win on sheer volume. CA Labs’ pioneering research aims to ensure that they won’t. “Now that there are very organized groups of people creating this malware, we need content research to become a systematic discipline,” Ebringer adds.

The CA development facility in Melbourne, Australia, is conducting some of the company’s most important malware research. CA researchers are building malware defenses and cleanup tools by hand: unrolling and decrypting code, describing the destructive payload and its effects, and looking for patterns that reveal how best to protect against and disable unwelcome programs. The increasing volume of new malware exploits means that brute-force solutions are no longer adequate. “A lot of malware that comes in the door is the same, or a very small variant, of something we’ve seen before, but because of the way malware can pack itself, disassembling and reassembling itself, every iteration has a completely different outward appearance,” Ebringer says. Human analysts struggle to quickly identify two pieces of code that look different to the naked eye but are in fact the same program.

To combat the difficulties and gain ground against the malware gangs cranking out attacks with increasing frequency and ingenuity, CA’s Melbourne lab is augmenting human intuition by automating the investigation and classification of unwelcome code. One ambitious project is devising methods that will automatically strip malware of its protective layers of obfuscation, break down its component parts into common classifications and use that information to expedite creation and deployment of defenses.

Saving days, or even hours, by pinpointing the precise nature of a piece of malware can mean the difference between a just-in-time patch and disaster: a compromised IT infrastructure. Obfuscation takes many forms: code that is encrypted on disk and decrypted by the virus, either with an integrated key or by hacking its own code; code that is assembled on the fly from seemingly innocent data; and code that is executed unconventionally, by exploiting little-known quirks of the operating system. These methods are all detectable, but human detection takes time. “You can have a packer on top of an encrypter on top of another encrypter. Getting to the payload is like peeling an onion,” Ebringer says. “By hand, unpacking is an arms race we can’t win.” So CA Labs is eagerly backing research on a completely new approach to malware defense: a perfect malware peeler, if you will.
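The onion Ebringer describes can be illustrated with a harmless toy. The sketch below is not CA's tooling: the layer markers, the `pack` and `peel` helpers and the choice of XOR and zlib as layers are all assumptions made for illustration. It wraps an innocuous string in several layers, each XOR layer carrying its own key, then peels the layers off in a loop until nothing recognizable remains.

```python
import zlib

# Hypothetical layer markers, invented for this toy. Real packers do not
# announce themselves so politely, which is what makes unpacking hard.
XOR_TAG = b"\x00XOR"
ZLIB_TAG = b"\x00ZLB"


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key (the same call encrypts and decrypts)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack(payload: bytes, layers) -> bytes:
    """Wrap payload in layers, innermost first.

    Each layer is ("xor", key) or ("zlib", None). An XOR layer stores its
    key (up to 255 bytes) next to the data, mimicking an integrated key.
    """
    blob = payload
    for kind, key in layers:
        if kind == "xor":
            blob = XOR_TAG + bytes([len(key)]) + key + xor_bytes(blob, key)
        elif kind == "zlib":
            blob = ZLIB_TAG + zlib.compress(blob)
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return blob


def peel(blob: bytes, max_layers: int = 32):
    """Strip recognized layers, outermost first.

    Returns (remaining_bytes, names_of_layers_removed).
    """
    removed = []
    for _ in range(max_layers):
        if blob.startswith(XOR_TAG):
            key_len = blob[len(XOR_TAG)]
            start = len(XOR_TAG) + 1
            key = blob[start:start + key_len]
            blob = xor_bytes(blob[start + key_len:], key)
            removed.append("xor")
        elif blob.startswith(ZLIB_TAG):
            blob = zlib.decompress(blob[len(ZLIB_TAG):])
            removed.append("zlib")
        else:
            break  # no known layer left: treat the rest as the payload
    return blob, removed
```

Calling `peel(pack(b"harmless demo payload", [("xor", b"k1"), ("zlib", None), ("xor", b"key2")]))` recovers the original string and reports the layers it removed, outermost first. The toy only works because every layer is labeled; recognizing unlabeled layers automatically is the hard part the research is aimed at.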

In the Key of Malice
Leading the research on next-generation, automated, precision deobfuscation is Serdar Boztas, an associate professor in the department of mathematical sciences at RMIT University in Melbourne. Boztas uses an information theory approach to study the patterns and fingerprints of obfuscated code. With known samples of viruses and the measures of obfuscation they used, Boztas’ group aims to identify not only how obfuscated code protects itself but also how it can be unrolled to its component parts without human intervention.

“The ultimate goal is to have an engine which will do this successfully with each case of malware it sees,” he says.

Each time the deobfuscation engine encounters an unfamiliar method of code protection it can, either automatically or with guidance from a researcher, incorporate that method into its database, but not just as a simple signature. The pattern represents an approach that can be detected in another malicious program even if it is implemented differently or with a different encryption key. Boztas is not especially troubled by the rapidly changing nature of malware threats. “This is why our information-theory, constructed-modeling approach is likely to be more successful than completely system-based approaches,” he says. “Operating systems will change; environments will change; malware will change, but patterns are patterns.”
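The article does not spell out which measures Boztas’ group uses. As one illustration of what an information-theory view of code can look like, the sketch below computes Shannon entropy over fixed-size windows of a byte string. This is a standard textbook measure, chosen here as an assumption and not as the group's method; the function names and the 256-byte window are likewise hypothetical. Encrypted or compressed regions tend to score close to 8 bits per byte, while plain code and text usually score lower, so a jump in entropy can hint at where a protective layer begins.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropy(data: bytes, size: int = 256):
    """Entropy of each consecutive window of `size` bytes.

    The window size is an arbitrary choice for this sketch.
    """
    return [shannon_entropy(data[i:i + size]) for i in range(0, len(data), size)]
```

A byte string made of one repeated value scores 0 bits per byte, and one in which all 256 byte values appear equally often scores the maximum of 8. Entropy alone says nothing about what a region does; it is only a cheap signal for where to look.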

Accelerating the stripping down of malware is part of the solution. Once the malware is exposed, CA Labs wants to know whether the virus or the Trojan is the true threat. That’s where a second research team takes over. “Our first effort is to come up with a universal language for malware,” says Lynn Margaret Batten, a professor at Melbourne’s Deakin University and director of its Information Security Group. “We’re looking to define the dictionary, the grammar and the variants of this language by using statistical analysis on the malware datasets that CA has made available.”

Although malware threats are already categorized according to payload, distribution vectors and other key characteristics, the terminology is not fully standardized and varies from lab to lab and from IT platform to IT platform. “A nice, formal, automated method of putting everything on the same baseline doesn’t exist,” Batten says.

“We decided to build a language, starting from scratch and thinking about what we would need to describe everything about malware.”

Team Players
Batten’s researchers are experienced not just in malware identification and defense but also in statistical methods and programming-language design. “Defining the distance between two pieces of code can be tricky in the abstract, so we need to come up with a way to classify how close these pieces of software are to each other. That will tell us when two pieces of malware are part of the same virus family,” Batten says. “Right now, it takes a researcher several days to determine how ‘close’ two pieces of malware are.” A common language will make it much easier to assign scores to each aspect of malware and provide a basis for comparison of two software exploits that, to the naked eye, seem unrelated but whose proximity in terms of malware language could provide useful clues to defend against both.
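Batten does not say how her team defines distance, and the malware language itself is still under design. As a stand-in, the sketch below uses normalized compression distance, a generic similarity measure built on an off-the-shelf compressor. It is an assumption for illustration only, not the project's metric, and the helper names are hypothetical. The idea is that if two byte strings share structure, compressing them together costs little more than compressing one alone.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content; values
    near 1 suggest they share little. zlib only looks back 32 KB, so this
    toy is meaningful for small inputs only.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Under this measure, a sample and a lightly modified copy of it score much closer to each other than two unrelated samples do. It works on raw bytes, so it would have to be applied after unpacking: two packed copies of the same program can look entirely unrelated, which is the problem the deobfuscation work addresses.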

An intermediary language could provide numerous benefits to antivirus professionals. Such a language could be the basis of a virtual machine designed solely for the identification, simulation and prevention of malware attacks. Currently, researchers use sandbox machines or controlled emulation environments, but these approaches typically require more manual intervention than the proposed virtual machine would. Furthermore, the language would be an invaluable tool for comparing malware: it would help identify code that can be treated in the same fashion as another threat and recognize code that calls for unique defenses and cleanup tools.

The extensive “malware zoo” at CA, which catalogs and archives known threats for antivirus-engine development, plays a major role in support of the theory behind the new research. It provides a large database of known malicious code, which the researchers use as a baseline before setting their creations loose on current threats. “We’re delighted by CA Labs’ participation here, because access to their malware zoo is critical to the project,” Batten says.

The immediate applications for this research are at the professional level of information security rather than for a desktop or server antivirus scanner. Such scanners are expected to handle thousands of files per minute, but the scanning and classification these researchers are aiming for require a more in-depth approach. “We would like eventually to be able to use these automated tools to remove malware and create new binaries without the junk code—to reverse all the obfuscation to the extent that it’s possible,” Ebringer says.

CA Labs’ malware researchers are not building a quick fix. Each component of the project has a three-year life span, and nobody can promise total information security. But with the right combination of luck and skill, future generations of IT defense will involve not only clever researchers who toil through the night, but also omnipresent malware-defense systems. These systems will learn about suspicious code, instantly strip it down to its component parts and identify a likely solution. That would, if you like, even the stakes.

Jason Compton is a technology journalist based in the Madison, Wis., area. His work has been featured in more than 40 technology publications.