Haven't We Met Before?
To help identify suspected malware, a CA research project aims to create
a virtual shakedown for viruses, Trojans and other malicious software.
By Jason Compton
Winter 2007
When it comes to attacks by malware, the stakes for enterprise IT
executives are high – and getting higher. “It’s no longer kids in
high school showing off what they can do,” says Tim Ebringer, a
researcher at CA Labs. “Now it’s organized crime.”
The accelerating pace of malware development and deployment
has heightened fears that the “bad guys” might win on sheer
volume. CA Labs’ pioneering research aims to ensure that they won’t. “Now that there
are very organized groups of people creating this malware, we need content research to
become a systematic discipline,” Ebringer adds.
The CA development facility in Melbourne,
Australia, is conducting some of the company’s
most important malware research. CA
researchers are building malware defenses and
cleanup tools by hand, unrolling and decrypting code,
describing the destructive payload and its effects,
and looking for patterns that can reveal how
best to protect against and disable unwelcome
programs. The increasing volume of new malware
exploits means that brute force solutions
are no longer adequate. “A lot of malware that
comes in the door is the same, or a very small
variant, of something we’ve seen before, but
because of the way malware can pack itself, disassembling
and reassembling itself, every iteration
has a completely different outward appearance,”
Ebringer says. Human workers are challenged
to quickly identify two pieces of code
that may look different to the naked eye but are
in fact the same program.
To combat the difficulties and gain ground
against the malware gangs cranking out attacks
with increasing frequency and ingenuity, CA’s
Melbourne lab is augmenting human intuition
by automating the investigation and classification of
unwelcome code. One ambitious project
is devising methods that will automatically strip
malware of its protective layers of obfuscation,
break down its component parts into common
classifications and use that information to expedite
creation and deployment of defenses.
Saving days—even hours—by targeting the
precise nature of a piece of malware can mean
the difference between a just-in-time patch and
disaster: a compromised IT infrastructure.
Obfuscation takes many forms: code that is
encrypted on disk and is decrypted by the virus
using an integrated key or a hash of its own
code; code that is assembled on the fly from
seemingly innocent data; and code that is
executed unconventionally, by exploiting
little-known quirks of the operating system.
These methods are all detectable, but human
detection takes time. “You can have a packer
on top of an encrypter on top of another
encrypter. Getting to the payload is like
peeling an onion,” Ebringer says. “By hand,
unpacking is an arms race we can’t win.” So
CA Labs is eagerly backing research on a completely
new approach to malware defense—a
perfect malware peeler, if you will.
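
The onion image can be made concrete with a toy. The Python sketch below is an illustration only, not CA Labs' tooling: it wraps a harmless byte string in three made-up layers (a single-byte XOR, zlib compression and Base64) and then peels them off in a loop. The two-byte tags, the XOR key and the function names wrap and peel are assumptions invented for this example. Real packers do not label their layers, and recognizing each layer is exactly the hard part the researchers want to automate.

```python
import base64
import zlib

XOR_KEY = 0x5A  # hypothetical single-byte key, chosen only for this toy


def _xor(data: bytes) -> bytes:
    """XOR every byte with the toy key (the same operation encodes and decodes)."""
    return bytes(b ^ XOR_KEY for b in data)


# Each toy layer announces itself with a 2-byte tag. Real packers do not;
# the tags stand in for whatever heuristic would recognize a layer.
ENCODERS = {b"X:": _xor, b"Z:": zlib.compress, b"B:": base64.b64encode}
DECODERS = {b"X:": _xor, b"Z:": zlib.decompress, b"B:": base64.b64decode}


def wrap(payload: bytes, tags: list[bytes]) -> bytes:
    """Apply the named layers in order, innermost layer first."""
    for tag in tags:
        payload = tag + ENCODERS[tag](payload)
    return payload


def peel(blob: bytes, max_layers: int = 16) -> tuple[bytes, list[bytes]]:
    """Strip recognized layers until none is left.

    Returns the innermost bytes and the tags removed, outermost first.
    """
    removed = []
    for _ in range(max_layers):
        decoder = DECODERS.get(blob[:2])
        if decoder is None:
            break  # no known layer on the outside: stop peeling
        removed.append(blob[:2])
        blob = decoder(blob[2:])
    return blob, removed


if __name__ == "__main__":
    packed = wrap(b"harmless demo payload", [b"X:", b"Z:", b"B:"])
    payload, layers = peel(packed)
    print(payload, layers)
```

Run as a script, the demo recovers the original string and reports the layers in the order they came off: Base64, then zlib, then XOR. The max_layers cap is a simple guard against input that never stops unwrapping.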
In the Key of Malice
Leading the research on next-generation,
automated, precision deobfuscation is Serdar
Boztas, an associate professor in the department
of mathematical sciences at RMIT
University in Melbourne. Boztas uses an information
theory approach to study the patterns
and fingerprints of obfuscated code. With
known samples of viruses and the measures of
obfuscation they used, Boztas’ group aims to
identify not only how obfuscated code protects
itself but also how it can be unrolled to its component
parts without human intervention.
“The ultimate goal is to have an engine which
will do this successfully with each case of malware
it sees,” he says.
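
The article does not say which information-theoretic measures Boztas' group relies on. One measure widely used in this field, offered here purely as an assumed illustration, is the Shannon entropy of a file's bytes: compressed or encrypted data tends to sit close to the maximum of 8 bits per byte, while ordinary code and text score lower. The function name byte_entropy is hypothetical.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    text_like = b"mov eax, ebx; add eax, 1; " * 100
    random_like = os.urandom(4096)  # stands in for encrypted or packed content
    print(f"text-like:   {byte_entropy(text_like):.2f} bits/byte")
    print(f"random-like: {byte_entropy(random_like):.2f} bits/byte")
```

A high score alone does not prove that code is malicious, since legitimate compressed files score high as well. It is only a hint that another layer may be waiting to be peeled.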
Each time the deobfuscation engine
encounters an unfamiliar method of code protection
it can—either automatically or with
guidance from a researcher—incorporate that
method into its database, but not just as a simple
signature. The pattern represents an
approach that can be detected in another
malicious program even if it is implemented
differently or with a different encryption key.
Boztas is not especially troubled by the
rapidly changing nature of malware threats.
“This is why our information-theory, constructed-modeling
approach is likely to be
more successful than completely system-based
approaches,” he says. “Operating systems will
change; environments will change; malware
will change, but patterns are patterns.”
Accelerating the stripping down of malware
is part of the solution. Once the malware
is exposed, CA Labs wants to know whether
the virus or the Trojan is the true threat. That’s
where a second research team takes over. “Our
first effort is to come up with a universal language
for malware,” says Lynn Margaret
Batten, a professor at Melbourne’s Deakin
University and director of its Information
Security Group. “We’re looking to define the
dictionary, the grammar and the variants of
this language by using statistical analysis on
the malware datasets that CA has
made available.”
Although malware
threats are already categorized
according to payload,
distribution vectors and
other key characteristics,
the terminology is not fully
standardized and varies
from lab to lab and from IT
platform to IT platform. “A
nice, formal, automated
method of putting everything
on the same baseline
doesn’t exist,” Batten says.
“We decided to build a language, starting from
scratch and thinking about what we would
need to describe everything about malware.”
Team Players
Batten’s researchers are experienced not just in
malware identification and defense but also in
statistical methods and programming-language
design. “Defining the distance between two
pieces of code can be tricky in
the abstract, so we need to come
up with a way to classify how
close these pieces of software are
to each other. That will tell us
when two pieces of malware are
part of the same virus family,”
Batten says. “Right now, it takes
a researcher several days to
determine how ‘close’ two
pieces of malware are.” A common
language will make it much easier to
assign scores to each aspect of malware and
provide a basis for comparison of two software
exploits that, to the naked eye, seem unrelated
but whose proximity in terms of malware language
could provide useful clues to defend
against both.
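
To give a feel for what a numeric "distance" between two programs can look like, here is a minimal Python sketch. It does not use the malware language Batten's team is designing, which the article does not specify. It substitutes a generic, well-known technique: the Jaccard distance between the sets of byte n-grams in two samples. The function names, the n-gram length of 4 and the sample strings are all assumptions made for illustration.

```python
def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all length-n byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_distance(a: bytes, b: bytes, n: int = 4) -> float:
    """0.0 means identical n-gram sets; 1.0 means no n-gram in common."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both inputs shorter than n: nothing to tell them apart
    return 1.0 - len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # Harmless stand-ins for unpacked code: a sample, a close variant
    # and an unrelated text.
    sample = b"push ebp; mov ebp, esp; xor eax, eax; call unpack; ret"
    variant = b"push ebp; mov ebp, esp; xor ecx, ecx; call unpack; ret"
    unrelated = b"The quick brown fox jumps over the lazy dog"
    print(jaccard_distance(sample, variant))
    print(jaccard_distance(sample, unrelated))
```

The variant lands much closer to the sample than the unrelated text does. A comparison this simple only works once the obfuscation has been stripped away, which is why the two research efforts depend on each other.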
An intermediary language could provide
numerous benefits to antivirus
professionals. Such a language
could be the basis of a virtual
machine designed solely for
the identification, simulation
and prevention of malware
attacks. Currently,
researchers use sandbox
machines or controlled emulation
environments, but these
approaches typically require
more manual intervention
than the proposed virtual
machine would. Furthermore,
the language would be an
invaluable tool for comparing
malware, identifying code
that can be treated in the
same fashion as another threat, and recognizing
other code that calls for unique defenses and
cleanup tools.
The extensive “malware zoo” at CA, which
catalogs and archives known threats for
antivirus-engine development, plays a major
role in support of the theory behind the new
research. It provides a large database of known
malicious code, which the researchers use as a
baseline before setting their
creations loose on current
threats. “We’re delighted by
CA Labs’ participation here,
because access to their malware
zoo is critical to the project,”
Batten says.
The immediate applications
for this research are at the
professional level of information
security rather than for a
desktop or server antivirus scanner. Such scanners
are expected to handle thousands of files
per minute, but the scanning and classification
these researchers are aiming for require a
more in-depth approach. “We would like
eventually to be able to use these automated
tools to remove malware and create new binaries without
the junk code—to reverse all the
obfuscation to the extent that it’s possible,”
Ebringer says.
CA Labs’ malware researchers are not
building a quick fix. Each component of the
project has a three-year life span. Nobody can
promise total information security. But with
the right combination of luck and skill, future
generations of IT defense will involve not only
clever researchers who toil through the night,
but also omnipresent malware-defense systems.
These systems will learn about and instantly
strip suspicious code down to its component
parts and identify a likely solution, which would,
if you like, even the odds.
Jason Compton is a technology journalist based in the
Madison, Wisc. area. His work has been featured in more than
40 technology publications.