GIT-CERCS-05-10
James F. Bowring, Mary Jean Harrold, James M. Rehg
Improving the Classification of Software Behaviors using Ensembles
One approach to the automatic classification of program behaviors is to view
these behaviors as the collection of all the program's executions. Many
features of these executions, such as branch profiles, can be measured, and if
these features accurately predict behavior, we can build automatic behavior
classifiers from them using statistical machine-learning techniques. Two key
problems in the development of useful classifiers are (1) the costs of
collecting and modeling data and (2) the adaptation of classifiers to new or
unknown behaviors. We address the first problem by concentrating on the
properties and costs of individual features and the second problem by using the
active-learning paradigm. In this paper, we present our technique for modeling
a data-flow feature as a stochastic process exhibiting the Markov property. We
introduce the novel concept of databins to summarize, as Markov models, the
transitions of values for selected variables. We show by empirical studies that
databin-based classifiers are effective. We also describe ensembles of
classifiers and how they can leverage their components to improve
classification rates. We show by empirical studies that ensembles of
control-flow-based and data-flow-based classifiers can be more effective than
either component classifier alone.
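
The databin idea described above can be illustrated with a minimal sketch. The abstract does not specify how databins are defined or how the models are used for classification, so everything concrete here is an assumption made for illustration: databins are fixed numeric intervals given by `edges`, the model is a first-order Markov chain with add-one smoothing, and a trace of variable values is assigned to the behavior label whose model gives it the highest log-likelihood. All function names (`to_bin`, `fit_transition_model`, `log_likelihood`, `classify`) are hypothetical.

```python
import math


def to_bin(value, edges):
    """Map a numeric value to a databin index (edges sorted ascending).

    Assumption: databins are fixed intervals; the report may define them differently.
    """
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def fit_transition_model(traces, edges):
    """Estimate first-order Markov transition probabilities between databins."""
    n = len(edges) + 1
    # Add-one smoothing (an assumption) so unseen transitions keep nonzero probability.
    counts = [[1.0] * n for _ in range(n)]
    for trace in traces:
        bins = [to_bin(v, edges) for v in trace]
        for a, b in zip(bins, bins[1:]):
            counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood(trace, model, edges):
    """Log-probability of the trace's bin-to-bin transitions under one model."""
    bins = [to_bin(v, edges) for v in trace]
    return sum(math.log(model[a][b]) for a, b in zip(bins, bins[1:]))


def classify(trace, models, edges):
    """Return the behavior label whose model best explains the trace."""
    return max(models, key=lambda label: log_likelihood(trace, models[label], edges))


# Hypothetical value traces of one variable, grouped by observed behavior.
EDGES = [0, 10]  # three databins: value < 0, 0 <= value < 10, value >= 10
MODELS = {
    "pass": fit_transition_model([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]], EDGES),
    "fail": fit_transition_model([[1, -1, 12, -3, 15], [5, -2, 20, -1, 11]], EDGES),
}
print(classify([3, 4, 5, 6], MODELS, EDGES))
print(classify([4, -5, 30, -2], MODELS, EDGES))
```

In this sketch, a trace whose values stay within one bin resembles the hypothetical "pass" traces, while a trace that oscillates between the negative and large-value bins resembles the "fail" traces. An ensemble in the sense of the abstract would combine such a data-flow classifier with a control-flow one; how the components are combined is not stated in the abstract and is not shown here.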