Catherine Robinson, Julie Weber, Bartlomiej Uscilowski and Thomas Parsons (Symantec)
The volume of malicious software now being created is so high that it has prompted discussion in the AV industry about whether a blacklisting model will remain feasible. In this context, clean data sets are becoming increasingly important, as is the need to classify them.
In this paper, we discuss problems and solutions related to gathering and profiling large clean data sets. We provide guidelines for gathering clean files, keeping them uncompromised, and determining their level of trust and intrinsic quality (usefulness).
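As a minimal sketch of what keeping a clean set uncompromised might involve in practice (this is an illustration, not the paper's method), one common approach is to record a cryptographic digest for each file at ingestion time and re-verify it before use; the manifest structure below is an assumption for illustration.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_corpus(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths whose current digest no longer matches
    the one recorded when the file entered the clean set."""
    tampered = []
    for rel_path, recorded_digest in manifest.items():
        if sha256_of(root / rel_path) != recorded_digest:
            tampered.append(rel_path)
    return tampered
```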
We present a systematic approach to profiling files and managing the metadata in a clean set. Considering the nature of the data to be extracted, we group the profiling metadata into two categories: lower-level and higher-level information. The lower-level data is extracted automatically, directly from the files, and contains information that helps locate files and determine their type. Higher-level metadata consists of information that enables file categorisation. We present possible sources of this information, which can be obtained automatically or through manual annotation. We also attempt to define a naming convention for identifying software and for standardising the type of data that can be queried.
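To make the two metadata tiers concrete, the following Python sketch shows one hypothetical profile record; the field names, the TrustLevel scale and the example categories are illustrative assumptions, not the schema defined in the paper.

```python
from dataclasses import dataclass
from enum import Enum

class TrustLevel(Enum):          # hypothetical trust scale
    UNKNOWN = 0
    COMMUNITY = 1
    VENDOR_SIGNED = 2

@dataclass
class CleanFileProfile:
    # Lower-level metadata: extracted automatically from the file itself,
    # used to locate the file and determine its type.
    sha256: str
    size_bytes: int
    file_type: str                # e.g. "PE32", "ELF", "script"

    # Higher-level metadata: supports categorisation; may come from
    # automated sources or from manual annotation.
    vendor: str | None = None
    product: str | None = None    # standardised software name
    category: str | None = None   # e.g. "office suite", "driver"
    trust: TrustLevel = TrustLevel.UNKNOWN
```

Keeping the two tiers in one record, with the lower-level fields mandatory and the higher-level fields optional, reflects the idea that automated extraction always succeeds while categorisation may be filled in later.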
Finally, we examine existing clean data sets, both profiled and unprofiled, and their shortcomings for this purpose.