Wei Xu Palo Alto Networks
Xinran Wang Palo Alto Networks
Huagang Xie Palo Alto Networks
Yanxin Zhang Palo Alto Networks
PDF has become a popular vector for malware distribution and other malicious activities. Although malicious PDF documents are prevalent in the wild, existing approaches to detecting them, or to detecting malicious content within a document, are limited by their run-time performance and scalability. To address this issue, we propose a fast and precise malicious PDF filter.
Based on our analysis of the characteristics of malicious PDF documents, we extract a set of novel and predictive features, such as malformed cross-references and suspicious filter pipelines. To the best of our knowledge, over a dozen of the proposed features have not appeared in previous work.
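The abstract names these two features but does not say how they are computed. As a rough illustration only, the sketch below assumes that a "suspicious filter pipeline" can be approximated by the length of the longest `/Filter` array, and that a "malformed cross-reference" can be approximated by a `startxref` offset that points at neither an `xref` keyword nor an object header. The function names and regular expressions are hypothetical, and the byte-level matching is a simplification of real PDF parsing.

```python
import re

# Hypothetical patterns: a chained filter array such as
# "/Filter [/ASCIIHexDecode /FlateDecode]" and the trailer's "startxref <offset>".
FILTER_ARRAY = re.compile(rb"/Filter\s*\[([^\]]*)\]")
STARTXREF = re.compile(rb"startxref\s+(\d+)")
OBJ_HEADER = re.compile(rb"\d+\s+\d+\s+obj")


def max_filter_chain(data: bytes) -> int:
    """Number of filter names in the longest /Filter array (0 if none)."""
    lengths = [
        len(re.findall(rb"/\w+", m.group(1)))
        for m in FILTER_ARRAY.finditer(data)
    ]
    return max(lengths, default=0)


def xref_offset_mismatch(data: bytes) -> bool:
    """True if the last startxref offset is missing or points at neither
    an 'xref' table nor an object header (as a cross-reference stream would)."""
    offsets = STARTXREF.findall(data)
    if not offsets:
        return True
    offset = int(offsets[-1])
    target = data[offset:offset + 32]
    return not (target.startswith(b"xref") or OBJ_HEADER.match(target))


def extract_features(data: bytes) -> dict:
    """Collect the two illustrative features into one record."""
    return {
        "max_filter_chain": max_filter_chain(data),
        "xref_offset_mismatch": xref_offset_mismatch(data),
    }
```

Both checks run in a single pass over the raw bytes without rendering or executing anything, which is the kind of cost profile a fast filter would need.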
We also propose a systematic classification of features that covers various aspects of a malicious PDF document (i.e., document structure, embedded code, and PDF functionality). To better leverage these features with machine learning techniques, we studied the trade-offs between performance and accuracy across different machine learning models and chose a linear model for the filter. In the implementation, we tuned the system based on the predictive power of different features, the strengths of different models, and the feedback from the training phase to maintain high accuracy. This tuning process can also adapt the system to various practical purposes, e.g., as a pre-filter in a multi-level detection system or as a standalone intrusion detection module. Our evaluation on over 25,000 labelled PDF documents and over 150,000 real-world PDF documents demonstrates both a low false positive rate and high detection accuracy. It also shows that the performance of the filter is suitable for online scenarios such as residing in a firewall.
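The abstract does not specify which linear model was used or how it was tuned. The sketch below is one plausible reading, under stated assumptions: a document is scored by a weighted sum of its features, and the decision threshold is chosen from labelled data so that the false positive rate stays under a target. The names `linear_score` and `pick_threshold`, and all weights and scores in the usage note, are hypothetical.

```python
from typing import Sequence


def linear_score(weights: Sequence[float], bias: float,
                 features: Sequence[float]) -> float:
    """Weighted sum of feature values plus a bias term."""
    return bias + sum(w * x for w, x in zip(weights, features))


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   max_fpr: float) -> float:
    """Return a threshold t such that flagging documents with score > t
    keeps the false positive rate on benign samples (label 0) <= max_fpr."""
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not benign:
        return min(scores)
    allowed = int(max_fpr * len(benign))  # benign samples we may flag
    if allowed >= len(benign):
        return float("-inf")  # every sample may be flagged
    # Only benign scores strictly above this value are flagged,
    # and there are at most `allowed` of them.
    return benign[allowed]


def is_malicious(score: float, threshold: float) -> bool:
    """Flag a document when its score exceeds the chosen threshold."""
    return score > threshold
```

With illustrative scores `[0.1, 0.4, 0.9, 2.0, 3.0]` and labels `[0, 0, 0, 1, 1]`, a target of `max_fpr=0.0` yields a threshold of `0.9`, while a looser target of `0.34` yields `0.4`. A pre-filter in a multi-level system might accept the looser setting so that fewer malicious documents slip through, whereas a standalone module might prefer the stricter one.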