July 24, 2018 feature

Using machine learning to detect software vulnerabilities

by Ingrid Fadelli, Tech Xplore

A team of researchers from R&D company Draper and Boston University developed a new large-scale vulnerability detection system using machine learning algorithms, which could help to discover software vulnerabilities faster and more efficiently.

Hackers and malicious users are constantly coming up with new ways to compromise IT systems and applications, typically by exploiting software security vulnerabilities. These vulnerabilities are often small errors made by the programmers who developed a system, and they can propagate quickly, especially through open-source software or through code reuse and adaptation.

Every year, thousands of these vulnerabilities are publicly reported to the Common Vulnerabilities and Exposures database (CVE), while many others are spotted and patched internally by developers. If they are not adequately addressed, these vulnerabilities can be exploited by attackers, often with devastating effects, as demonstrated by recent high-profile cases such as the Heartbleed bug and the WannaCry ransomware cryptoworm.

Generally, existing program analysis tools can detect only a limited number of potential errors, because they rely on predefined rules. However, the widespread use of open-source repositories has opened new possibilities for the development of techniques that could reveal code vulnerability patterns.

The researchers from Draper and Boston University have developed a new vulnerability detection tool that uses machine learning for automated detection of vulnerabilities in C/C++ source code, and it has already shown promising results.

The team compiled a large dataset with millions of open-source functions and labeled it using three static (pre-runtime) analysis tools, namely Clang, Cppcheck and Flawfinder, which are designed to identify potential exploits. Their dataset included millions of function-level examples of C and C++ code drawn from the SATE IV Juliet Test Suite, the Debian Linux distribution, and public Git repositories on GitHub.
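The labeling step can be pictured with a minimal sketch. It assumes each function arrives with a list of (tool, category) findings, and that a function counts as potentially vulnerable when any finding falls in a curated set. The category names and the `label_function` helper are hypothetical placeholders, not the analyzers' real output or the researchers' actual selection.

```python
# Hypothetical curated findings per analyzer; placeholder names only,
# not the real checker identifiers emitted by these tools.
SELECTED = {
    "clang": {"null-dereference"},
    "cppcheck": {"out-of-bounds"},
    "flawfinder": {"unsafe-copy"},
}

def label_function(findings):
    """Return 1 if any (tool, category) finding is in the curated set, else 0."""
    return int(any(cat in SELECTED.get(tool, ()) for tool, cat in findings))

# A function flagged only with a style warning stays labeled 0.
print(label_function([("cppcheck", "style-warning")]))
print(label_function([("flawfinder", "unsafe-copy")]))
```

Restricting the labels to a curated subset of findings is one plausible way to keep noisy or purely stylistic warnings out of the training labels.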

"Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code," the researchers wrote in their paper.

As programming languages are in some ways similar to human languages, the researchers designed a vulnerability detection technique that uses natural language processing (NLP), an AI strategy that allows computers to understand and interpret human language.

"We leverage feature-extraction approaches similar to those used for sentence sentiment classification with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for function-level source vulnerability classification," the researchers explained in their paper.
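To make the convolutional feature extraction concrete, here is a minimal pure-Python sketch in the style of sentence classification: each filter slides over a sequence of token embedding vectors, and only its strongest activation is kept. The function name, filter shapes and toy values are assumptions chosen for illustration; this is not the researchers' network.

```python
def conv_max_pool(seq, filters):
    """Slide each filter over the token sequence and keep its maximum
    ReLU activation (max-over-time pooling).

    seq:     list of embedding vectors, one per token
    filters: list of filters, each a list of `width` weight vectors
    """
    features = []
    for filt in filters:
        width = len(filt)
        acts = []
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            # Dot product between the filter and the current token window.
            s = sum(w * x
                    for wv, xv in zip(filt, window)
                    for w, x in zip(wv, xv))
            acts.append(max(0.0, s))  # ReLU
        # Sequences shorter than the filter produce no activations.
        features.append(max(acts, default=0.0))
    return features

# Toy example: three tokens with 2-dimensional embeddings.
seq = [[1, 0], [0, 1], [1, 1]]
filters = [
    [[1, 0], [0, 1]],  # width-2 filter
    [[-1, -1]],        # width-1 filter
]
print(conv_max_pool(seq, filters))
```

The result is one number per filter, so functions of any length map to a fixed-size feature vector that a downstream classifier can consume.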

They combined NLP with random forest (RF), a powerful algorithm that creates an ensemble of decision trees from randomly selected subsets of the training dataset and then merges their predictions, generally achieving more accurate results than a single tree.
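The ensemble idea can be sketched in a few lines of standard-library Python. This simplified version assumes depth-one trees (decision stumps), bootstrap resampling of the training rows, a random feature subset per tree, and majority voting. The function names and parameters are illustrative; a production system would normally rely on an established library implementation.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Merge the trees by majority vote."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two well-separated classes described by two features.
rows = [[v, v + 0.5] for v in range(5)] + [[v, v + 0.5] for v in range(10, 15)]
labels = [0] * 5 + [1] * 5
forest = fit_forest(rows, labels)
print(predict(forest, [-1, -1]), predict(forest, [20, 20]))
```

Because every tree sees a different resample of the data, their individual mistakes tend to differ, which is why the merged vote is usually more reliable than any single tree.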

The researchers tested their tool on both real software packages and the NIST SATE IV benchmark dataset.

"Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection," they wrote. "We applied a variety of ML techniques inspired by classification problems in the natural language domain, fine-tuned them for our application, and achieved the best overall results using features learned via convolutional neural network and classified with an ensemble tree algorithm."

So far, their work has focused on C/C++ code, but their method could also be applied to any other programming language. They specifically chose to create a custom C/C++ lexer as this would produce a simple and generic representation of function source code, which is ideal for machine learning training.
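As a rough illustration of what a "simple and generic representation" might look like, the sketch below tokenizes a C function with regular expressions and replaces identifiers and literals with placeholder tokens. The token categories, the small keyword list and the placeholder names are assumptions made for this example; the researchers' custom lexer is not reproduced here.

```python
import re

# Hypothetical token categories, tried in order at each position.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("CHAR", r"'(?:\\.|[^'\\])'"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"->|\+\+|--|<=|>=|==|!=|&&|\|\||[-+*/%=<>!&|^~?:.,;()\[\]{}]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

# Deliberately tiny keyword list, for illustration only.
KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return", "sizeof"}

PLACEHOLDERS = {"NUMBER": "<num>", "STRING": "<str>", "CHAR": "<chr>"}

def lex(source):
    """Turn C source text into a list of generic tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind in ("COMMENT", "SKIP"):
            continue  # comments and whitespace carry no tokens
        if kind == "IDENT":
            # Keywords are kept; all other names collapse to one placeholder.
            tokens.append(text if text in KEYWORDS else "<id>")
        else:
            tokens.append(PLACEHOLDERS.get(kind, text))
    return tokens

print(lex("int f(char *s) { char buf[8]; strcpy(buf, s); return 0; } // done"))
```

Collapsing names and literals keeps the vocabulary small, which is convenient for training. Note that this sketch silently skips characters it does not recognize and also hides library call names such as strcpy, which a practical lexer would probably want to preserve.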

More information: Automated Vulnerability Detection in Source Code Using Deep Representation Learning. arXiv:1807.04320v1 [cs.LG]. arxiv.org/abs/1807.04320

Abstract
Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.

Citation: Using machine learning to detect software vulnerabilities (2018, July 24) retrieved 7 May 2024 from https://techxplore.com/news/2018-07-machine-software-vulnerabilities.html

