REU Research Projects

Project 1: Intrusion detection for cyber-physical systems

Cyber-physical systems (CPSs) are a new class of engineered systems that integrate physical resources with computational and communication components. Because the cyber and physical domains are tightly integrated, securing these systems against attacks is a great challenge. Intrusion detection systems (IDSs) have been shown to be an effective way to monitor a network or system for malicious attacks. Designing an IDS for a CPS is challenging because the design must account for the unique properties of CPSs. The aim of this project is to develop a distributed data-driven IDS for a large-scale CPS such as a smart grid. Novel data-driven algorithms that combine knowledge-based and behavior-based detection will be developed to detect both existing and zero-day attacks. Bio-inspired approaches will be investigated as possible solutions for developing the algorithms and systems.
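
As a rough illustration of what combining knowledge and behavior can mean, the sketch below checks each event first against a list of known attack signatures and then against a statistical baseline of normal readings. It is not the project's algorithm: the command names, the frequency readings, and the z-score threshold are all hypothetical values chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical signatures of known-bad control commands (knowledge-based part).
KNOWN_BAD_COMMANDS = {"firmware_overwrite", "open_all_breakers"}


def classify(event, baseline, z_threshold=3.0):
    """Label one event as 'known attack', 'anomaly', or 'normal'."""
    # Knowledge-based check: match against known attack signatures.
    if event["command"] in KNOWN_BAD_COMMANDS:
        return "known attack"
    # Behavior-based check: flag readings far from the normal baseline.
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma > 0 and abs(event["reading"] - mu) / sigma > z_threshold:
        return "anomaly"
    return "normal"


# Hypothetical grid-frequency readings (Hz) recorded during normal operation.
baseline = [59.90, 60.00, 60.10, 60.00, 59.95, 60.05]
events = [
    {"command": "read_meter", "reading": 60.02},
    {"command": "read_meter", "reading": 57.00},
    {"command": "firmware_overwrite", "reading": 60.00},
]
labels = [classify(e, baseline) for e in events]
```

The second event carries no known signature, so only the behavior-based check can flag it; this is the role such a check would play for attacks that have not been seen before.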

Project 2: PDF malware detection with visualization techniques and deep learning

PDF (Portable Document Format) is a file format invented by Adobe for presenting, exchanging, and archiving documents independently of hardware, software, and operating systems. As one of the most widely used file formats, PDF has become one of the major vectors for malware attacks. This is mainly due to the flexibility of the PDF file structure and its ability to embed different kinds of content such as JavaScript code, encoded streams, and image objects. Attackers can exploit these features to embed malware in PDF files using tools like Metasploit. For example, it has been reported that ransomware can be hidden inside PDF documents to launch attacks. Various PDF malware analysis techniques have been proposed to address these challenges, including keyword-based, tree-based, code-based, and machine learning-based techniques. This project will attack the problem using a different approach: image visualization techniques and deep learning will be investigated to build advanced models for PDF malware detection.
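
One common way to visualize a file, sketched below, is to treat each byte as a grayscale pixel and lay the bytes out row by row. This is only an assumption about what the visualization step might look like; the function name `bytes_to_grayscale`, the row width, and the sample bytes are hypothetical, and the project may use a different encoding.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Lay out raw bytes row by row as a 2-D grid of pixel intensities (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad so that the last row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Hypothetical fragment of a PDF that contains a JavaScript action.
sample = b"%PDF-1.7\n1 0 obj\n<< /JS (app.alert(1)) >>\nendobj\n"
image = bytes_to_grayscale(sample, width=8)
```

The resulting grid could then be resized and fed to an image classifier such as a convolutional neural network.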

Project 3: Privacy preserving for location-based services using spatial transformations

While mobile users would like to use location-based services (LBSs) to obtain answers to queries such as ‘Find me the nearest Italian restaurant with a rating > 3 on’, they would also like to preserve their privacy by not disclosing their exact location. This project will investigate the technique of hiding the exact location through a spatial transformation. Hilbert curves are space-filling curves that provide such a transformation: they act as hash functions that preserve proximity in the domain only to a limited extent, mapping the (x, y)-coordinates of points of interest (e.g., restaurants) to non-negative integers. A trusted entity transforms (encrypts) the location of each point of interest into the corresponding integer and sends these encoded locations to the LBS, while sharing the parameters used for the transformation (the encryption key) only with the user and not with the LBS.

To query the LBS, the user encrypts their own location and sends it to the LBS, which finds the nearest point of interest in the encrypted space and returns that location. The user then decrypts the returned location to obtain the actual location. There are two interesting research questions to be investigated. First, to what extent is the ‘nearby’ point of interest in the encrypted space actually close in the (x, y) space? How effective are heuristics such as the use of two orthogonal curves at mitigating non-proximate results? Second, to what extent can an adversary perform the decryption given limited knowledge of the parameters used to create the Hilbert curve? In particular, are some parameters more crucial than others?
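
The mapping and the first research question can be sketched as follows. The function `hilbert_index` is the standard iterative conversion from grid coordinates to a position along a Hilbert curve; here the curve order is the only transformation parameter, whereas the actual key used in the project may include further parameters that the description does not specify. The points of interest and the user location are hypothetical.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on a 2**order by 2**order grid to its Hilbert index."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def nearest_by_index(order, pois, user):
    """What the LBS would return: the closest point in the encoded 1-D space."""
    u = hilbert_index(order, *user)
    return min(pois, key=lambda p: abs(hilbert_index(order, *p) - u))


def nearest_by_distance(pois, user):
    """Ground truth: the closest point in the original (x, y) space."""
    return min(pois, key=lambda p: (p[0] - user[0]) ** 2 + (p[1] - user[1]) ** 2)


ORDER = 3  # hypothetical key parameter: an 8 x 8 grid
pois = [(0, 1), (3, 3), (6, 2), (7, 7)]  # hypothetical points of interest
user = (4, 1)  # hypothetical user location
encoded_choice = nearest_by_index(ORDER, pois, user)
true_choice = nearest_by_distance(pois, user)
```

Comparing `encoded_choice` with `true_choice` over many random user locations would give one simple measure of how often the encoded nearest neighbor differs from the true one.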

Project 4: Predicting Zero-day Attacks Using Data Science

Enterprise software abounds with security vulnerabilities which, when exploited through malicious attacks, result in serious losses to the organization. Fortunately, a large number of such weaknesses are known and lead to updates or patches for operating systems and application software, which can preempt known attacks if applied. However, there are a few problems: (1) in reality, the recommended patches are often not applied in a timely fashion; (2) attacks exploiting a flaw may occur before a software remedy is ready; and (3) an attack may exploit a flaw that is hitherto unknown to the community. The latter two are known as zero-day attacks, which are the hardest to address. In this regard, one observation is extremely useful: many zero-day attacks are multi-step attacks, i.e., even when they capitalize on an unknown flaw (known as a zero-day exploit), that flaw is part of a sequence of known vulnerabilities. For example, consider an attacker who installs a trojan horse by first using brute-force key guessing to gain root privilege on an SSH server and then exploiting an NFS misconfiguration that allows files to be installed through a public directory. This sequence would have been a zero-day attack at a time when the particular trojan horse itself was novel but the steps needed to install it were known.

Common Vulnerabilities and Exposures (CVE) is a database that lists publicly disclosed cybersecurity vulnerabilities, with each issue given a unique CVE number. The aim of this project is to predict such attack paths from known vulnerabilities listed in the CVE database. For example, CVE-2008-0166 and CVE-2008-1000028 would have described the first two steps of the trojan horse example. In this project, we will download the CVE list from the National Vulnerability Database (NVD) and pre-process it using Python natural language libraries to extract keywords of interest (OS resources and their effects). A machine learning model will be trained to map CVE keywords to an edge between two nodes (e.g., an edge from a process to a socket); these edges will be used to generate graphs of known attack paths according to the timestamps of the CVEs. We will examine known multi-step zero-day attacks and the times at which they occurred, and show how they fit with the graph for that time. In the last step of the project, a classifier will be trained to detect certain missing edges as potential path concatenations that indicate zero-day attack vectors.
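
The graph construction described above can be sketched as follows, assuming the keyword-to-edge mapping has already been done. The records, identifiers, years, and node names are hypothetical placeholders rather than real CVE entries, and the depth-first path search is only one simple way to enumerate attack paths.

```python
from collections import defaultdict

# Hypothetical records: (identifier, year published, source node, target node).
RECORDS = [
    ("VULN-A", 2008, "remote_user", "ssh_root"),
    ("VULN-B", 2008, "ssh_root", "nfs_public_dir"),
    ("VULN-C", 2010, "nfs_public_dir", "trojan_installed"),
]


def build_graph(records, as_of_year):
    """Keep only the edges whose vulnerability was known by as_of_year."""
    graph = defaultdict(list)
    for vuln_id, year, src, dst in records:
        if year <= as_of_year:
            graph[src].append((dst, vuln_id))
    return graph


def attack_paths(graph, start, goal, path=()):
    """Yield every sequence of vulnerability ids leading from start to goal."""
    if start == goal:
        yield list(path)
        return
    for dst, vuln_id in graph.get(start, []):
        if vuln_id not in path:  # never reuse an edge, so cycles terminate
            yield from attack_paths(graph, dst, goal, path + (vuln_id,))


paths_2009 = list(attack_paths(build_graph(RECORDS, 2009), "remote_user", "trojan_installed"))
paths_2010 = list(attack_paths(build_graph(RECORDS, 2010), "remote_user", "trojan_installed"))
```

In this toy data the 2009 graph stops one edge short of the goal. A gap of that kind is what the final classification step would try to flag as a potential zero-day step.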

Project 5: Automated Threat Report Generation Using Natural Language Processing

Threat reports describe the tactics, techniques, and procedures of cyber attacks and are shared among concerned individuals to raise awareness of such attacks. The reports are typically generated manually, potentially with the help of tools such as sandboxes. Since threat reports are generated and published by many different organizations, they usually lack a common structure and sometimes even common information content. Thus, there is a great need to automatically generate structured threat reports from the reports of different organizations so that attacks can be detected and mitigated in a timely and cost-effective manner. However, achieving automated threat report generation faces many challenges, such as non-standard report formats and threat categories and the limited availability of labeled data.

The goal of this project is to use natural language processing (NLP) techniques to address several challenges in automated threat report generation. The project includes the following sub-tasks: (1) automatically categorize threat reports by attack family; (2) determine whether two threat reports describe the same attack family; (3) generate a structured threat report from a partial report using examples and other information about the corresponding attack family. We will solve critical technical issues for these sub-tasks, including adapting NLP techniques to the specific domain of threat reports. Since many NLP techniques use pre-trained word vectors generated from general text, adapting them to a specific domain such as threat reports is non-trivial. Different domain adaptation techniques (model-centric, data-centric, and hybrid) will be investigated in this project to find the most suitable method.
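
For sub-task (2), a very simple baseline is to compare the word counts of two reports with cosine similarity, as sketched below. This is a bag-of-words illustration only and not the domain-adapted approach the project aims for; the report snippets are made-up toy text.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up toy snippets standing in for threat reports.
report_1 = "Ransomware encrypts files and demands payment in bitcoin"
report_2 = "The ransomware encrypts user files and demands a bitcoin payment"
report_3 = "Phishing email steals browser credentials through a fake login page"

same_family_score = cosine_similarity(report_1, report_2)
different_family_score = cosine_similarity(report_1, report_3)
```

A baseline like this treats every word as unrelated to every other word, which is one reason why word vectors adapted to the threat-report domain are of interest.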

Project 6: Malicious Browser Extension Detection with Advanced Machine Learning Techniques

Browser extensions are created using web technologies such as HTML, CSS, and JavaScript so that users can customize the browsing experience and extend the functionality of browsers. Google Chrome extensions, Opera add-ons, and Safari extensions are typical examples. Browser extensions can access privileged APIs to arbitrarily change the content of webpages and gain access to private information, which adversaries can abuse as an attack vector. Through malicious browser extensions, adversaries can launch man-in-the-browser (MitB) attacks to hijack session cookies, steal private information, and modify the data inside the browser. In this project, we aim to detect malicious browser extensions using advanced machine learning techniques. The literature describes a large number of features extracted from extensions, including both static features extracted from an extension’s manifest permissions, content scripts, background pages, and CSS files, and dynamic features extracted from extension API calls, DOM (Document Object Model) operations, and network requests at run time. Thus, a feature selection process is necessary to evaluate the importance of these features and select the most relevant ones. We will investigate bio-inspired feature selection algorithms such as Particle Swarm Optimization (PSO), which have demonstrated superior performance in extracting qualified feature subsets in many real-world problems. Various machine learning algorithms will be studied to build the detection models using the selected features. Advanced techniques such as ensemble learning and cost-sensitive learning will be explored to improve detection performance. The participants will collect a dataset consisting of both legitimate and malicious browser extensions from online sources for this project.
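
A minimal sketch of binary PSO for feature selection is shown below. Each particle is a 0/1 mask over the features, and a sigmoid of the velocity gives the probability that a bit is set. The fitness function here is a hypothetical stand-in built from made-up relevance scores with a per-feature penalty; in practice it would be something like the validation accuracy of a classifier trained on the selected features. The parameter values (`w`, `c1`, `c2`, swarm size, velocity clamp) are assumptions for illustration.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=12, n_iters=40,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best mask seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best mask seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][j]
                     + c1 * r1 * (pbest[i][j] - pos[i][j])
                     + c2 * r2 * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, v))  # clamp to keep the sigmoid stable
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical per-feature relevance scores; not measured from any dataset.
RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.7, 0.0]


def toy_fitness(mask):
    """Reward relevant features and charge 0.3 for every feature kept."""
    return sum(r for r, m in zip(RELEVANCE, mask) if m) - 0.3 * sum(mask)


best_mask, best_score = binary_pso(toy_fitness, n_features=len(RELEVANCE))
```

PSO is a stochastic heuristic, so the returned mask is the best one the swarm found and is not guaranteed to be optimal.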

Project 7: Cybersecurity Risks in Precision Agriculture: Modeling and Analytics

By 2050, global food production will have to increase by approximately 70% to meet the needs of a rapidly growing population. Precision agriculture plays a critical role in achieving this goal: it adopts modern information and communication technologies (ICT) and information management systems to shift the agricultural industry from labor-intensive to technology-native, significantly increasing productivity. On the other hand, the wide adoption of these cyber technologies and systems in agriculture, including crop and livestock production processes, has exposed an industry that was mostly mechanical to a growing number of cyber attacks. The goal of this project is to gain a better understanding of the cybersecurity risks in agricultural production processes that are equipped with modern precision agriculture technologies. Because this is a new research area with limited existing data, we will adopt two approaches: (1) we will use agent-based modeling to simulate the interactions between precision agriculture technology adoption and cybersecurity risk, which will allow us to conduct cost-benefit analyses of different risk management strategies and policies; (2) we will develop a web crawler to collect data on cyber attack events related to precision agriculture from different online sources. The collected data will be analyzed using appropriate statistical and econometric models (e.g., a dynamic discrete choice model), which will allow us to develop analytical tools that provide predictive insights into future attack events and risks.

Project 8: Securing In-door Wireless Networks Using Modulating Retro-reflector (MRR) Tags

Short-range radio-based wireless communication technologies, such as Zigbee, Bluetooth, and WiFi, are existing candidates for supporting two-way communication between smart appliances, IoT sensors, mobile terminals, and the gateway in an indoor environment. However, this comes at a price, as omnidirectional radio signals are vulnerable to intentional interception. Common cyber attacks in a radio-based wireless network include passive attacks, masquerading, replay attacks, denial of service (DoS) attacks, and man-in-the-middle (MITM) attacks. In a WiFi-based network, WPA and WPA2 with a pre-shared key (PSK) remain vulnerable to password cracking attacks. Once adversaries discover the PSK, they can potentially decrypt all packets encrypted with it. Although the more advanced 802.1X authentication provides stronger key protection, it requires a RADIUS server and possibly also an Active Directory server, which is costly for residential and small business settings.
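
The weakness of pre-shared keys can be illustrated with the key derivation itself. In WPA/WPA2-PSK, the 256-bit master key is derived from the passphrase and the network name (SSID) using PBKDF2 with HMAC-SHA1 and 4096 iterations, so anyone who knows the SSID can test candidate passphrases offline. The sketch below is a simplification under stated assumptions: the SSID, passphrase, and word list are hypothetical, and a real attacker would check guesses against a captured handshake rather than against the key directly.

```python
import hashlib


def wpa2_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the 256-bit WPA/WPA2-PSK master key from passphrase and SSID."""
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode(), ssid.encode(), 4096, 32)


def dictionary_search(target_pmk: bytes, ssid: str, candidates):
    """Return the first candidate passphrase that yields target_pmk, else None."""
    for guess in candidates:
        if wpa2_pmk(guess, ssid) == target_pmk:
            return guess
    return None


SSID = "HomeGateway"  # hypothetical network name
network_key = wpa2_pmk("sunshine2024", SSID)  # hypothetical weak passphrase
word_list = ["password", "letmein", "sunshine2024", "qwerty123"]
recovered = dictionary_search(network_key, SSID, word_list)
```

Because the derivation depends only on the passphrase and the SSID, its strength rests entirely on how hard the passphrase is to guess.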

In this project, we will develop a heterogeneous radio frequency (RF) and modulating retro-reflector (MRR) system to secure the wireless communication between the user equipment (UE) and the wireless network gateway in an indoor environment. The UE in an indoor wireless network connects to the gateway through two different wireless channels: a duplex MRR link for the data exchange of sensitive tasks (e.g., key exchange, mutual authentication, and the association process), and a duplex RF link for transmitting and receiving data encrypted with the key shared through the MRR link. The MRR solution, chosen for its extremely narrow interception range, will be adopted by interfacing MRR tags with the UE. Optical modulators, such as liquid crystal shutters, are mounted on top of the retro-reflective tags to modulate the uplink data transmission from the tags to the lighting infrastructure. In addition, a cross-layer protocol will be implemented to complete the authentication, association, and key management processes over the MRR interface and to perform the encrypted data exchange over the conventional radio interface.