ExtraHop Shares Huge Dataset for Detecting Domains Generated by Algorithm on GitHub

September 12, 2023

Over the past 25 years, detecting and blocking traffic from botnets and other malicious domains has been difficult for cybersecurity professionals as threat actors have used sophisticated techniques to avoid being shut down.

Specifically, threat actors increasingly employ domain generation algorithms (DGAs) in malware and botnet operations to establish communication between infected computers and their command-and-control (C&C) servers.

In traditional botnet architectures, the C&C server's IP address or domain name is hard-coded into the malware. However, this makes it easier for security professionals to identify and block these servers. To overcome this limitation, attackers employ DGAs to dynamically generate a large number of domain names that can be used to communicate with the C&C server.

DGAs generate domain names using an algorithm that takes into account factors such as time, date, seed values, or other variables. By generating a large pool of potential domain names, the malware can attempt to connect to one of these domains periodically or when certain conditions are met. This makes it difficult for security systems to predict or block the exact domain names used by the malware.
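To make the idea concrete, here is a minimal, illustrative sketch of a date-seeded generator. It is not taken from any real malware family or from the ExtraHop dataset; the function name toy_dga, the seed string, the SHA-256 hashing step, and the ".example" suffix are all assumptions chosen for the illustration.

```python
from __future__ import annotations

import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".example") -> list[str]:
    """Derive `count` pseudo-random domain names from a seed and a date.

    Hypothetical illustration only: real DGAs vary widely in design.
    """
    domains = []
    for index in range(count):
        # Mix the shared seed, the date, and a counter into one input string.
        material = f"{seed}-{day.isoformat()}-{index}".encode("ascii")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits (0-15) onto the letters a-p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    for name in toy_dga("demo-seed", date(2023, 9, 12)):
        print(name)
```

Because the output depends only on the seed and the date, anyone running the same algorithm can compute the same list of names, while a defender who sees only the resulting DNS queries sees strings with no obvious pattern. That asymmetry is what makes classification from the domain string itself attractive.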

The purpose of DGAs is to make it harder for security researchers and network administrators to disrupt or shut down the communication channels between infected computers and the C&C server. By constantly changing the domain names, malware authors can maintain control over the botnet and continue to issue commands to compromised systems without detection.

Detecting and mitigating DGAs is a challenging task for cybersecurity professionals, as they need to analyze the algorithm used by the malware, monitor DNS requests, and implement advanced techniques to identify and block malicious domain names generated by the DGA.

New Tool to Help

Today, ExtraHop has taken a significant step toward helping organizations defend against DGA-aided attacks by releasing on GitHub a massive open source machine learning dataset designed for detecting DGAs.

The dataset, one of the largest available for this use case, consists of 16 million rows of data. In contrast, many other datasets we’ve reviewed contained much less data and had several limitations.

Originally built for the ExtraHop Reveal(x) network detection and response (NDR) platform, this dataset can now be used by any security researcher to construct their own machine learning (ML) classifier model to identify DGAs more quickly and intervene in attacks with greater speed and precision. Since its implementation in Reveal(x), the ExtraHop DGA model has demonstrated more than 98% accuracy.

Improving the Data for Detecting DGAs

ExtraHop began the research that resulted in this dataset because we were not satisfied with how well existing models identified DGAs. The ExtraHop team made several attempts to improve the models using feature engineering, model selection, and testing before hitting upon some methods to improve the accuracy of the data.

We went through several cycles of reviewing academic research on DGAs, testing model architectures, feature engineering the data, writing training and testing code, and training and testing the models to ultimately create the dataset we are releasing.

Feature engineering, involving the extraction and transformation of variables from raw data, was an important part of the process. To create a good DGA tool, we needed both a good dataset and a strong feature engineering process.

Make It Simple

When we started on the project, we used available automated feature engineering tools, but they produced overly complex features that in many cases negatively impacted the model. These tools have improved dramatically since then, but at the time, we realized we didn’t need to use such complicated tools, and we ended up using a rather simple method for encoding symbols.

Ultimately, our feature vector was simple:

We made a list of all legal characters that can be in a domain name. In Python, it looks something like this: keys = ['a', 'A', '1', '2', ...]
We created a lookup table mapping keys to integer values. In Python, it looks like this: lookup_table = {}, then lookup_table['A'] = 1, lookup_table['B'] = 2, and so on.
To avoid injecting an ordering bias or a magnitude bias, we randomly assigned both the keys and their integer values.
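The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact code used for the dataset: the character set in KEYS, the helper names build_lookup_table and encode_domain, the max_len parameter, and the use of 0 as a padding value are all hypothetical choices made for the example.

```python
import random
import string

# Assumption: letters, digits, hyphen, and the dot separator stand in for the
# full list of legal characters, which the article does not spell out.
KEYS = list(string.ascii_lowercase + string.ascii_uppercase + string.digits + "-.")


def build_lookup_table(keys, rng):
    """Map each character to a unique integer, with both sides shuffled."""
    shuffled_keys = list(keys)
    rng.shuffle(shuffled_keys)
    # Values start at 1 so that 0 can serve as padding (an assumption).
    values = list(range(1, len(shuffled_keys) + 1))
    rng.shuffle(values)
    return dict(zip(shuffled_keys, values))


def encode_domain(domain, lookup_table, max_len=64):
    """Turn a domain string into a fixed-length list of integers.

    Characters outside the lookup table raise KeyError; longer domains
    are truncated to max_len and shorter ones are right-padded with 0.
    """
    codes = [lookup_table[ch] for ch in domain[:max_len]]
    return codes + [0] * (max_len - len(codes))


if __name__ == "__main__":
    table = build_lookup_table(KEYS, random.Random(0))
    print(encode_domain("example.com", table, max_len=16))
```

Shuffling the assignment means that 'a' is no more likely to receive a small integer than 'z' is, so a model cannot pick up on alphabetical order as if it carried meaning. A fixed seed, as in the usage example, keeps the mapping reproducible between training and inference.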

To test the dataset, ExtraHop found three methods of identifying DGAs with promising results:

The LSTM method was based on the research paper, “Predicting Domain Generation Algorithms with Long Short-Term Memory Networks.” We implemented the method in TensorFlow 2.0, and trained, tuned, and tested it using SageMaker. Its accuracy was over 93.6%.
The UW method came from the research paper, "Character Level Based Detection of DGA Domain Names." We also used TensorFlow 2.0 and SageMaker. The accuracy was over 95.1%.
Finally, we took the existing XGBoost algorithm and applied it to the DGA problem. It was implemented using XGBoost 0.90-1 and trained, tuned, and tested in SageMaker. Its accuracy was over 94.8%.

With this dataset, we were able to demonstrate the accuracy of results using these models, and we hope others in the cybersecurity community can use the dataset to implement a predictive DGA model that is highly accurate and protect their organizations against malware, botnets, and other attacks. Download the dataset at GitHub.

Todd Kemmerling

Director of Data Science

A highly motivated and accomplished technologist with over 20 years' hands-on experience designing, managing, planning, and implementing state-of-the-art software and electronic systems. I have a passion for solving hard problems that produce value for everyone.