A study of the algorithms and techniques used to generate DGA domains and to distinguish them from non-DGA domains. I developed three tools in Python 3:
- lookdga.py to generate DGA domains and detect them with regular expressions.
- statsdga.py to calculate some statistics.
- detectdga.py to train a model and detect DGA domains using machine learning.
Almost all of the DGAs were taken from https://github.com/baderj/domain_generation_algorithms and adapted in the DGAs directory.
Johannes Bader's work on DGA malware is very valuable. See it at https://johannesbader.ch/
The YARA rules for the malware were extracted from https://malpedia.caad.fkie.fraunhofer.de/
Clone the repository and install the dependencies with pip. A virtual environment with Python 3.7 or above is recommended:
$ pip install -r requirements.txt
This tool can generate the domains of various malware families that use a DGA. It can also flag a domain as a possible DGA domain using regular expressions, and it can brute-force a domain to find the day and the order in which it was generated.
List all available DGAs:
$ python lookdga.py -L
View info about a DGA:
$ python lookdga.py -m zloader -I
Generate 5 domains of a DGA:
$ python lookdga.py -m tinba -n 5 -G
Detect possible DGA domains using regular expressions:
$ python lookdga.py -D google.com nvfowikhevmy.net
Brute-force the previous domains (regular-expression detection also runs before the brute force):
$ python lookdga.py -B google.com nvfowikhevmy.net
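The brute-force idea described above can be illustrated with a minimal sketch. It is not the code in lookdga.py: `toy_dga` is a hypothetical date-seeded generator invented for this example, and `bruteforce` simply regenerates candidates for each day and index until one matches.

```python
import hashlib
from datetime import date, timedelta


def toy_dga(day, index, tld=".net"):
    """Hypothetical DGA: hash the date and a counter, map the digest to letters."""
    seed = f"{day.isoformat()}-{index}".encode()
    digest = hashlib.md5(seed).digest()
    # Map the first 12 digest bytes to lowercase letters.
    label = "".join(chr(ord("a") + b % 26) for b in digest[:12])
    return label + tld


def bruteforce(domain, start, days=30, per_day=100):
    """Search backwards from `start` for the (day, index) that yields `domain`.

    Returns None when no candidate in the search window matches.
    """
    for offset in range(days):
        day = start - timedelta(days=offset)
        for index in range(per_day):
            if toy_dga(day, index) == domain:
                return day, index
    return None


# Generate a domain, then recover its day and order of generation.
target = toy_dga(date(2019, 12, 25), 7)
print(target, bruteforce(target, date(2019, 12, 31)))
print("google.com", bruteforce("google.com", date(2019, 12, 31)))
```

A real DGA family needs its own generator and a search window that matches how the malware seeds it (for example, some families seed on the week rather than the day).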
This tool calculates statistics for the domains generated by lookdga.py. It can also calculate statistics for domains from the Alexa top million list.
Its use is self-explanatory:
$ python statsdga.py
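The statistics that statsdga.py reports are not listed here. As a rough illustration only, the sketch below computes a few measures that are commonly used to compare DGA and legitimate domains (length, Shannon entropy, vowel ratio, digit ratio). The function names and the choice of measures are assumptions, not the tool's actual output.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def domain_stats(domain):
    """Compute simple statistics on the leftmost label of a domain."""
    label = domain.lower().split(".")[0]
    letters = [c for c in label if c.isalpha()]
    return {
        "length": len(label),
        "entropy": shannon_entropy(label) if label else 0.0,
        "vowel_ratio": sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0,
        "digit_ratio": sum(c.isdigit() for c in label) / len(label) if label else 0.0,
    }


for name in ("google.com", "nvfowikhevmy.net"):
    print(name, domain_stats(name))
```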
Before detectdga.py can be used, some domain list files are needed. These files have to be saved in the ml-data directory.
First, download the Tranco list from https://tranco-list.eu/list/9Q72/full, and then split it into three files:
$ sed -n '1,1500000p' tranco.csv | cut -d ',' -f 2 > tranco-main.dom
$ sed -n '1500001,2000000p' tranco.csv | cut -d ',' -f 2 > tranco-test2.dom
$ sed -n '2000001,20000000p' tranco.csv | cut -d ',' -f 2 > tranco-ngram.dom
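Where sed and cut are not available, the same split can be sketched in Python. This assumes the CSV has the `rank,domain` layout that the commands above rely on; `split_ranked_csv` is a hypothetical helper, and a row count of `None` stands in for the open-ended last range.

```python
import csv
from itertools import islice
from pathlib import Path


def split_ranked_csv(csv_path, out_dir, parts):
    """Write the domain column of consecutive row ranges to separate files.

    `parts` is a list of (filename, row_count) pairs; a row_count of None
    means "all remaining rows". Returns the number of rows written per file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    with open(csv_path, newline="") as src:
        rows = csv.reader(src)
        for filename, count in parts:
            n = 0
            with open(out_dir / filename, "w") as dst:
                # islice pulls at most `count` rows from the shared reader,
                # so the next part continues where this one stopped.
                for row in islice(rows, count):
                    dst.write(row[1] + "\n")
                    n += 1
            written[filename] = n
    return written


# Example call mirroring the three ranges above:
# split_ranked_csv("tranco.csv", "ml-data", [
#     ("tranco-main.dom", 1_500_000),
#     ("tranco-test2.dom", 500_000),
#     ("tranco-ngram.dom", None),
# ])
```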
Second, download the DGA domains for the main dataset from https://data.netlab.360.com/feeds/dga/dga.txt and save the file in the ml-data directory.
To create the second test dataset:
$ python lookdga.py -d 2019-12-31 -n 3000 -C > ml-data/mydgas.dom
The n-gram dictionary was created and saved in the repository, but it can be recreated with:
$ python3 detectdga.py --ngram
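As a rough idea of what an n-gram dictionary can look like, the sketch below counts character n-grams over a list of legitimate domains and scores a new domain by how familiar its n-grams are. The n-gram sizes, the scoring rule, and the function names are assumptions for illustration; the format detectdga.py actually uses may differ.

```python
from collections import Counter


def ngrams(label, n):
    """Return all contiguous character n-grams of a string."""
    return [label[i:i + n] for i in range(len(label) - n + 1)]


def build_ngram_counts(domains, sizes=(2, 3)):
    """Count n-grams over the leftmost label of each domain."""
    counts = Counter()
    for domain in domains:
        label = domain.strip().lower().split(".")[0]
        for n in sizes:
            counts.update(ngrams(label, n))
    return counts


def ngram_score(domain, counts, sizes=(2, 3)):
    """Average dictionary count of the domain's n-grams (low means unusual)."""
    label = domain.strip().lower().split(".")[0]
    grams = [g for n in sizes for g in ngrams(label, n)]
    if not grams:
        return 0.0
    return sum(counts[g] for g in grams) / len(grams)


counts = build_ngram_counts(["google.com", "goo.gl"])
print(ngram_score("goo.net", counts), ngram_score("xqz.com", counts))
```

In practice the dictionary would be built from a large list of legitimate domains, such as the tranco-ngram.dom file created earlier.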
Now, we can create the main dataset:
$ python detectdga.py --main
To create the secondary dataset or the Alexa dataset, use the --secondary or --alexa option.
The repository includes the full model, but it can be retrained, or a different model can be trained (ngram, nosyll...):
$ python detectdga.py --train ngram
To test it against the secondary dataset (use --test_alexa for the Alexa dataset):
$ python detectdga.py --test_secondary ngram
Finally, to test domains with the full model (the -b option also tries the brute force; remove it to use only the ML model):
$ python detectdga.py -b --check ughdnmmgdpscliraqnpl.com nvfowikhevmy.net uquslaigwaannie.ddns.net google.com facebook.com
