I still think DNS logs are one of the most overlooked resources for intrusion and malware detection. Command and control servers frequently use specific top-level domains or host names, and because of short TTL values, infected hosts will query DNS servers for these names often.
Additionally, DNS servers are overlooked choke points, which are as valuable for collecting network-wide data as the firewalls and routers connecting the network to the internet.
In this diary, I would like to introduce a simple shell script to answer one question that, in my opinion, is quite useful for detecting anomalous DNS queries: Which are the top 10 new host names that we looked up today?
First, you need DNS query logs. There are two ways to collect them: you can either enable query logging in your DNS server, or use tcpdump on the DNS server to capture the queries. Query logging works fine for me, but it can put too much strain on a very busy name server. Running tcpdump on the name server, or on a sensor monitoring the name server, may work better. We do not have to capture every single query for this technique to work.
Next, we need to summarize past queries. In my case, the query logs are rotated hourly and saved in files with names like query.log.* (* is a number). A sample line from my query logs:
16-Aug-2012 21:42:00.260 queries: info: client 10.5.0.210#54481: query: a1406.g.akamai.net IN A + (192.0.2.1)
To extract the host names and summarize them, I use the following command, which writes its result to the file oldlog:
cat query.log.* | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > oldlog
This will sort the output by hostname (sort -k2 sorts by the second column), which becomes important later.
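As a quick sanity check, the extraction pipeline can be tried on a handful of log lines first. The following is only a minimal sketch: the second and third log entries, the file name sample.log, and the scratch directory are made up for illustration, and only the line format is taken from the sample above.

```shell
# Scratch directory so no existing files are touched (illustrative only).
workdir=$(mktemp -d)

# Byte-wise collation keeps the sort order predictable across systems.
export LC_ALL=C

# Three log lines in the format shown above; the last two are invented.
cat > "$workdir/sample.log" <<'EOF'
16-Aug-2012 21:42:00.260 queries: info: client 10.5.0.210#54481: query: a1406.g.akamai.net IN A + (192.0.2.1)
16-Aug-2012 21:42:01.100 queries: info: client 10.5.0.211#40000: query: www.example.com IN A + (192.0.2.1)
16-Aug-2012 21:42:02.300 queries: info: client 10.5.0.210#54482: query: a1406.g.akamai.net IN A + (192.0.2.1)
EOF

# Strip everything up to "query: ", keep the host name, count duplicates,
# and sort the result by host name (second column).
summary=$(sed -e 's/.*query: //' "$workdir/sample.log" | cut -f 1 -d' ' | sort | uniq -c | sort -k2)
printf '%s\n' "$summary"
```

This should list a1406.g.akamai.net with a count of 2, followed by www.example.com with a count of 1. The amount of leading whitespace in front of the counts depends on the uniq implementation.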
Next, I apply the same procedure to the current log:
cat query.log | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > newlog
Now we need to find all entries in newlog that are not included in oldlog. To do so, we use the Unix command join, which works pretty much like the SQL JOIN, but uses two text files as input. It is important that both files are sorted on the join column (the host name), which was the reason for the sort -k2 earlier.
join -1 2 -2 2 -a 2 oldlog newlog > combined
The -a 2 option also includes all records from newlog that are not found in oldlog. Each line in combined starts with the host name. Names found in both files are followed by two counts (old and new), while names only found in newlog are followed by a single count. We need to remove the lines found in both files (which are identified by having two numbers):
cat combined |egrep -v '.* [0-9]+ [0-9]+$' | sort -nr -k2 | head -10
In the end, we sort the host names by frequency, and return the top 10.
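To see what the join and the filter do, here is a small sketch using two tiny summary files. All host names and counts below are invented, the scratch directory is only for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Scratch directory with two hypothetical summaries in "count hostname"
# form, already sorted by host name (as "sort -k2" would leave them).
workdir=$(mktemp -d)
export LC_ALL=C
printf '%s\n' '5 a1406.g.akamai.net' '3 www.example.com' > "$workdir/oldlog"
printf '%s\n' '2 a1406.g.akamai.net' '7 new-host.example.net' \
    '12 odd-name.example.org' '1 www.example.com' > "$workdir/newlog"

# Join on column 2 of both files; -a 2 also prints the lines from newlog
# that have no partner in oldlog.
join -1 2 -2 2 -a 2 "$workdir/oldlog" "$workdir/newlog" > "$workdir/combined"
cat "$workdir/combined"

# Drop host names seen before (lines ending in two numbers), then rank the
# remaining names by query count and keep at most 10.
suspects=$(grep -Ev '.* [0-9]+ [0-9]+$' "$workdir/combined" | sort -nr -k2 | head -10)
printf '%s\n' "$suspects"
```

In this example, combined holds four lines, two of which carry both an old and a new count. Only odd-name.example.org (12 queries) and new-host.example.net (7 queries) survive the filter, in that order.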
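To summarize the script for simple copy/paste ($oldlogs, $newlog, and $tmpdir need to be set to the old log files, the current log file, and a scratch directory). I broke some lines up to a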
cat $oldlogs | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > $tmpdir/oldlog
cat $newlog | sed -e 's/.*query: //' | cut -f 1 -d' ' | sort | uniq -c | sort -k2 > $tmpdir/newlog
join -1 2 -2 2 -a 2 $tmpdir/oldlog $tmpdir/newlog | egrep -v '.* [0-9]+ [0-9]+$' | sort -nr -k2 | head -10 > $tmpdir/suspects
The file suspects will now include the top 10 suspect domains. For added credit: add the ability to keep a whitelist.
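One possible way to approach the whitelist is sketched below. This is not part of the original script: the file names whitelist and candidates, the one-host-name-per-line whitelist format, and all sample entries are assumptions made for illustration. The idea is to filter the ranked list before truncating it to 10 entries, so that whitelisted names do not use up slots.

```shell
# Scratch directory with hypothetical input files (illustrative only).
workdir=$(mktemp -d)

# Hypothetical whitelist: one host name per line, exact matches only.
printf '%s\n' 'updates.example.com' 'cdn.example.net' > "$workdir/whitelist"

# Hypothetical ranked candidates in "hostname count" form, i.e. the output
# of the join/filter/sort steps before "head -10" is applied.
printf '%s\n' 'updates.example.com 40' 'odd-name.example.org 12' \
    'new-host.example.net 7' > "$workdir/candidates"

# Load the whitelist into an awk array, then print only candidates whose
# host name (first column) is not in it. Reading the whitelist in BEGIN
# also copes with an empty whitelist file.
filtered=$(awk -v wl="$workdir/whitelist" '
    BEGIN { while ((getline line < wl) > 0) skip[line] = 1 }
    !($1 in skip)
' "$workdir/candidates" | head -10)
printf '%s\n' "$filtered"
```

With these sample files, updates.example.com is dropped and the other two names are kept in their original order. The match is exact and case-sensitive, so a whitelist entry does not cover subdomains unless they are listed as well.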
Johannes B. Ullrich, Ph.D.
SANS Technology Institute
(c) SANS Internet Storm Center. http://isc.sans.edu Creative Commons Attribution-Noncommercial 3.0 United States License.