Cosine similarity is a technique for measuring how similar two things are. For example, you can use it to check whether two documents are similar to each other.
In cyber security, we can use it to hunt for different threats, such as domain typosquatting, malicious PowerShell commands, etc.
In this article, we will see how to use cosine similarity to detect domain typosquatting, which is commonly used in targeted attacks and spear phishing.
"Typosquatting, also called URL hijacking, a sting site, or a fake URL, is a form of cybersquatting, and possibly brandjacking which relies on mistakes such as typos made by Internet users when inputting a website address into a web browser. Should a user accidentally enter an incorrect website address, they may be led to any URL (including an alternative website owned by a cybersquatter)."
Similarity Check
Let's say our domain is called example.com and an adversary wants to target our employees by sending a phishing email with some URLs pointing to a domain like:
examp1e.com
exemple.com
examlpe.com
...
and so on.
To detect such activity, we can't rely on the equality operator for comparison, like:
'exemple.com' == 'example.com'
It won't match, even though a human can see at a glance that the two keywords are very similar.
A computer can use mathematical techniques to measure the similarity between the two keywords. One simple technique is to compare their character frequencies, that is, how many times each character appears in each keyword.
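To make the idea of character frequency concrete, here is a minimal Python sketch. The helper name char_frequency and the choice to lowercase the input are assumptions made for illustration, not part of the original technique description.

```python
from collections import Counter

def char_frequency(keyword: str) -> Counter:
    """Count how many times each character occurs in the keyword.

    Lowercasing is an assumption here: domain names are case-insensitive.
    """
    return Counter(keyword.lower())

legit = char_frequency("example.com")
suspect = char_frequency("exemple.com")

# Characters whose counts differ, as (count in legit, count in suspect).
# A Counter returns 0 for characters it has not seen.
diff = {
    ch: (legit[ch], suspect[ch])
    for ch in set(legit) | set(suspect)
    if legit[ch] != suspect[ch]
}
print(diff)
```

The two keywords differ only in the counts of 'e' and 'a', which is why they look so alike to a human reader.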
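In fact, we apply cosine similarity by building a character-frequency vector for each keyword and then computing the cosine of the angle between the two vectors:
similarity(A, B) = (A · B) / (||A|| * ||B||)
Here A · B is the dot product of the two vectors, and ||A|| and ||B|| are their Euclidean lengths. Because character counts are never negative, the result falls between 0 and 1, and a value close to 1 means the two keywords have nearly the same character makeup.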
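A minimal Python sketch of this calculation follows, using only the standard library. The function name cosine_similarity, the lowercasing step, and the 0.9 threshold are assumptions chosen for illustration; a real detection rule would need a threshold tuned on actual traffic.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the character-frequency vectors of a and b."""
    va, vb = Counter(a.lower()), Counter(b.lower())
    # Dot product over shared characters; missing characters contribute 0.
    dot = sum(count * vb[ch] for ch, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)

OUR_DOMAIN = "example.com"
THRESHOLD = 0.9  # hypothetical cut-off, not a recommended value

for candidate in ["examp1e.com", "exemple.com", "examlpe.com"]:
    score = cosine_similarity(OUR_DOMAIN, candidate)
    # An exact match is our own domain, so only near-matches are flagged.
    flagged = candidate != OUR_DOMAIN and score >= THRESHOLD
    print(f"{candidate}: {score:.3f} suspicious={flagged}")
```

Note that character frequency ignores character order, so a transposition such as examlpe.com scores the same as the legitimate domain itself. That is why the sketch excludes exact string matches explicitly rather than relying on the score alone.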