Web-Based Attacks Classification

Newt Tan
Jun 21, 2022

Have you ever thought about how to distinguish between different types of web-based attacks? Imagine that, during intrusion detection work, attacks could be classified to garner threat intelligence and map the current threat landscape. The classification results could also drive a web application firewall to block malicious URL requests. Furthermore, an intrusion protection mechanism can be built by redirecting URL requests classified upfront as malicious toward a web honeypot.

There have been many attack classification research projects, but it is still hard to distinguish different web-based attacks in practice. For industrial deployment, many factors have to be considered, especially time consumption and accuracy in real traffic. Different methods, especially machine learning techniques, have evolved over many years. Generally, every solution has its own strengths and weaknesses, but there is usually a way to balance the different considerations and arrive at an acceptable method.

In this article, we introduce the attack classification component of our cybersecurity situational awareness system. It plays an essential role in attack detection and helps us understand the current threat landscape. We hope it gives you some insights and motivates your own detection methodology design.

1. The Problems Behind

Based on the data we process, which consists mostly of the text or parameters in URL requests, we have narrowed the problem down to multi-class classification of short text.

Multi-class classification means classifying data into more than two classes, for example dog, cat, and mouse. In our project, the classes are different types of web-based attacks, such as XSS and SQL injection, which we will cover in a later section. Several techniques can be used for this job. The first is Natural Language Processing (NLP), the most common way to process text data; one of the most state-of-the-art models is BERT, which learns from text bidirectionally. The second is a basic neural network designed for multi-class classification, combined with a method for converting text into numeric data. The last is a graph neural network, in our case a graph convolutional network (GCN). GNNs have become very popular recently and have shown extraordinary performance in analysing topological data.

The other aspect is short text. We stress this because the source data available for training is short. A URL request is composed of a protocol, domain, path, and query. The length of the query can vary from one to hundreds of characters, but most malicious requests begin with short queries, and the keywords inside are quite limited. Detecting a malicious short query can help prevent further attempts and protect systems at an early stage of an attack. The main problem is that short text carries less information than long text, which brings challenges for some models and also hurts efficiency.
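To make that URL anatomy concrete, here is a minimal sketch of splitting a request with Python's standard `urllib.parse` (the same package we use later in preprocessing). The URL itself is a hypothetical XSS-style request, invented purely for illustration:

```python
from urllib.parse import urlparse, parse_qsl, unquote

# Hypothetical malicious request, for illustration only
url = "http://example.com/search.php?q=%3Cscript%3Ealert(1)%3C/script%3E"

parts = urlparse(url)
print(parts.scheme)   # protocol: "http"
print(parts.netloc)   # domain:   "example.com"
print(parts.path)     # path:     "/search.php"
print(parts.query)    # raw, still percent-encoded query

# Split the query into decoded key/value pairs
pairs = parse_qsl(parts.query)
print(pairs)

# Or decode the percent-encoding of the whole query at once
decoded = unquote(parts.query)
print(decoded)
```

Note how short the decoded payload is: only a handful of tokens, which is exactly the short-text problem described above.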

To find out how each model reacts to this situation, we designed three different experiments and explored which approach is the most efficient in this case.

2. What Attacks to Check

For this attack classification engine, the main data source is the query part of the URL. Almost every web-based attack, or more specifically every malicious request delivered through a URL, should be detectable by this engine. So far, limited by the number of payloads we have collected, we include only the following attack types:

  • CRLF Injection
  • CMD Injection
  • Open Redirect
  • Remote/Local File Inclusion
  • SQL Injection
  • XSS Injection

3. Preprocessing

We have some practical considerations for this step. First, the reasons why we need preprocessing include:

  • Sanitize unnecessary parts of URL requests to improve classification performance
  • Models process numeric data or arrays instead of raw text
  • Handle some long queries specially, such as repeated long strings and special characters used for obfuscation

We did not apply standard NLP preprocessing here because some characters are crucial as identifiers for classification. Finally, there are two main pipelines we adopted to preprocess these malicious scripts.

The first pipeline consists of:

  • The NLTK package with regexp_tokenize and word_tokenize
  • The urllib package with urlparse, unquote, and parse_qsl
  • A bag-of-words model

Our aim with these steps is to extract tokens from the URL query and convert them into frequency counts.
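As a minimal sketch of that pipeline, the snippet below decodes a query, tokenizes it, and builds the bag-of-words counts. A plain `re` pattern stands in for NLTK's `regexp_tokenize` here, and the SQL-injection-style query is invented for illustration:

```python
import re
from collections import Counter
from urllib.parse import unquote

# Hypothetical SQL-injection-style query, for illustration only
query = "id=1%27%20OR%20%271%27%3D%271"

# Decode percent-encoding
decoded = unquote(query)  # "id=1' OR '1'='1"

# Tokenize: keep punctuation as separate tokens, because characters
# like quotes are crucial identifiers for classification
tokens = re.findall(r"\w+|[^\w\s]", decoded)

# Bag-of-words: map each token to its frequency
bag = Counter(tokens)
print(bag)
```

The resulting counts (quote characters dominate here) become the numeric feature vector fed to the model.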

The only difference in the second pipeline is the third step: instead of a bag-of-words model, we used the BertTokenizer from the transformers library, which splits the text into subword tokens and, together with a pre-trained model, converts them into embedding vectors.

Last but not least, there is the problem of the imbalanced dataset. We explored several methods to solve it:

  • The SMOTE model inside imblearn
  • Oversampling the minority classes
  • Class weights during the training stage

It turned out that class weighting gave the best practical performance.
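As a sketch of what class weighting does, the weights can be derived from inverse label frequencies, a hand-rolled version of scikit-learn's `compute_class_weight("balanced", ...)` heuristic. The label distribution below is made up for illustration:

```python
from collections import Counter

# Hypothetical imbalanced label set (attack types as class names)
labels = ["xss"] * 90 + ["sqli"] * 8 + ["crlf"] * 2

counts = Counter(labels)
n, k = len(labels), len(counts)

# "balanced" heuristic: weight_c = n / (k * count_c),
# so rare classes contribute more to the training loss
weights = {c: n / (k * cnt) for c, cnt in counts.items()}
print(weights)
```

These weights would then typically be passed to the model's loss during training (for example via a `class_weight` argument), so that misclassifying a rare class costs more than misclassifying a common one.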

4. The Details of Every Experiment

Multi-class classification

  • Label encoder
  • Scaler and random split
  • Imbalance handling with class weights
  • Multi-class classification model

https://gist.github.com/Wapiti08/2b888a2a1e7514e9656b31c5bd35c529
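The full code is in the gist above; the listed steps might look roughly like the following sketch, with synthetic feature vectors and a logistic-regression stand-in for the actual multi-class model (every name and number here is illustrative, not our production setup):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic token-frequency features and attack-type labels
X = rng.poisson(3.0, size=(300, 20)).astype(float)
y = ["xss"] * 200 + ["sqli"] * 80 + ["crlf"] * 20

# Label encoder: map class names to integers
le = LabelEncoder()
y_enc = le.fit_transform(y)

# Scaler and random split (stratified to keep class ratios)
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_enc, test_size=0.2, random_state=0, stratify=y_enc)

# Imbalance handled with class weights; any multi-class model fits here
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

Since the features are random noise, the score is meaningless; the point is the shape of the pipeline: encode labels, scale, split, weight, fit.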

Bert model

  • Bert Model (bert-base-uncased)

https://gist.github.com/Wapiti08/864bf134106d59ccc57f2f7eda045910

GCN

  • Graph Convolutional Networks (GCN) + Gensim + BertTokenizer
  • BertTokenizer for tokenization
  • Sliding window to generate sequence for edges in graphs
  • Pre-trained model (GoogleNews-vectors-negative300.bin) loaded via Gensim to remove "stop words", which here means tokens with no discriminative value rather than ordinary English stop words
  • Stellargraph to build the graphs flow
  • GCN model for graph classification

https://gist.github.com/Wapiti08/f917d46ef528e691604713dadb61d76a
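The sliding-window step above, which turns a token sequence into graph edges, can be sketched as follows (the token list is an invented SQL-injection-flavoured example, and the window size is arbitrary):

```python
# Build co-occurrence edges from a token sequence with a sliding window:
# every pair of distinct tokens appearing inside the same window becomes
# an undirected edge of the query's graph.
def window_edges(tokens, window=3):
    edges = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edges.add(tuple(sorted((tokens[i], tokens[j]))))
    return edges

tokens = ["select", "*", "from", "users", "where", "1", "=", "1"]
print(sorted(window_edges(tokens)))
```

A library such as Stellargraph would then consume these edges (plus node features from the tokenizer embeddings) to build one graph per query for the GCN classifier.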

5. Result Comparison

Initial accuracy comparison:

- multi-class model: 0.7422 (highest 0.9111)

- bert: 0.0957 (highest 0.5912)

- gcn + gensim + bert: 0.8619 (highest 0.9862)

Time spent on preprocessing and training (same epochs and data):

- multi-class model (CPU): around 10 minutes

- bert (TPU): above 9 hours

- gcn + gensim (CPU): above 7 hours

6. Conclusion

From our experiments, we conclude that the GNN combination has the highest accuracy, while the basic multi-class classification model needs the shortest time for training and processing. In practice, we decided to use the basic multi-class model, balancing time spent against accuracy. Once we make enough progress on the GNN model to significantly reduce its processing time, GNN will become our first choice. Thanks for reading, and I hope you enjoyed it.

References

https://github.com/swisskyrepo/PayloadsAllTheThings

https://github.com/payloadbox/command-injection-payload-list

A Sequential Graph Neural Network for Short Text Classification

https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

