What elements make a difference in classification problems

Newt Tan
Mar 16, 2022


In a classification project, the results are sometimes not what you desired, whether that is the accuracy of the trained model or the final prediction output. Have you thought about what kinds of reasons can cause that?

For example, in binary classification, when you assign different percentages of the data to the classes, have you considered whether the ratio you set influences the final prediction result?

Today, I am gonna walk through a real scenario to explain this situation. Hopefully, it will help you track down issues in your own project if you have run into the same problem.

The background for this article is an attack-classification project I did for HTTP requests. There are only two classes: malicious request (1) and benign request (0). The malicious payloads include some common web-based attacks: SQL injection, XSS injection, and command injection. Some examples look like this:
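The original examples appear as an image in the article; the lines below are my own illustrative stand-ins for each class:

```
' OR '1'='1' --                            # SQL injection
<script>alert(document.cookie)</script>   # XSS injection
; cat /etc/passwd                          # command injection
GET /search?q=shoes HTTP/1.1               # benign request
```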

My main idea is to apply natural language processing (NLP) to the text, and deep learning to the binary classification job. In the NLP stage, I did the following steps to generate a proper word-embeddings dataset:

  • Sanitize the HTTP requests before generating the embeddings
  • Tokenization with two separate methods:

Here we use two different methods to tokenize the data, because it contains both full HTTP requests and standalone malicious payloads. I used a regexp to detect which lines contain URL paths, as sketched below.
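A minimal sketch of the two tokenization paths (the regexes and function names here are my own illustration, not the article’s exact code):

```python
import re

def tokenize_request(line):
    # Full HTTP requests: split on whitespace and common URL delimiters.
    return [t for t in re.split(r"[\s/?&=:;]+", line.strip()) if t]

def tokenize_payload(line):
    # Bare payloads: keep punctuation as separate tokens, since characters
    # like ' < > ; carry most of the attack signal.
    return re.findall(r"\w+|[^\w\s]", line.strip())

def tokenize(line):
    # Lines with a URL path are treated as full requests;
    # everything else is treated as a standalone payload.
    if re.search(r"(GET|POST|PUT|DELETE)\s+/\S*", line):
        return tokenize_request(line)
    return tokenize_payload(line)
```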

  • Generate word embeddings from the tokens

Here, we give the function two files so that the percentage of benign requests is easy to adjust. The key experimental parameter is the number (80,000) that controls how many benign examples are kept.
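A sketch of how the two files and the 80,000 cutoff might fit together (the file handling and names are my assumptions, and the Keras Tokenizer stands in for whatever embedding pipeline the article actually used):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def build_dataset(malicious_file, benign_file, benign_count=80000, maxlen=100):
    with open(malicious_file, encoding="utf-8") as f:
        malicious = f.read().splitlines()
    with open(benign_file, encoding="utf-8") as f:
        # benign_count controls the class ratio: keep this many benign lines.
        benign = f.read().splitlines()[:benign_count]

    texts = malicious + benign
    labels = [1] * len(malicious) + [0] * len(benign)

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)
    return X, labels, tokenizer
```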

  • The main model looks like this:
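The model itself is shown as an image in the article; a plausible minimal version with the 64-unit layers mentioned later might look like this (the pooling layer and optimizer are my assumptions):

```python
from tensorflow.keras import layers, models

vocab_size = len(tokenizer.word_index) + 1  # from the tokenizer above

model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=64),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # 1 = malicious, 0 = benign
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```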

I got an accuracy of around 98%; here is the printout of the training history:
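The history printout in the article is an image; it is the standard per-epoch output of a fit call along these lines (the epoch count, batch size, and validation split are my assumptions):

```python
import numpy as np

history = model.fit(X, np.array(labels), epochs=10, batch_size=128,
                    validation_split=0.2, shuffle=True)
print(history.history["accuracy"])      # training accuracy per epoch
print(history.history["val_accuracy"])  # validation accuracy per epoch
```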

  • Finally, test on real requests

In this test txt file, indexes 1–13 (13 requests) are malicious, 14–24 (11 requests) are benign, and the final one (1 request) is malicious. Let’s see the result:

(Figures: the predicted classes and the predicted percentages)

So we can see the result is quite acceptable. Only one benign example was not classified correctly.
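For reference, per-request predictions like the ones above come from thresholding the sigmoid output at 0.5 (a sketch; the test file name and maxlen are assumptions, reusing the tokenizer and model from earlier):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open("test_requests.txt", encoding="utf-8") as f:
    test_lines = f.read().splitlines()

X_test = pad_sequences(tokenizer.texts_to_sequences(test_lines), maxlen=100)
probs = model.predict(X_test).ravel()
classes = (probs > 0.5).astype(int)  # 1 = malicious, 0 = benign

for i, (c, p) in enumerate(zip(classes, probs), start=1):
    print(f"{i:2d}  class={c}  p={p:.2f}")
```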

In the above experiments, the malicious-to-benign ratio is 29843:80000, which is about 0.37. The final accuracy is around 98%. We will take this as the benchmark to explore other variables.

Class ratio in the dataset

Now we can start to see the difference by setting the ratio to other values. I am gonna do two experiments here: the first one is 29843:29843 (a bigger malicious share); the second one is 29843:300000 (a smaller malicious share).
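With the dataset builder sketched earlier, each experiment is just a different benign_count (the file names are the same assumed ones):

```python
# Benchmark: 29843 malicious vs 80000 benign (ratio ~0.37)
X, y, tok = build_dataset("malicious.txt", "benign.txt", benign_count=80000)

# Experiment 1: balanced classes (a bigger malicious share)
X, y, tok = build_dataset("malicious.txt", "benign.txt", benign_count=29843)

# Experiment 2: a smaller malicious share
X, y, tok = build_dataset("malicious.txt", "benign.txt", benign_count=300000)
```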

For the first one, the final predicted result is the same; there is only a slight drop in model accuracy, to around 97%.

(Figure: training history for the 29843:29843 ratio)

Then for the second one, the result is still the same, with a higher accuracy of around 99%:

From the trials with different class ratios in the dataset, we can see there is only a slight difference in accuracy. So we can conclude the class ratio is not one of the main elements contributing to the final result: its influence is quite limited, and in general, more data means better accuracy. Still, we recommend keeping the ratio in a reasonable range, like 0.2–0.8.

Preprocessing steps

Next, let’s explore whether the preprocessing steps influence the result. What I am gonna change is the spe_token_gen function, which handles the HTTP requests that contain only a payload. I will remove that function when processing the benign dataset; a sketch of the change is below.
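In terms of the tokenization sketch from earlier, the change looks roughly like this (spe_token_gen is the article’s function name; the body shown here is my stand-in):

```python
# Before: benign lines could also go through the payload-aware path.
benign_tokens = [tokenize(line) for line in benign]

# After: benign lines are forced through the plain request tokenizer only,
# i.e. the spe_token_gen-style payload handling is dropped for benign data.
benign_tokens = [tokenize_request(line) for line in benign]
```

With that change applied, let’s take a look at the training history first: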

We can see some differences here:

  • It takes more epochs to converge
  • The accuracy drops by about 10%

Next, we will see the predicted result:

We can see from this output:

  • There is one wrong prediction in the first group (malicious)
  • There is still one wrong prediction in the second group (benign)
  • The final prediction is also wrong (malicious)

From the experiment on the preprocessing step, we do find a difference. It does have a significant influence on the result.

Number of units in the model

The number of units is usually related to the total number of features. I reduced the number of units from 64 to 32; I did not see much difference in the final prediction result, only a slight drop in accuracy.

We kept all the other settings at their former values, like the activation function for the last layer and those for the hidden layers before it. We applied normalization and shuffling to the dataset.

In conclusion, to get the desired result, the most important variable is the preprocessing methodology. Other factors, like the number of units and the class ratio, also have some influence. I hope this article helps you figure out problems in your own classification projects.

Thanks for reading.
