Midterm Report

Introduction/Background

In this era of the internet and technology we have seen a wide increase in fraudulent and phishy websites where the site owner/attacker attempts to gain sensitive information from the user for his/ her own commercial benefit. This has attracted the interest of many to apply machine learning modeling to automatically identify which websites are scams or not based on a particular website’s characteristics and features. In our project, we will take features such as a website’s url length, the types of symbols the url contains, and number of MX servers and apply machine learning techniques to produce whether a website is fraudulent or not.

Problem Definition

Oftentimes, we may receive an email or see a website that grabs our attention. However, how do we know whether we can trust its information? Because of the rising use of the internet, there has been an unfortunate increase in fraudulent sites that compel users to disclose the wrong information as well as have their money stolen. Even though an average young adult can tell something may or may not be a scam because some things are “too good to be true”, many fraudulent websites are starting to look like legitimate sites. This can be especially problematic for the elderly and younger kids who may not understand what the normal of the internet is. Consequently, in our project, we will be using machine learning techniques to identify suspicious websites to protect ourselves from fraud and its negative consequences.

Data Collection

A research study was conducted on a large number of websites on the internet which identified whether a website was scam or legitimate. The study had two variations of datasets: a small dataset with 58,645 websites and a large dataset of 88,647 websites. Each dataset has 111 features which can be divided into six different categories:

Whole URL - This identifies the number of a specific attribute inside the whole url of a website. For example, the number of “.” signs in any url is a feature inside the dataset
Domain of URL - This identifies the number of a specific attribute inside the domain url of a website. For example, if a website was “goog@e.com\fis@ing\sc@mwebiste”, a value of 1 would be given for the number of “@” signs in the domain url because there is only 1 “@” in “goog@e.com”
URL directory - This identifies the number of a specific attribute inside the directory url of a website. Taking the example from point 2, “goog@e.com\fis@ing\sc@mwebiste” would have a value of 2 for the number of “@” signs in the directory url because “fis@ing\sc@mwebiste” has 2 “@” signs.
URL filename properties - This identifies the number of a specific attribute inside the filename of a url if it contains one. For example, if a website was “drive.google.com\drive\CS_4641.php”, the value of the number of “_” signs in the url filename would be 1 because CS_4641.php has 1 “_” sign.
URL parameter properties - This identifies the number of a specific attribute inside the parameter of a url if it contains one. For example, if a website was “espn/com\rankings?player=Lebron James&year=2016”, the value of the number of “&” signs would be 1 because of “?player=Lebron James&year=2016” having 1 “&” sign.
The URL of resolving data and external metrics - This identities the features that cannot directly be seen in a website’s URL. For example, the number of redirects the websites goes through when clicked is a feature under this category.

A list of all 111 features with a short description of each of their descriptions are given in detail in the linked research article. Taking in all 111 features a boolean Y value (0,1) was given to identify if a website was a scam site or a legitimate site. For our data analysis, we used the larger dataset of n = 88,647 and d = 111 because a larger dataset was more appropriate to the actual number of websites on the internet.

Methods

To avoid overfitting and underfitting, we made 80% of our dataset our training data and 20% our testing dataset and tested our models separately with them. To reduce the dimensions of our data, we applied various methods such as Principal Component Analysis (PCA), Tree Based Feature Selection, and Feature Permutations. After calculating the most important features in our dataset, we trained our model using the Naive Bayes Approach and Random Forests.

Results and Discussion

Pre-Processing

The accurate detection of scam websites involves several varying features such as types of characters found in the url and the number of MX servers. However, many of these features will not be significant when detecting scam websites and many will be correlated to each other; the additional classification from some features will be marginal and may cause overfitting when building a model. Nevertheless, we first attempted to reduce the dimensions by applying Principal Component Analysis. Although, we were able to conclude that 99% of the model could be predicted by 2 features, PCA didn’t take into consideration the y labels of our data. PCA simply maximizes the variance of a given dataset. However, even if the data gets projected onto an axis and separated, the data might not be separable by class. Therefore, we moved on to a forward feature selection technique called Tree-Based Feature Selection as a pre-processing step. Essentially, all the data points are placed at the root of a tree. The tree expands by splitting the data among different features. Each iteration looks at which feature decreases the impurity the most (i.e trying to get the data in each node to be homogenous); impurity is measured through a combination of entropy and the Gini index. Subsequent splits decrease the impurity further until ideally, every leaf is either fully "spam" or fully "not spam". This was the rigorous classification approach we were looking for. We were able to reduce the dimensions of the dataset. Furthermore the features that were found were completely different from the PCA approach. Below is a summary table of the results:

Number of Features Reduced To	Fractional Importance	Important Feature Names
2	74.5%	Directory Length Time_Domain_Activation
3	77.7%	Directory Length Time_Domain_Activation ASN_IP
4	80.3%	Directory Length Time_Domain_Activation ASN_IP Time_Response

Table 1: Summary of Tree Based Feature Selection

We considered that impurity-based feature importances may be misleading for high-cardinality features (features which may have many unique values). In other words, there is more information present for features with several different values, thereby ranking those features as more important than ones with less cardinality. Our dataset was susceptible to this, as the number of certain characters within a url may vary tremendously between websites. Therefore we tried a backward feature selection approach called Feature Permutations. In feature permutations, a model score with all the features is calculated. Then a feature is removed, and a new model score is calculated. The difference between the two values is the importance of that feature to the model. This is repeated for all the features.

Through this approach, the importance of a feature reflects its contributions to the entire model, not its intrinsic predictive value. This approach is more costly in terms of time but it was needed to remove the bias from our pre-processing. The model score is based on an external evaluation. We chose to test accuracy, precision, and recall and generate three different feature permutation models. Below is a summary of our results.

Number of Features Reduced To	Measure Focus	Fractional Importance	Import Feature Names
2	Accuracy	71.01%	Directory Length Time_Domain_Activation
2	Precision	64.8%	Directory Length Time_Domain_Activation
2	Recall	76.2%	Directory Length Time_Domain_Activation
4	Accuracy	77.3%	Directory Length Time_Domain_Activation Qty_Dot_Domain Length_ URL
4	Precision	74.01%	Directory Length Time_Domain_Activation ASN_IP Qty_Dot_Domain
4	Recall	82%	Directory Length Time_Domain_Activation Qty_Dot_Domain Length_ URL

Table 2: Summary of Permutation Feature Selection

Through all of our tests of feature permutation and impurity-based feature importance, directory_length" and "time_domain_activation" are the most important features, so we visualized the model by only focusing on these two parameters for now. Below is a scatter plot which shows all the training set data points plotted against these two features.

Looking at the visual, one cannot draw a rough estimate line separating a spam website versus a non-spam website; therefore, more features will be required when building the model. To determine whether a certain number of features is enough, we need to train the model and test its accuracy. Since our y label is a boolean value that can only be 0 and 1, using linear regression would not be a good idea. Therefore, we tested the model using the Naive Bayes Approach, which is perfect for a dataset like ours.

Training Using Gaussian Naive Bayes Approach

In this part, we trained our dataset using a Gaussian Naive Bayes Approach on both the Feature Permutation reduced training dataset and Tree Based Feature Selection reduced training dataset. We measured the accuracy of the model by calculating the percentage of correctly labeled points each model gave out. Below is a summary of the results:

Data Type	Method	Accuracy
Training Data	Feature Permutation	84.0%
Test Data	Feature Permutation	84.5%
Training Data	Tree Based Selection	82.0%
Test Data	Tree Based Selection	53.2%

Table 3: Naive Bayes Approach Table

Training Using Random Forests

Next, we attempted to use random forests to train our data. These essentially take random data points and then based on the number of selected data points they grow decision trees with more sample data points until each "branch" is pure, or they are leaves. Below is a summary of the results:

Data Type	Method	Accuracy
Training Data	Feature Permutation	97.3%
Test Data	Feature Permutation	93.6%
Training Data	Tree Based Selection	99.4%
Test Data	Tree Based Selection	33.2%

Table 4: Random Forests Approach Table

From the two methods, we were quickly able to see that the tree based model for feature selection was overfitting the data because of the high accuracy decrease between the training and test data for method training methods. Consequently, we decided to move forward with the Feature Permutation approach for our dimension reduction. As we move forward with the project, we will focus on visualizing and polishing the trained model by implementing polynomial/logistic regression and k-fold cross validation to aid with overfitting/ underfitting. Additionally, we may modify some of the parameters in the naive gaussian bayes implementation, or use another form of naive bayes entirely.

Contribution Table

Midterm Report Contribution
Dataset final search	Hassan
Transfer Content to Github pages	Seo Hyun
Format editing and data collection	Hassan
Feature Reduction	Abdulrahman
ML Model Research	Seo Hyun
Generate ML Model and Test model accuracy	Seo Hyun
Written Content Modification (Introduction, Methods, Results)	Hassan and Abdulrahman
Methods Results and Discussion	Hassan
Create visualizations	Abdulrahman

References

Abbasi, A., Chen, H. A comparison of fraud cues and classification methods for fake escrow website detection. Inf Technol Manag 10, 83–101 (2009). https://doi.org/10.1007/s10799-009-0059-0

Abbasi, A., Zhang, Z., Zimbra, D., Chen, H., & Nunamaker, J. F. (2010). Detecting Fake Websites: The Contribution of Statistical Learning Theory. MIS Quarterly, 34(3), 435–461. https://doi.org/10.2307/25750686

Afanasyeva, O., Shiyan, V., & Goncharova, M. (2021). Cyber Fraud as a Relevant Internet of Things Security Threat. Proceedings of the International Scientific and Practical Conference on Computer and Information Security - Volume 1: INFSEC, 122–126. doi:10.5220/0010619600003170

Vrbančič, G., Fister, I., & Podgorelec, V. (2020). Datasets for phishing websites detection . Data in Brief, 33, 106438. doi:10.1016/j.dib.2020.106438

Files

Link to ipynb file