Long Short-Term Memory Network (LSTM) based Phishing Detection Model for E-Mail and SMS with or without URL

Murali Dharmalingam

Capstone Project by Cohort 6 of the National Cyber Security Scholar Program

Abstract

Phishing is a common technique used by cybercriminals to gain access to systems and to exploit confidential information, and phishing e-mails are its most common vehicle. Many phishing detection systems utilise machine learning techniques and blacklists. The characteristics of an e-mail, namely its URLs and its phishing content, play an important role in phishing detection. In this study, a Long Short-Term Memory (LSTM) neural network-based e-mail and SMS phishing detection model using phishing lexicons is proposed. The proposed method consists of five modules: URL filtering, initial labelling, adaptive lexicon formation, dataset labelling, and classification using LSTM. URL filtering involves three steps: index appending, URL filtering from e-mail, and URL filtering from SMS. In index appending, an index is added to the e-mail and SMS datasets. URLs are identified by the occurrence of a protocol, subdomain, root domain, and top-level domain (TLD), and are filtered into a URL dataset based on their respective index. The e-mails and SMS with URLs are labelled as phishing or legitimate by a stacking classifier. Adaptive lexicon formation is used to label the e-mails and SMS without URLs: their contents are preprocessed and tokenised, and the tokens are lemmatised so that word variants with the same meaning are collapsed, reducing ambiguity. Dataset labelling is performed by the K-means clustering algorithm using Levenshtein distance vectors: the Levenshtein distances between the tokens and the e-mail and SMS content are recorded as vectors, and the contents are then clustered and labelled as phishing or legitimate. The LSTM classifier classifies e-mails and SMS as phishing or legitimate. The e-mail and SMS datasets were taken from the Kaggle repository. The performance of the proposed model was evaluated using accuracy and the loss function. The proposed LSTM neural network-based e-mail and SMS phishing detection model achieved an accuracy of 97%.
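
The dataset labelling step can be illustrated with a minimal sketch of how Levenshtein distance vectors might be built before clustering. This is not the study's implementation. The function names, the example lexicon, and the choice to record, for each lexicon term, the smallest distance to any token in the message are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def distance_vector(message_tokens, lexicon):
    """One entry per lexicon term: the smallest edit distance to any message token.

    Assumption: an empty message falls back to the term length, which is
    the distance from the term to the empty string.
    """
    return [
        min((levenshtein(term, tok) for tok in message_tokens), default=len(term))
        for term in lexicon
    ]


# Hypothetical lexicon and message tokens, for illustration only.
lexicon = ["verify", "account", "password"]
tokens = ["pleas", "verfy", "your", "acount"]
vector = distance_vector(tokens, lexicon)
```

Vectors built this way, one per message, could then be passed to a K-means implementation with two clusters. Small distances indicate that a message contains a lexicon term or a misspelt variant of it.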
