A systematic methodology to evaluating optimised machine learning based network intrusion detection systems
- Authors: Chindove, Hatitye Ethridge
- Date: 2022-10-14
- Subjects: Intrusion detection systems (Computer security) , Machine learning , Computer networks Security measures , Principal components analysis
- Language: English
- Type: Academic theses , Master's theses , text
- Identifier: http://hdl.handle.net/10962/362774 , vital:65361
- Description: A network intrusion detection system (NIDS) is essential for mitigating computer network attacks in various scenarios. However, the increasing complexity of computer networks and attacks makes classifying unseen or novel network traffic challenging. Supervised machine learning (ML) techniques used in a NIDS can be affected by different scenarios. Thus, dataset recency, size, and applicability are essential factors when selecting and tuning a machine learning classifier. This thesis explores developing and optimising several supervised ML algorithms with relatively new datasets constructed to depict real-world scenarios. The methodology includes empirical analyses of systematic ML-based NIDS for a near real-world network system to improve intrusion detection. The thesis relies heavily on experiments for model assessment. Data preparation methods are explored, followed by feature engineering techniques. The model evaluation process involves three experiments testing against a validation, un-trained, and retrained set. The experiments compare several traditional machine learning and deep learning classifiers to identify the best NIDS model. Results show that the focus on feature scaling, feature selection methods and ML algorithm hyper-parameter tuning per model is an essential optimisation component. Distance-based ML algorithms performed much better with quantile transformation, whilst the tree-based algorithms performed better without scaling. Permutation importance performs as a feature selection method compared to feature extraction using Principal Component Analysis (PCA) when applied against all ML algorithms explored. Random forests, Support Vector Machines and recurrent neural networks consistently achieved the best results, with high macro F1-scores of 90%, 81% and 73% for the CICIDS 2017 dataset, and 72%, 68% and 73% against the CICIDS 2018 dataset. , Thesis (MSc) -- Faculty of Science, Computer Science, 2022
- Date Issued: 2022-10-14
A multispectral and machine learning approach to early stress classification in plants
- Authors: Poole, Louise Carmen
- Date: 2022-04-06
- Subjects: Machine learning , Neural networks (Computer science) , Multispectral imaging , Image processing , Plant stress detection
- Language: English
- Type: Master's thesis , text
- Identifier: http://hdl.handle.net/10962/232410 , vital:49989
- Description: Crop loss and failure can impact both a country’s economy and food security, often with devastating effect. As such, successfully detecting plant stresses early in their development is essential to minimize spread and damage to crop production. Identification of the stress and the stress-causing agent is the most critical and challenging step in plant and crop protection. With the development of, and easier access to, new equipment and technology in recent years, the use of spectroscopy in the early detection of plant diseases has become notably popular. This thesis narrows down the most suitable multispectral imaging techniques and machine learning algorithms for early stress detection. Datasets of visible images and multispectral images were collected. Dehydration was selected as the plant stress type for the main experiments, and data was collected from six plant species typically used in agriculture. Key contributions of this thesis include multispectral and visible datasets showing plant dehydration as well as a separate preliminary dataset on plant disease. Promising results on dehydration showed statistically significant accuracy improvements for multispectral imaging compared to visible imaging for early stress detection, with multispectral input obtaining 92.50% accuracy against 77.50% for visible input on general plant species. The system was effective at stress detection on known plant species, with multispectral imaging introducing greater improvement to early stress detection than advanced stress detection. Furthermore, strong species discrimination was achieved when exclusively testing either early or advanced dehydration against healthy species. , Thesis (MSc) -- Faculty of Science, Ichthyology & Fisheries Sciences, 2022
- Date Issued: 2022-04-06
Statistical and Mathematical Learning: an application to fraud detection and prevention
- Authors: Hamlomo, Sisipho
- Date: 2022-04-06
- Subjects: Credit card fraud , Bootstrap (Statistics) , Support vector machines , Neural networks (Computer science) , Decision trees , Machine learning , Cross-validation , Imbalanced data
- Language: English
- Type: Master's thesis , text
- Identifier: http://hdl.handle.net/10962/233795 , vital:50128
- Description: Credit card fraud is an ever-growing problem. There has been a rapid increase in the rate of fraudulent activities in recent years, resulting in considerable losses to several organizations, companies, and government agencies. Many researchers have focused on detecting fraudulent behaviours early using advanced machine learning techniques. However, credit card fraud detection is not a straightforward task, since fraudulent behaviours usually differ for each attempt and the dataset is highly imbalanced, that is, non-fraudulent cases far outnumber fraudulent cases. In the case of the European credit card dataset, the ratio is approximately one fraudulent case to five hundred and seventy-eight non-fraudulent cases. Different methods were implemented to overcome this problem, namely random undersampling, one-sided sampling, SMOTE combined with Tomek links and parameter tuning. Predictive classifiers, namely logistic regression, decision trees, k-nearest neighbour, support vector machine and multilayer perceptrons, are applied to predict whether a transaction is fraudulent or non-fraudulent. Each model's performance is evaluated based on recall, precision, F1-score, the area under the receiver operating characteristic curve, geometric mean and Matthews correlation coefficient. The results showed that the logistic regression classifier performed better than the other classifiers except when the dataset was oversampled. , Thesis (MSc) -- Faculty of Science, Statistics, 2022
- Date Issued: 2022-04-06
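The description above evaluates classifiers on an imbalanced dataset with recall, precision, F1-score, geometric mean and the Matthews correlation coefficient. The following is a minimal sketch of how those threshold-based metrics can be computed from confusion-matrix counts. It is an illustration only, not the thesis's code: the function name `imbalance_metrics` and the example counts are assumptions, and accuracy is included solely to contrast it with the other metrics.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute metrics from confusion-matrix counts (fraud = positive class)."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Geometric mean of the true-positive and true-negative rates.
    g_mean = math.sqrt(recall * specificity)
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical classifier that labels every transaction as non-fraudulent,
# evaluated on one fraudulent and 578 non-fraudulent cases.
always_negative = imbalance_metrics(tp=0, fp=0, tn=578, fn=1)
```

With the roughly 1:578 class ratio quoted in the description, a classifier that never predicts fraud still scores very high accuracy while its recall, geometric mean and Matthews correlation coefficient are all zero, which is why metrics of this kind suit imbalanced data.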
- Authors: Hamlomo, Sisipho
- Date: 2022-04-06
- Subjects: Credit card fraud , Bootstrap (Statistics) , Support vector machines , Neural networks (Computer science) , Decision trees , Machine learning , Cross-validation , Imbalanced data
- Language: English
- Type: Master's thesis , text
- Identifier: http://hdl.handle.net/10962/233795 , vital:50128
- Description: Credit card fraud is an ever-growing problem. There has been a rapid increase in the rate of fraudulent activities in recent years resulting in a considerable loss to several organizations, companies, and government agencies. Many researchers have focused on detecting fraudulent behaviours early using advanced machine learning techniques. However, credit card fraud detection is not a straightforward task since fraudulent behaviours usually differ for each attempt and the dataset is highly imbalanced, that is, the frequency of non-fraudulent cases outnumbers the frequency of fraudulent cases. In the case of the European credit card dataset, we have a ratio of approximately one fraudulent case to five hundred and seventy-eight non-fraudulent cases. Different methods were implemented to overcome this problem, namely random undersampling, one-sided sampling, SMOTE combined with Tomek links and parameter tuning. Predictive classifiers, namely logistic regression, decision trees, k-nearest neighbour, support vector machine and multilayer perceptrons, are applied to predict if a transaction is fraudulent or non-fraudulent. The model's performance is evaluated based on recall, precision, F1-score, the area under receiver operating characteristics curve, geometric mean and Matthew correlation coefficient. The results showed that the logistic regression classifier performed better than other classifiers except when the dataset was oversampled. , Thesis (MSc) -- Faculty of Science, Statistics, 2022
- Full Text:
- Date Issued: 2022-04-06
A model for recommending related research papers: A natural language processing approach
- Authors: Van Heerden, Juandre Anton
- Date: 2022-04
- Subjects: Machine learning , Artificial intelligence
- Language: English
- Type: Master's theses , text
- Identifier: http://hdl.handle.net/10948/55668 , vital:53405
- Description: The volume of information generated lately has led to information overload, which has impacted researchers’ decision-making capabilities. Researchers have access to a variety of digital libraries to retrieve information. Digital libraries often offer access to a number of journal articles and books. Although digital libraries have search mechanisms, it still takes much time to find related research papers. The main aim of this study was to develop a model that uses machine learning techniques to recommend related research papers. The conceptual model was informed by literature on recommender systems in other domains. Furthermore, a literature survey on machine learning techniques helped to identify candidate techniques that could be used. The model comprises four phases. These phases are completed twice, the first time for learning from the data and the second time when a recommendation is sought. The four phases are: (1) identifying and removing stopwords, (2) stemming the data, (3) identifying the topics for the model, and (4) measuring similarity between documents. The model is implemented and demonstrated using a prototype to recommend research papers using a natural language processing approach. The prototype underwent three iterations. The first iteration focused on understanding the problem domain by exploring how recommender systems and related techniques work. The second iteration focused on pre-processing techniques, topic modeling and similarity measures of two probability distributions. The third iteration focused on refining the prototype and documenting the lessons learned throughout the process. Practical lessons were learned while finalising the model and constructing the prototype. These practical lessons should help to identify opportunities for future research. , Thesis (MIT) -- Faculty of Engineering the Built Environment and Technology, Information Technology, 2022
- Date Issued: 2022-04
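The four phases named in the description above (stopword removal, stemming, topic identification, similarity between documents) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the thesis's model: the stopword list is a tiny hypothetical sample, `stem` is a naive suffix stripper standing in for a real stemmer, a normalised term-frequency distribution stands in for a fitted topic model, and Jensen-Shannon divergence is one assumed choice of similarity measure between two probability distributions.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def remove_stopwords(text):
    """Phase 1: lowercase, tokenise on letters, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def stem(token):
    """Phase 2: naive suffix stripping (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_distribution(tokens):
    """Phase 3 stand-in: a probability distribution over stems, not a topic model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Phase 4: Jensen-Shannon divergence (base 2); 0 means identical."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def preprocess(text):
    """Run phases 1 to 3 on a piece of text."""
    return term_distribution([stem(t) for t in remove_stopwords(text)])


def recommend(query, corpus):
    """Return the key of the corpus document closest to the query."""
    q = preprocess(query)
    return min(corpus, key=lambda key: js_divergence(q, preprocess(corpus[key])))
```

Calling `recommend` with a query string and a dictionary that maps paper identifiers to their text returns the identifier whose distribution diverges least from the query. The same `preprocess` function serves both passes mentioned in the description: once over the stored papers and once over the query.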
- Authors: Van Heerden, Juandre Anton
- Date: 2022-04
- Subjects: Machine learning , Artificial intelligence
- Language: English
- Type: Master's theses , text
- Identifier: http://hdl.handle.net/10948/55668 , vital:53405
- Description: The volume of information generated lately has led to information overload, which has impacted researchers’ decision-making capabilities. Researchers have access to a variety of digital libraries to retrieve information. Digital libraries often offer access to a number of journal articles and books. Although digital libraries have search mechanisms it still takes much time to find related research papers. The main aim of this study was to develop a model that uses machine learning techniques to recommend related research papers. The conceptual model was informed by literature on recommender systems in other domains. Furthermore, a literature survey on machine learning techniques helped to identify candidate techniques that could be used. The model comprises four phases. These phases are completed twice, the first time for learning from the data and the second time when a recommendation is sought. The four phases are: (1) identify and remove stopwords, (2) stemming the data, (3) identify the topics for the model, and (4) measuring similarity between documents. The model is implemented and demonstrated using a prototype to recommend research papers using a natural language processing approach. The prototype underwent three iterations. The first iteration focused on understanding the problem domain by exploring how recommender systems and related techniques work. The second iteration focused on pre-processing techniques, topic modeling and similarity measures of two probability distributions. The third iteration focused on refining the prototype, and documenting the lessons learned throughout the process. Practical lessons were learned while finalising the model and constructing the prototype. These practical lessons should help to identify opportunities for future research. , Thesis (MIT) -- Faculty of Engineering the Built Environment and Technology, Information Technology, 2022
- Full Text:
- Date Issued: 2022-04
Applying insights from machine learning towards guidelines for the detection of text-based fake news
- Authors: Ngada, Okuhle
- Date: 2021-12
- Subjects: Machine learning , Fake News
- Language: English
- Type: Master's theses , text
- Identifier: http://hdl.handle.net/10948/60243 , vital:64141
- Description: Web-based technologies have fostered an online environment where information can be disseminated in a fast and cost-effective manner whilst targeting large and diverse audiences. Unfortunately, the rise and evolution of web-based technologies have also created an environment where false information, commonly referred to as “fake news”, spreads rapidly. The effects of this spread can be catastrophic. Finding solutions to the problem of fake news is complicated for a myriad of reasons, such as: what is defined as fake news, the lack of quality datasets available to researchers, the topics covered in such data, and the fact that datasets exist in a variety of languages. False information dissemination can result in reputational damage, financial damage to affected brands, and ultimately, misinformed online news readers who can make misinformed decisions. The objective of the study is to propose a set of guidelines that can be used by other system developers to implement misinformation detection tools and systems. The guidelines are constructed using findings from the experimentation phase of the project and information uncovered in the literature review conducted as part of the study. A selection of machine and deep learning approaches is examined to test the applicability of cues that could separate fake online articles from real online news articles. Key performance metrics such as precision, recall, accuracy, F1-score, and ROC are used to measure the performance of the selected machine learning and deep learning models. To demonstrate the practicality of the guidelines and allow for reproducibility of the research, each guideline provides background information relating to the identified problem, a solution to the problem through pseudocode, code excerpts using the Python programming language, and points of consideration that may assist with the implementation. , Thesis (MA) -- Faculty of Engineering, the Built Environment, and Technology, 2021
- Date Issued: 2021-12