Application of machine learning, molecular modelling and structural data mining against antiretroviral drug resistance in HIV-1
- Authors: Sheik Amamuddy, Olivier Serge André
- Date: 2020
- Subjects: Machine learning , Molecules -- Models , Data mining , Neural networks (Computer science) , Antiretroviral agents , Protease inhibitors , Drug resistance , Multidrug resistance , Molecular dynamics , Renin-angiotensin system , HIV (Viruses) -- South Africa , HIV (Viruses) -- Social aspects -- South Africa , South African Natural Compounds Database
- Language: English
- Type: text , Thesis , Doctoral , PhD
- Identifier: http://hdl.handle.net/10962/115964 , vital:34282
- Description: Millions of people worldwide are infected with the Human Immunodeficiency Virus (HIV), even though the death toll is on the decline. Antiretrovirals (ARVs), more specifically protease inhibitors, have shown tremendous success since their introduction into therapy in the mid-1990s by slowing progression to the Acquired Immune Deficiency Syndrome (AIDS). However, Drug Resistance Mutations (DRMs) are constantly selected for through viral adaptation, making drugs less effective over time. The current challenge is to manage the infection optimally with a limited set of drugs, each with differing levels of toxicity, in the face of a virus that (1) exists as a quasispecies, (2) may transmit acquired DRMs to drug-naive individuals and (3) can manifest class-wide resistance due to similarities in drug design. The presence of latent reservoirs, unawareness of infection status, education and various socio-economic factors make the problem even more complex. Adequate timing and choice of drug prescription, together with treatment adherence, are very important, as drug toxicity, drug failure and sub-optimal treatment regimens leave room for further development of drug resistance. While CD4 cell counts and viral load determination from patients in resource-limited settings are very helpful for tracking how well a patient’s immune system keeps the virus in check, they are slow at determining whether an ARV is effective. PhenoSense assay kits address this problem by using viruses engineered to contain the patient’s sequences and evaluating their growth in the presence of different ARVs, but this can be expensive and too involved for routine checks. As a cheaper and faster alternative, genotypic assays provide similar information from HIV pol sequences obtained from blood samples, inferring ARV efficacy on the basis of drug resistance mutation patterns. However, these patterns are inherently complex, and the various in silico prediction methods, such as Geno2pheno, REGA and Stanford HIVdb, do not agree in every case, even though this gap narrows as the list of resistance mutations is updated. A major gap in HIV treatment is that the information used for predicting drug resistance is computed mainly from data containing an overwhelming majority of subtype B HIV, even though this subtype comprises only about 12% of HIV infections worldwide. In addition to growing evidence that drug resistance is subtype-related, it is intuitive to hypothesise that, since subtyping is a phylogenetic classification, the more divergent a subtype is from the strains used to train prediction models, the less well their resistance profiles will correlate. For these reasons, we used a multi-faceted approach to attack the virus in multiple ways. This research aimed to (1) improve resistance prediction methods by focusing solely on the available subtype, (2) mine structural information pertaining to resistance in order to find exploitable weak points and increase knowledge of the mechanistic processes of drug resistance in HIV protease, and (3) screen for protease inhibitors in a database of natural compounds [the South African Natural Compounds Database (SANCDB)] to find molecules or molecular properties that could yield improved inhibition of the drug target. In this work, structural information was mined using the Anisotropic Network Model, Dynamic Cross-Correlation, Perturbation Response Scanning, residue contact network analysis and the radius of gyration.
These methods failed to reveal any resistance-associated patterns in terms of natural movement, internal correlated motions, residue perturbation response, relational behaviour and global compaction, respectively. Applications of drug docking, homology modelling and energy minimisation for generating features suitable for machine learning were not very promising, and rather suggest that binding energies from Vina may not be quantitatively reliable on their own. All these failures led to a refinement that resulted in a highly sensitive, statistically guided network construction and analysis, which led to key findings in the early dynamics associated with resistance across all PI drugs. The latter experiment unravelled a conserved lateral expansion motion at the flap elbows, and an associated contraction that drives the base of the dimerisation domain towards the floor of the catalytic site in the case of drug resistance. Interestingly, we found that despite the conserved movement, bond angles were degenerate. In parallel, 16 Artificial Neural Network models were optimised for HIV protease and reverse transcriptase inhibitors, with performance on par with Stanford HIVdb. Finally, we prioritised nine compounds with potential protease inhibitory activity using virtual screening and molecular dynamics (MD), and additionally suggested a promising modification to one of the compounds. This yielded another molecule that inhibited both the open and closed conformations of the receptor target equally well; each of the compounds had been selected against an array of multi-drug-resistant receptor variants. While a main hurdle was the lack of non-B subtype data, our findings, especially those from the statistically guided network analysis, may extrapolate to other subtypes to a certain extent, as the level of conservation within subtype B was very high despite all the variation present. This network construction method lays down a sensitive approach for analysing a pair of alternate phenotypes among which complex patterns prevail, given a sufficient number of experimental units. During the course of this research, a weighted contact mapping tool was developed to compare renin-angiotensinogen variants and was packaged as part of the MD-TASK tool suite. Finally, the functionality, compatibility and performance of the MODE-TASK tool were evaluated and confirmed for both Python 2.7.x and Python 3.x, for the analysis of normal modes from single protein structures and essential modes from MD trajectories. These techniques and tools collectively add to the conventional means of MD analysis.
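As an illustration of the kind of structural mining described above, the sketch below computes a radius of gyration and a C-alpha residue contact network with MDAnalysis and networkx. This is a minimal sketch under stated assumptions, not the MD-TASK pipeline used in the thesis: the input file name (protease.pdb) and the 6.7 Å contact cutoff are illustrative choices.

    import MDAnalysis as mda
    import networkx as nx
    import numpy as np

    # Load a protease structure (hypothetical file name).
    u = mda.Universe("protease.pdb")
    ca = u.select_atoms("name CA")

    # Global compaction: radius of gyration of the C-alpha trace.
    print("Radius of gyration (A):", ca.radius_of_gyration())

    # Residue contact network: link residues whose C-alpha atoms lie
    # within a distance cutoff (6.7 A is a common, but arbitrary, choice).
    d = np.linalg.norm(ca.positions[:, None, :] - ca.positions[None, :, :], axis=-1)
    G = nx.Graph()
    G.add_nodes_from(res.resid for res in ca.residues)
    for i, j in zip(*np.where((d > 0) & (d < 6.7))):
        if i < j:
            G.add_edge(ca.residues[i].resid, ca.residues[j].resid)

    # Betweenness centrality flags residues that mediate communication paths.
    bc = nx.betweenness_centrality(G)
    print(sorted(bc, key=bc.get, reverse=True)[:5])

On an MD trajectory rather than a single structure, the same quantities would be computed per frame and aggregated.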
- Full Text:
Guidelines for secure cloud-based personal health records
- Authors: Mxoli, Ncedisa Avuya Mercia
- Date: 2017
- Subjects: Cloud computing -- Security measures , Computer security , Data mining , Medical records -- Data processing
- Language: English
- Type: Thesis , Masters , MTech
- Identifier: http://hdl.handle.net/10948/14134 , vital:27433
- Description: Traditionally, health records have been stored in paper folders at the physician’s consulting rooms – or at the patient’s home. Some people stored the health records of their family members, so as to keep a running history of all the medical procedures they went through and the medications they were given by different physicians at different stages of their lives. Technology has introduced better and safer ways of storing these records, namely through the use of Personal Health Records (PHRs). With time, different types of PHRs have emerged, i.e. local, remote server-based, and hybrid PHRs. Web-based PHRs fall under the remote server-based PHRs, and recently a new market for storing PHRs has emerged. Cloud computing has become a trend for storing PHRs in a more accessible and efficient manner. Despite its many benefits, cloud computing raises many privacy and security concerns. As a result, the adoption rate of cloud services is not yet very high. A qualitative and exploratory research design approach was followed in this study, in order to reach the objective of proposing guidelines that could assist PHR providers in selecting a secure Cloud Service Provider (CSP) to store their customers’ health data. The research methods used include a literature review, a systematic literature review, qualitative content analysis, reasoning, argumentation and elite interviews. A systematic literature review and qualitative content analysis were conducted to examine those risks in the cloud environment that could have a negative impact on the secure storing of PHRs. PHRs must satisfy certain dimensions in order to be meaningful for use. While these were highlighted in the research, it also emerged that certain risks affect the PHR dimensions directly, thus threatening the meaningfulness and usability of cloud-based PHRs. The literature review revealed that specific control measures can be adopted to mitigate the identified risks. These control measures form part of the material used in this study to identify the guidelines for secure cloud-based PHRs. The guidelines were formulated through reasoning and argumentation. After the guidelines were formulated, elite interviews were conducted in order to validate and finalize the main research output: the guidelines. The results of this study may alert PHR providers to the risks that exist in the cloud environment, so that they can make informed decisions when choosing a CSP for storing their customers’ health data.
- Full Text:
Using data mining techniques for the prediction of student dropouts from university science programs
- Authors: Vambe, William Tichaona
- Date: 2016
- Subjects: Data mining , Dropout behavior, Prediction of
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: http://hdl.handle.net/10353/12314 , vital:39252
- Description: Data mining has taken centre stage in education for addressing student dropout, which has become one of the major threats facing Higher Education Institutions (HEIs). Being able to predict students who are likely to drop out helps the university to assist those facing challenges early. This results in producing more graduates with the intellectual capital to supply skills to industry, hence addressing the major skills shortage faced in South Africa. Studies reported in the literature have addressed the dropout challenge first through a theoretical approach based on Tinto’s model, followed by traditional statistical approaches. However, both lacked accuracy and automation, which makes them difficult and time-consuming to use, as they require periodic testing to remain valid. Recently, data mining has become a vital tool for predicting non-linear phenomena, including cases with missing data, while bringing accuracy and automation. Assessments of the usefulness and reliability of data mining in education have made it possible for different researchers to use it for prediction. This research therefore used a data mining approach that integrates classification and prediction techniques to analyse student academic data at the University of Fort Hare, creating a model of student dropout from each student’s pre-entry data and university academic performance. Following the Knowledge Discovery in Databases (KDD) framework, data for the students enrolled in the Bachelor of Science programs between 2003 and 2014 were selected. The data went through preprocessing and transformation to deal with missing and noisy data. Classification algorithms were then used for student characterisation. Decision trees (J48), as implemented in the Weka software, were used to build the model for data mining and prediction. Decision trees were chosen for their ability to deal with textual, nominal and numeric data, as was the case with our input data, and for their good precision. The model was trained on a training data set, then validated and evaluated on another data set. Experimental results demonstrate that data mining is useful for predicting students who are at risk of dropping out. A critical analysis of the correctly classified instances, the confusion matrix and the ROC area shows that the model can correctly classify and predict those who are likely to drop out. The model accuracy was 66 percent, which the literature supports as a good figure, meaning the results can reliably be used for assessment and strategic decisions. Furthermore, the model took a matter of seconds to compute results when supplied with 400 instances, showing that it is effective and efficient. Grounding our conclusions in these experimental results, this research showed that data mining is useful for bringing automation and accuracy to the prediction of student dropouts, and that the results can be relied upon for decision-making by faculty managers, who are the decision-makers.
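For readers unfamiliar with the J48 workflow, the sketch below reproduces its shape with scikit-learn’s CART decision tree (a close relative of Weka’s C4.5-based J48) and the same evaluation outputs the abstract cites (accuracy, confusion matrix, ROC area). The synthetic feature matrix is a hypothetical stand-in; it does not reproduce the Fort Hare data set or its 66 percent accuracy.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

    # Hypothetical stand-in data: one row per student (pre-entry marks and
    # university results as features), y = 1 if the student dropped out.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 6))
    y = (X[:, 0] + rng.normal(scale=0.8, size=400) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # CART decision tree standing in for Weka's J48 (a C4.5 implementation).
    clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

    pred = clf.predict(X_te)
    print("Accuracy:", accuracy_score(y_te, pred))
    print("Confusion matrix:\n", confusion_matrix(y_te, pred))
    print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))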
- Full Text:
The impact of domain knowledge-driven variable derivation on classifier performance for corporate data mining
- Authors: Welcker, Laura Joana Maria
- Date: 2015
- Subjects: Data mining , Business -- Data processing , Database management
- Language: English
- Type: Thesis , Doctoral , DPhil
- Identifier: http://hdl.handle.net/10948/5009 , vital:20778
- Description: Technological progress, in terms of increasing computational power and growing virtual space to collect data, offers great potential for businesses to benefit from data mining applications. Data mining can create a competitive advantage for corporations by discovering business-relevant information, such as patterns, relationships and rules. The role of the human user within the data mining process is crucial, which is why the research area of domain knowledge is becoming increasingly important. This thesis investigates the impact of domain knowledge-driven variable derivation on classifier performance for corporate data mining. Domain knowledge is defined as methodological, data and business know-how. The thesis investigates the topic from a new perspective by shifting the focus from a one-sided approach, namely a purely analytic or purely theoretical approach, towards a target group-oriented (researcher and practitioner) approach which puts the methodological aspect, by means of a scientific guideline, at the centre of the research. In order to ensure the feasibility and practical relevance of the guideline, it is adapted and applied to the requirements of a practical business case. Thus, the thesis examines the topic from both a theoretical and a practical perspective, thereby overcoming the limitation of a one-sided approach, which mostly lacks either practical relevance or generalisability of results. The primary objective of this thesis is to provide a scientific guideline that enables both practitioners and researchers to advance domain knowledge-driven research on variable derivation on a corporate basis. In the theoretical part, a broad overview is given of the main aspects necessary to undertake the research, such as the concept of domain knowledge, the data mining task of classification, variable derivation as a subtask of data preparation, and evaluation techniques. This part of the thesis addresses the methodological aspect of domain knowledge. In the practical part, a research design is developed for testing six hypotheses related to domain knowledge-driven variable derivation. The major contribution of the empirical study is testing the impact of domain knowledge on a real business data set, compared to the impact of a standard and a randomly derived data set. The business application of the research is a binary classification problem in the insurance domain, dealing with the prediction of damages in legal expenses insurance. Domain knowledge is expressed by deriving the corporate variables by means of the business- and data-driven constructive induction strategy. Six variable derivation steps are investigated: normalisation, instance relation, discretisation, categorical encoding, ratio, and multivariate mathematical function. The impact of domain knowledge is examined through pairwise (with and without derived variables) performance comparisons for five classification techniques (decision trees, naive Bayes, logistic regression, artificial neural networks, k-nearest neighbours). The impact is measured by two classifier performance criteria: sensitivity and the area under the ROC curve (AUC). The McNemar significance test is used to verify the results. Based on the results, two hypotheses are clearly verified and accepted, three hypotheses are partly verified, and one hypothesis had to be rejected on the basis of the case study results.
The thesis reveals a significant positive impact of domain knowledge-driven variable derivation on classifier performance for options of all six tested steps. Furthermore, the findings indicate that the classification technique influences the impact of the variable derivation steps, and that bundling steps has a significantly higher performance impact when the variables are derived using domain knowledge (compared to a non-knowledge application). Finally, the research shows that an empirical examination of the domain knowledge impact is very complex, due to a high level of interaction between the selected research parameters (variable derivation step, classification technique and performance criteria).
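As a hedged illustration of the pairwise evaluation machinery described above, the sketch below runs a McNemar test on simulated paired outcomes from a baseline classifier and a classifier trained with derived variables. The accuracies and data are invented for the example; they are not the thesis’s insurance results.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Simulated per-case correctness on one shared test set: a baseline
    # classifier (no derived variables) vs. one with derived variables.
    rng = np.random.default_rng(1)
    base_ok = rng.random(200) < 0.75    # baseline correct on ~75% of cases
    deriv_ok = rng.random(200) < 0.85   # derived-variable model, ~85%

    # 2x2 table of paired outcomes; McNemar tests whether the off-diagonal
    # disagreement counts are symmetric (i.e. no genuine difference).
    table = [[np.sum(base_ok & deriv_ok), np.sum(base_ok & ~deriv_ok)],
             [np.sum(~base_ok & deriv_ok), np.sum(~base_ok & ~deriv_ok)]]
    result = mcnemar(table, exact=True)
    print("statistic:", result.statistic, "p-value:", result.pvalue)

The test is paired because both classifiers are scored on the same cases, which is exactly the with/without-derived-variables comparison the study performs.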
- Full Text:
Log analysis aided by latent semantic mapping
- Authors: Buys, Stephanus
- Date: 2013 , 2013-04-14
- Subjects: Latent semantic indexing , Data mining , Computer networks -- Security measures , Computer hackers , Computer security
- Language: English
- Type: Thesis , Masters , MSc
- Identifier: vital:4575 , http://hdl.handle.net/10962/d1002963
- Description: In an age of zero-day exploits and increased online attacks on computing infrastructure, operational security practitioners are becoming increasingly aware of the value of the information captured in log events. Analysis of these events is critical during incident response and forensic investigations related to network breaches, hacking attacks and data leaks. Such analysis has led to the discipline of Security Event Analysis, also known as Log Analysis. There are several challenges when dealing with events, foremost being the increasing volumes at which events are generated and stored. Furthermore, events are often captured as unstructured data, with very little consistency in their formats or contents. In this environment, security analysts and implementers of Log Management (LM) or Security Information and Event Management (SIEM) systems face the daunting task of identifying, classifying and disambiguating massive volumes of events in order for security analysis and automation to proceed. Latent Semantic Mapping (LSM) is a proven paradigm shown to be an effective method of, among other things, enabling word clustering, document clustering, topic clustering and semantic inference. This research is an investigation into the practical application of LSM in the discipline of Security Event Analysis, showing the value of using LSM to assist practitioners in identifying types of events, classifying events as belonging to certain sources or technologies, and disambiguating different events from one another. The culmination of this research presents adaptations to traditional natural language processing techniques that improved the efficacy of LSM when dealing with Security Event Analysis. This research provides strong evidence supporting the wider adoption and use of LSM, as well as further investigation into Security Event Analysis assisted by LSM and other natural language processing or machine-learning techniques.
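To make the LSM idea concrete, the sketch below applies the closely related latent semantic analysis recipe (TF-IDF followed by truncated SVD) to a few invented log lines and clusters them. The events, cluster count and two-dimensional latent space are illustrative assumptions, not the thesis’s corpus, method details or tuning.

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Invented log events; real corpora are far larger and noisier.
    events = [
        "sshd: Failed password for root from 10.0.0.5 port 22",
        "sshd: Accepted password for alice from 10.0.0.9 port 22",
        "kernel: firewall DROP IN=eth0 SRC=192.168.1.7",
        "kernel: firewall DROP IN=eth0 SRC=192.168.1.9",
    ]

    # Term-event matrix, then a low-rank projection into a latent space
    # in which events of similar type/source fall close together.
    X = TfidfVectorizer(token_pattern=r"[A-Za-z]+").fit_transform(events)
    Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    # Cluster in the latent space to group event types.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z))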
- Full Text:
Establishing opportunities for using big data analysis at the Herald
- Authors: Joshua, Nadeem
- Date: 2018
- Subjects: Big data , Business intelligence -- Data processing , Data mining
- Language: English
- Type: Thesis , Masters , MBA
- Identifier: http://hdl.handle.net/10948/30529 , vital:30957
- Description: A few years ago, merely mentioning the term ‘big data’ within industry circles would more than likely have drawn a quirky, confused look; however, the term has gained huge popularity in recent years among IT professionals and academics. The big data phenomenon has exploded in popularity worldwide and continues to grow exponentially with each passing day. It has been good news for many industries, which are ablaze with the huge volume, variety and velocity of data. As technology advances, it lifts and removes many boundaries and answers questions that are not currently being asked. Big data is therefore taking the world by storm, and it is safe to say that it has gone mainstream, with countless benefits being developed within industries. The opportunities for employing big data strategies are many, according to McKinsey and Company, and the growth in big data will spark a new wave of ‘innovation, competition and productivity’ within businesses (McKinsey & Company, 2011). Taking advantage of these opportunities will be challenging for companies, creating the need for new skills, tools and ways of thinking. Implementing big data would help in creating innovative new business models, as executives are challenged to make their organisations resilient and agile in today’s challenging business environment. This research paper aimed to unpack the understanding of big data, its challenges and its value to an organisation, and to provide a guideline or framework for implementing a big data strategy. Furthermore, this research examines the opportunities and potential value that organisations would obtain from implementing big data, as well as the challenges that could hinder implementation. Given the rapid growth and size of data, decision-makers need to be able to gain valuable insights from such varied and rapidly changing data, insights that will help organisations make far better, more intelligent and data-driven decisions, which may help in improving operations.
- Full Text: