- Title
- Prediction of protein secondary structure using binary classification trees, naive Bayes classifiers and the Logistic Regression Classifier
- Creator
- Eldud Omer, Ahmed Abdelkarim
- Subject
- Bayesian statistical decision theory
- Subject
- Logistic regression analysis
- Subject
- Biostatistics
- Subject
- Proteins -- Structure
- Date Issued
- 2016
- Date
- 2016
- Type
- Thesis
- Type
- Masters
- Type
- MSc
- Identifier
- vital:5581
- Identifier
- http://hdl.handle.net/10962/d1019985
- Description
- The secondary structure of proteins is predicted using various binary classifiers. The data are taken from the RS126 database. The original data consist of protein primary and secondary structure sequences encoded as alphabetic letters. These sequences are re-encoded as unary vectors comprising only ones and zeros. Three binary classifiers, namely naive Bayes, logistic regression and classification trees, are trained on the encoded data using hold-out and 5-fold cross validation. For each classifier three classification tasks are considered, namely helix against not helix (H/∼H), sheet against not sheet (S/∼S) and coil against not coil (C/∼C). The performance of these binary classifiers is compared using the overall accuracy in predicting the protein secondary structure for various window sizes. Our results indicate that hold-out cross validation achieved higher accuracy than 5-fold cross validation. The naive Bayes classifier, using 5-fold cross validation, achieved the lowest accuracy for predicting helix against not helix. The classification tree classifiers, using 5-fold cross validation, achieved the lowest accuracies for both the coil against not coil and the sheet against not sheet classifications. The accuracy of the logistic regression classifier depends on the window size; there is a positive relationship between accuracy and window size. The logistic regression classifier achieved the highest accuracy of the three classifiers for each classification task: 77.74 percent for helix against not helix, 81.22 percent for sheet against not sheet and 73.39 percent for coil against not coil. It is noted that it would be easier to compare classifiers if the classification process could be carried out entirely in R. Alternatively, it would be easier to assess these logistic regression classifiers if SPSS had a function to determine the accuracy of the logistic regression classifier.
- Format
- 119 p.
- Publisher
- Rhodes University
- Publisher
- Faculty of Science, Statistics
- Language
- English
- Rights
- Eldud Omer, Ahmed Abdelkarim
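The unary encoding and window sizes mentioned in the description can be illustrated with a minimal sketch. This is not the thesis's implementation (the description refers to R and SPSS). The function names, the 20-letter amino acid alphabet, the `-` padding symbol and the all-zero vector for padding are assumptions made here for illustration only.

```python
# Hypothetical sketch of sliding-window unary (one-hot) encoding.
# Assumptions: 20 standard amino acid letters; '-' pads the sequence ends
# and is encoded as an all-zero vector; window sizes are odd.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(residue):
    """Return a 20-element vector with a single 1 at the residue's position.

    Unknown letters and the padding symbol give an all-zero vector.
    """
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def encode_windows(primary, window_size):
    """Return one flattened unary vector per residue, centred on that residue."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    half = window_size // 2
    padded = "-" * half + primary + "-" * half
    vectors = []
    for i in range(len(primary)):
        window = padded[i:i + window_size]
        vector = []
        for residue in window:
            vector.extend(encode_residue(residue))
        vectors.append(vector)
    return vectors


def binary_labels(secondary, positive):
    """Map structure letters to 1 (positive class) or 0 (everything else)."""
    return [1 if s == positive else 0 for s in secondary]


# Toy example (not RS126 data): three residues, window size 3.
X = encode_windows("ACD", 3)
y_helix = binary_labels("HHC", "H")  # helix against not helix
```

Each row of `X` has `window_size * 20` entries, so a larger window gives the classifier more sequence context at the cost of more features. The same `X` can be paired with a different label vector for each of the three tasks (H/∼H, S/∼S, C/∼C).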