Augmenting encoder-decoder networks for first-order logic formula parsing using attention pointer mechanisms
- Authors: Tissink, Kade
- Date: 2024-04
- Subjects: Translators (Computer programs) , Computational linguistics , Computer science
- Language: English
- Type: Master's theses , text
- Identifier: http://hdl.handle.net/10948/64390 , vital:73692
- Description: Semantic parsing is the task of extracting a structured machine-interpretable representation from a natural language utterance. This representation can be used for various applications such as question answering, information extraction, and dialogue systems. However, semantic parsing is a challenging problem that requires dealing with the ambiguity, variability, and complexity of natural language. This dissertation investigates neural parsing of natural language (NL) sentences to first-order logic (FOL) formulas. FOL is a widely used formal language for expressing logical statements and reasoning. FOL formulas can capture the meaning and structure of natural language sentences in a precise and unambiguous way. The problem is initially approached as a sequence-to-sequence mapping task using both LSTM-based and transformer encoder-decoder architectures for character-, subword-, and word-level text tokenisation. These models are trained on NL-FOL datasets using supervised learning and evaluated on various metrics such as exact match accuracy, syntactic validity, formula structure accuracy, and predicate/constant similarity. A novel augmented model is then introduced that decomposes the task of neural FOL parsing into four inter-dependent subtasks: template decoding, predicate and constant recognition, predicate set pointing, and object set pointing. The components for the four subtasks are jointly trained using multi-task learning and evaluated using the same metrics as the sequence-to-sequence models. The results indicate improved performance over the sequence-to-sequence models, and the modular design allows for more interpretability and flexibility. Additionally, to compensate for the scarcity of open-source, labelled NL-FOL datasets, a new benchmark is constructed from publicly accessible data. The data consists of NL sentences paired with corresponding FOL formulas in a standardised notation. The data is split into training, validation, and test sets.
The main contributions of this dissertation are: an in-depth literature review covering decades of research presented with a consistent notation, the formation of a complex NL-FOL benchmark that includes algorithmically generated and human-annotated FOL formulas, proposal of a novel transformer encoder-decoder architecture that is shown to successfully train at significant depths, evaluation of twenty sequence-to-sequence models on the task of neural FOL parsing for different text representations and encoder-decoder architectures, the proposal of a novel augmented FOL parsing architecture, and an in-depth analysis of the strengths and weaknesses of these models. , Thesis (MSc) -- Faculty of Science, School of Computer Science, Mathematics, Physics and Statistics , 2024
- Date Issued: 2024-04
A corpus-based investigation of Xhosa English in the classroom setting
- Authors: Platt, Candice Lee
- Date: 2004 , 2013-06-03
- Subjects: English language -- Study and teaching (Foreign speakers) -- South Africa , Computational linguistics , Black English -- South Africa , Black people -- South Africa -- Eastern Cape -- Education
- Language: English
- Type: Thesis , Masters , MA
- Identifier: vital:2379 , http://hdl.handle.net/10962/d1007613
- Description: This study is an investigation of Xhosa English as used by teachers in the Grahamstown area of the Eastern Cape. The aims of the study were firstly, to compile a 20 000 word mini-corpus of the spoken English of Xhosa mother-tongue teachers in Grahamstown, and to use this data to describe the characteristics of Xhosa English used in the classroom context; and secondly, to assess the usefulness of a corpus-based approach to a study of this nature. The English of five Xhosa mother-tongue teachers was investigated. These teachers were recorded while teaching in English and the data was then transcribed for analysis. The data was analysed using Wordsmith Tools to investigate patterns in the teachers' language. Grammatical, lexical and discourse patterns were explored based on the findings of other researchers' investigations of Black South African English and Xhosa English. In general, many of the patterns reported in the literature were found in the data, but to a lesser extent than reported in literature which gave quantitative information. Some features not described elsewhere were also found. The corpus-based approach was found to be useful within the limits of pattern-matching.
- Date Issued: 2004