
AI-MWE: Multi-Word Expressions, word-embedding models, and ontology terms



Key facts

Type of research degree
Application deadline
Sunday 3 July 2022
Project start date
Saturday 1 October 2022
Country eligibility
International (open to all nationalities, including the UK)
Lead supervisor
Professor Eric Atwell
School of Computing
Research groups/institutes
Artificial Intelligence
<h2 class="heading hide-accessible">Summary</h2>

This project will investigate the interaction between Multi-Word Expressions or phrases, word-embedding models, and ontology terms. Multi-Word Expressions have meaning not predictable from their constituent words. For example, a red herring is a clue which is misleading or distracting, and this is not predictable from the meanings of red and herring. A word embedding is a vector of numbers representing the meaning of a word, calculated from examples of uses of the word in a large corpus. An ontology term is a word or phrase labelling a concept or category in an ontology: a graph of concepts and categories in a specialised subject area or domain.<br /> <br /> Word embedding models build compact (embedded) vector representations of the meaning of each word-type in a corpus, capturing the set of distributional contexts that the word-tokens occupy in the corpus. This assumes that the corpus text is tokenised into word-tokens separated by spaces. This assumption works in general for English text, but it does not handle Multi-Word Expressions correctly.<br /> To extract a dictionary of Multi-Word Expressions from a corpus, a standard approach, eg (Alghamdi and Atwell 2019), is first to extract a set of candidate multi-word patterns which meet some statistical collocational criteria (words frequently found together), and then manually check whether these candidates meet some semantic criteria (eg the phrase meaning differs from the word meanings).<br /> <br /> In this project, you will investigate how to apply word embedding models as semantic criteria to identify multi-word units whose meaning is not directly predictable from their constituent word meanings. For example, tokenise red herring as one word and generate a word embedding model of the contexts of red herring; then build another word embedding model with independent word embedding vectors for red and herring.
If the vector for red herring is significantly different from the vectors for red and herring, this indicates a Multi-Word Expression, eg see (Pickard 2020).<br /> If word embedding models can help to identify Multi-Word Expressions in a corpus, this knowledge can in turn be used to tokenise the corpus better: for example, red herring will be tokenised as one unit, not two. This should generate better word-embedding models, where context-vectors take into account Multi-Word Expressions. For example, the word clue includes red herring in its context, rather than red and herring.<br /> <br /> This should lead to better tokenisation and vector-models for both words and Multi-Word Expressions.<br />
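The comparison described above can be sketched in a few lines of Python. This is a minimal illustration only, not the project method: the context vectors below are hand-made, hypothetical values standing in for trained word2vec or GloVe embeddings, averaging the constituent vectors is just one simple compositional baseline, and the 0.5 threshold is an arbitrary illustrative cut-off.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical context vectors over four made-up dimensions
# (fish, colour, mystery, misleading) -- not trained values.
vec = {
    "red":         [0.1, 0.9, 0.0, 0.1],
    "herring":     [0.9, 0.1, 0.0, 0.0],
    "red_herring": [0.1, 0.0, 0.8, 0.9],  # phrase tokenised as one unit
}

# A simple compositional baseline: average the constituent word vectors.
composed = [(a + b) / 2 for a, b in zip(vec["red"], vec["herring"])]

sim = cosine(vec["red_herring"], composed)
print(f"cosine(phrase, composed) = {sim:.3f}")

# Low similarity between the phrase vector and the composed vector
# suggests non-compositional meaning, ie a Multi-Word Expression.
if sim < 0.5:  # illustrative threshold
    print("red herring flagged as a candidate Multi-Word Expression")
```

In practice the vectors would come from embedding models trained on a corpus tokenised in two ways (with red herring as one token, and as two separate tokens), as in (Pickard 2020), and the similarity threshold would be calibrated against a manually checked list of Multi-Word Expressions rather than fixed in advance.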

<h2 class="heading hide-accessible">Full description</h2>

<p>Multi-Word Expressions have practical applications. Advanced learners of English, Arabic, or other languages need to extend their vocabulary beyond single words, to learn the meanings and uses of these multi-word expressions. In the language of a specialised domain, Multi-Word Expressions are often used as terms to label concepts in an ontology. For example, an ontology of crime fiction may include red herring as the label for a concept related to crime detection. Corpus linguists have studied and used corpus-based distributional lexical semantics since the 1950s in language teaching and applied linguistics; but the past decade has seen distributional lexical semantics taken over as the basis of deep learning word embedding modelling, which has dominated natural language processing research and &ldquo;... has come into the field like a schoolyard bully, forcing everything that&rsquo;s not computational into submission, collusion or the margins&rdquo; (Adam Kilgarriff, SketchEngine founder).</p> <p>REFERENCES</p> <p>Alghamdi A, Atwell E. 2019. Constructing a corpus-informed list of Arabic formulaic sequences (ArFSs) for language pedagogy and technology. International Journal of Corpus Linguistics. 24(2), pp. 202-228.</p> <p>Alrehaili S, Atwell E. 2017. Extraction of Multi-Word Terms and Complex Terms from the Classical Arabic Text of the Quran. International Journal on Islamic Applications in Computer Science And Technology. 5(3), pp. 15-27.</p> <p>Atwell E, Roberts A. 2007. CHEAT: combinatory hybrid elementary analysis of text. Proc Corpus Linguistics 2007.</p> <p>Baroni M et al. 2014. Don&rsquo;t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proc ACL&rsquo;2014, pp. 238-247.</p> <p>Devlin J et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc NAACL&rsquo;2019.</p> <p>Kilgarriff A et al. 2014. The Sketch Engine: ten years on. Lexicography. 1, pp. 7-36.</p> <p>Kudo T, Richardson J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.</p> <p>Mikolov T et al. 2013. Efficient Estimation of Word Representations in Vector Space. Proc ICLR&rsquo;2013.</p> <p>Mikolov T et al. 2013. Distributed Representations of Words and Phrases and their Compositionality. Proc NIPS&rsquo;2013.</p> <p>Mikolov T et al. 2013. Linguistic Regularities in Continuous Space Word Representations. Proc NAACL-HLT&rsquo;2013.</p> <p>Pennington J et al. 2014. GloVe: Global Vectors for Word Representation. Proc EMNLP&rsquo;2014, pp. 1532-1543.</p> <p>Pickard T. 2020. Comparing word2vec and GloVe for Automatic Measurement of MWE Compositionality. Proc Multiword Expressions and Electronic Lexicons, pp. 95-100.</p> <p>Roberts W, Egg M. 2018. A Large Automatically-Acquired All-Words List of Multiword Expressions Scored for Compositionality. Proc LREC&rsquo;2018.</p> <p>SketchEngine. 2017. Embedding Viewer.</p> <p>Wu Y et al. 2016. Google&rsquo;s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.</p>

<h2 class="heading">How to apply</h2>

<p>Formal applications for research degree study should be made online through the&nbsp;<a href="">University&#39;s website</a>. Please state clearly in the research information section that the research degree you wish to be considered for is&nbsp;<em><strong>AI-MWE: Multi-Word Expressions, word-embedding models, and ontology terms</strong></em>, and name <a href="">Professor Eric Atwell</a> as your proposed supervisor.</p> <p>If English is not your first language, you must provide evidence that you meet the University&#39;s minimum English language requirements (below).</p>

<h2 class="heading heading--sm">Entry requirements</h2>

Applicants to research degree programmes should normally have at least a first class or an upper second class British Bachelors Honours degree (or equivalent) in an appropriate discipline. The criteria for entry to some research degrees may be higher; for example, several faculties also require a Masters degree. Applicants are advised to check with the relevant School prior to making an application; those who are uncertain about the requirements for a particular research degree should contact the School or Graduate School before applying.

<h2 class="heading heading--sm">English language requirements</h2>

The minimum English language entry requirement for postgraduate research study is IELTS 6.5 overall, with at least 6.5 in writing and at least 6.0 in reading, listening and speaking, or equivalent. The test must be dated within two years of the start date of the course in order to be valid. Some schools and faculties have a higher requirement.

<h2 class="heading">Funding on offer</h2>

<p><strong>Self-funded or externally sponsored students are welcome to apply.</strong></p> <p><strong>UK</strong>&nbsp;&ndash;&nbsp;The&nbsp;<a href="">Leeds Doctoral Scholarships</a>, <a href="">School of Computing Scholarship</a>, <a href="">Akroyd &amp; Brown</a>, <a href="">Frank Parkinson</a> and <a href="">Boothman, Reynolds &amp; Smithells</a> Scholarships are available to UK applicants. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p> <p><strong>Non-UK</strong>&nbsp;&ndash; The&nbsp;<a href="">School of Computing Scholarship</a> is available to support the additional academic fees of non-UK applicants. The&nbsp;<a href="">China Scholarship Council - University of Leeds Scholarship</a>&nbsp;is available to nationals of China. The&nbsp;<a href="">Leeds Marshall Scholarship</a>&nbsp;is available to support US citizens. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p>

<h2 class="heading">Contact details</h2>

<p>For further information regarding your application, please contact Doctoral College Admissions<br /> e:&nbsp;<a href=""></a>, t: +44 (0)113 343 5057.</p> <p>For further information regarding the project, please contact Professor Eric Atwell<br /> e:&nbsp;<a href=""></a></p>

<h3 class="heading heading--sm">Linked research areas</h3>