
AI-MWE: Multi-Word Expressions, word-embedding models, and ontology terms



Key facts

Type of research degree
Application deadline
Sunday 3 July 2022
Project start date
Saturday 1 October 2022
Country eligibility
International (open to all nationalities, including the UK)
Lead supervisor
Professor Eric Atwell
School of Computing
Research groups/institutes
Artificial Intelligence
<h2 class="heading hide-accessible">Summary</h2>

This project will investigate the interaction between Multi-Word Expressions or phrases, word-embedding models, and ontology terms. Multi-Word Expressions have meaning not predictable from their constituent words. For example, a red herring is a clue which is misleading or distracting, and this is not predictable from the meanings of red and herring. A word embedding is a vector of numbers representing the meaning of a word, calculated from examples of uses of the word in a large corpus. An ontology term is a word or phrase labelling a concept or category in an ontology: a graph of concepts and categories in a specialised subject area or domain.<br /> <br /> Word embedding models build compact (embedded) vector representations of the meaning of each word-type in a corpus, capturing the set of distributional contexts that the word-tokens occupy in the corpus. This assumes that the corpus text is tokenised into word-tokens separated by spaces. This assumption works in general for English text, but it does not handle Multi-Word Expressions correctly.<br /> To extract a dictionary of Multi-Word Expressions from a corpus, a standard approach, eg (Alghamdi and Atwell 2019), is first to extract a set of candidate multi-word patterns which meet some statistical collocational criteria (words frequently found together), and then manually check whether these candidates meet some semantic criteria (eg the phrase meaning differs from the word meanings).<br /> <br /> In this project, you will investigate how to apply word embedding models as semantic criteria to identify multi-word units whose meaning is not directly predictable from their constituent word meanings. For example, tokenise red herring as one word and generate a word embedding model of the contexts of red herring; then build another word embedding model with independent word embedding vectors for red and herring.
If the vector for red herring is significantly different from the vectors for red and herring, this indicates a Multi-Word Expression, eg see (Pickard 2020).<br /> If word embedding models can help to identify Multi-Word Expressions in a corpus, this knowledge can in turn be used to tokenise the corpus better: for example, red herring will be tokenised as one unit, not two. This should generate better word-embedding models, where context-vectors take into account Multi-Word Expressions. For example, the word clue includes red herring in its context, rather than red and herring.<br /> <br /> This should lead to better tokenisation and vector-models for both words and Multi-Word Expressions.<br />
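The comparison described above can be sketched in a few lines of Python. This is a minimal illustration only, not the project method: the context vectors below are hand-made, hypothetical values standing in for trained word2vec or GloVe embeddings, averaging the constituent vectors is just one simple compositional baseline, and the 0.5 threshold is an arbitrary illustrative cut-off.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical context vectors over four made-up dimensions
# (fish, colour, mystery, misleading) -- not trained values.
vec = {
    "red":         [0.1, 0.9, 0.0, 0.1],
    "herring":     [0.9, 0.1, 0.0, 0.0],
    "red_herring": [0.1, 0.0, 0.8, 0.9],  # phrase tokenised as one unit
}

# A simple compositional baseline: average the constituent word vectors.
composed = [(a + b) / 2 for a, b in zip(vec["red"], vec["herring"])]

sim = cosine(vec["red_herring"], composed)
print(f"cosine(phrase, composed) = {sim:.3f}")

# Low similarity between the phrase vector and the composed vector
# suggests non-compositional meaning, ie a Multi-Word Expression.
if sim < 0.5:  # illustrative threshold
    print("red herring flagged as a candidate Multi-Word Expression")
```

In practice the vectors would come from embedding models trained on a corpus tokenised in two ways (with red herring as one token, and as two separate tokens), as in (Pickard 2020), and the similarity threshold would be calibrated against a manually checked list of Multi-Word Expressions rather than fixed in advance.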

<h2 class="heading hide-accessible">Full description</h2>

<p>Multi-Word Expressions have practical applications. Advanced learners of English, Arabic, or other languages need to extend their vocabulary beyond single words, to learn the meanings and uses of these multi-word expressions. In the language of a specialised domain, Multi-Word Expressions are often used as terms to label concepts in an ontology. For example, an ontology of crime fiction may include red herring as the label for a concept related to crime detection. Corpus linguists have studied and used corpus-based distributional lexical semantics since the 1950s in language teaching and applied linguistics; but the past decade has seen distributional lexical semantics taken over as the basis of deep learning word embedding modelling, which has dominated natural language processing research and &ldquo;... has come into the field like a schoolyard bully, forcing everything that&rsquo;s not computational into submission, collusion or the margins&rdquo; (Adam Kilgarriff, SketchEngine founder).</p> <p>REFERENCES</p> <p>Alghamdi A, Atwell E. 2019. Constructing a corpus-informed list of Arabic formulaic sequences (ArFSs) for language pedagogy and technology. International Journal of Corpus Linguistics. 24(2), pp. 202-228.</p> <p>Alrehaili S, Atwell E. 2017. Extraction of Multi-Word Terms and Complex Terms from the Classical Arabic Text of the Quran. International Journal on Islamic Applications in Computer Science And Technology. 5(3), pp. 15-27.</p> <p>Atwell E, Roberts A. 2007. CHEAT: combinatory hybrid elementary analysis of text. Proc Corpus Linguistics 2007.</p> <p>Baroni M et al. 2014. Don&rsquo;t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proc ACL&rsquo;2014, pp. 238-247.</p> <p>Devlin J et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc NAACL&rsquo;2019.</p> <p>Kilgarriff A et al. 2014. The Sketch Engine: ten years on. Lexicography. 1, pp. 7-36.</p> <p>Kudo T, Richardson J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.</p> <p>Mikolov T et al. 2013. Efficient Estimation of Word Representations in Vector Space. Proc ICLR&rsquo;2013.</p> <p>Mikolov T et al. 2013. Distributed Representations of Words and Phrases and their Compositionality. Proc NIPS&rsquo;2013.</p> <p>Mikolov T et al. 2013. Linguistic Regularities in Continuous Space Word Representations. Proc NAACL-HLT&rsquo;2013.</p> <p>Pennington J et al. 2014. GloVe: Global Vectors for Word Representation. Proc EMNLP&rsquo;2014, pp. 1532-1543.</p> <p>Pickard T. 2020. Comparing word2vec and GloVe for Automatic Measurement of MWE Compositionality. Proc Multiword Expressions and Electronic Lexicons, pp. 95-100.</p> <p>Roberts W, Egg M. 2018. A Large Automatically-Acquired All-Words List of Multiword Expressions Scored for Compositionality. Proc LREC&rsquo;2018.</p> <p>SketchEngine. 2017. Embedding Viewer.</p> <p>Wu Y et al. 2016. Google&rsquo;s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.</p>

<h2 class="heading">How to apply</h2>

<p>Formal applications for research degree study should be made online through the&nbsp;<a href="">University&#39;s website</a>. Please state clearly in the research information section that the research degree you wish to be considered for is&nbsp;<em><strong>AI-MWE: Multi-Word Expressions, word-embedding models, and ontology terms</strong></em>, and name <a href="">Professor Eric Atwell</a> as your proposed supervisor.</p> <p>If English is not your first language, you must provide evidence that you meet the University&#39;s minimum English language requirements (below).</p>

<h2 class="heading heading--sm">Entry requirements</h2>

Applicants to research degree programmes should normally have at least a first class or an upper second class British Bachelors Honours degree (or equivalent) in an appropriate discipline. The criteria for entry to some research degrees may be higher; for example, several faculties also require a Masters degree. Applicants are advised to check with the relevant School prior to making an application; those who are uncertain about the requirements for a particular research degree should contact the School or Graduate School before applying.

<h2 class="heading heading--sm">English language requirements</h2>

The minimum English language entry requirement for postgraduate research study is IELTS 6.5 overall, with at least 6.5 in writing and at least 6.0 in reading, listening and speaking, or equivalent. The test must be dated within two years of the start date of the course in order to be valid. Some schools and faculties have a higher requirement.

<h2 class="heading">Funding on offer</h2>

<p><strong>Self-funded or externally sponsored students are welcome to apply.</strong></p> <p><strong>UK</strong>&nbsp;&ndash;&nbsp;The&nbsp;<a href="">Leeds Doctoral Scholarships</a>, <a href="">School of Computing Scholarship</a>, <a href="">Akroyd &amp; Brown</a>, <a href="">Frank Parkinson</a> and <a href="">Boothman, Reynolds &amp; Smithells</a> Scholarships are available to UK applicants. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p> <p><strong>Non-UK</strong>&nbsp;&ndash; The&nbsp;<a href="">School of Computing Scholarship</a> is available to support the additional academic fees of non-UK applicants. The&nbsp;<a href="">China Scholarship Council - University of Leeds Scholarship</a>&nbsp;is available to nationals of China. The&nbsp;<a href="">Leeds Marshall Scholarship</a>&nbsp;is available to support US citizens. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p>

<h2 class="heading">Contact details</h2>

<p>For further information regarding your application, please contact Doctoral College Admissions<br /> e:&nbsp;<a href=""></a>, t: +44 (0)113 343 5057.</p> <p>For further information regarding the project, please contact Professor Eric Atwell<br /> e:&nbsp;<a href=""></a></p>

<h3 class="heading heading--sm">Linked research areas</h3>