
AI-WEKA: extending WEKA with deep learning text understanding


Coronavirus information for applicants and offer holders

We hope that by the time you’re ready to start your studies with us the situation with COVID-19 will have eased. However, please be aware that we will continue to review our courses and other elements of the student experience in response to COVID-19, and we may need to adapt our provision to ensure students remain safe. For the most up-to-date information on COVID-19, regularly visit our website, which we will continue to update as the situation changes.

Key facts

Type of research degree
Application deadline: Sunday 3 July 2022
Project start date: Saturday 1 October 2022
Country eligibility: International (open to all nationalities, including the UK)
Supervisor: Professor Eric Atwell, School of Computing
Research groups/institutes: Artificial Intelligence
<h2 class="heading hide-accessible">Summary</h2>

WEKA is an open-source Java toolkit for data mining and text analytics, used for teaching and research in universities (e.g. the University of Leeds) and for practical applications in industry (Hall et al. 2009). It enables non-programmers to run experiments in machine learning, data mining and text analytics: researchers can apply a wide range of data pre-processing filters, machine learning algorithms and evaluation metrics via a simple graphical user interface, without having to write and run code in Python, R or another programming language. Your project will contribute to WEKA by adding new functionality to the existing code-base.

WEKA currently accepts data input in ARFF, CSV and some other formats, and has basic filters to convert text strings to number vectors for data mining. One possible topic is to extend WEKA's current String-to-Word-Vector filter. The filter currently performs basic tokenisation by splitting on spaces; you will integrate more sophisticated tokenisation based on morphological analysis or subword models such as WordPiece (Wu et al. 2016) and SentencePiece (Kudo and Richardson 2018). You will also extend the tokenisation to handle text in languages other than English, for example Arabic, by integrating appropriate morphological analysis.

Word sequences are currently converted to one-hot vectors; you will extend this to other vector representations such as word embeddings (Mikolov et al. 2013a,b,c; Pennington et al. 2014). You can take word embedding vectors from pre-trained deep learning models, allowing WEKA users to encode their text data as word embedding vectors. WEKA also offers a wide range of traditional classifiers, but few neural network or deep learning classifiers to choose from. Another extension for your project is therefore to add deep learning classifiers such as BERT (Devlin et al. 2019) to the range available in WEKA.
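As an illustrative sketch of the subword tokenisation idea mentioned above (this is not WEKA code, and the toy vocabulary is invented for the example), WordPiece-style greedy longest-match-first tokenisation can be written in a few lines of plain Java:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WordPieceSketch {
    // Hypothetical toy vocabulary; a real system learns thousands of
    // subword units from a corpus (Wu et al. 2016). "##" marks a piece
    // that continues a word rather than starting one.
    static final Set<String> VOCAB = Set.of(
        "un", "play", "##happi", "##ness", "##ing");

    // Greedy longest-match-first subword tokenisation: repeatedly take
    // the longest vocabulary entry that matches at the current position.
    public static List<String> tokenize(String word) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            while (end > start) {
                String piece = word.substring(start, end);
                if (start > 0) piece = "##" + piece; // continuation marker
                if (VOCAB.contains(piece)) { match = piece; break; }
                end--; // shrink the candidate and try again
            }
            if (match == null) return List.of("[UNK]"); // no sub-token fits
            tokens.add(match);
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("unhappiness")); // [un, ##happi, ##ness]
        System.out.println(tokenize("playing"));     // [play, ##ing]
    }
}
```

A production implementation would load its vocabulary from a trained model file and handle whitespace and punctuation splitting before the subword pass; the core greedy loop, however, is essentially the one shown.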
You will investigate state-of-the-art deep learning models used in current text analytics research, and add one or more of these to the WEKA toolkit. To demonstrate and evaluate your WEKA extensions, you will apply them in several case-study text analytics evaluation tasks, e.g. (Devlin et al. 2019), (Roberts and Egg 2018), (Pickard 2020). Deployment will be an important part of your project: once you have implemented and tested your solution, you will also need to formally contribute your software (and documentation) to the WEKA community, to enable others to benefit. If you succeed, your project deliverables could be re-used by AI researchers and students worldwide.
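The compositionality case studies above (Pickard 2020; Roberts and Egg 2018) compare word vectors by cosine similarity, which also illustrates why embeddings are a richer representation than one-hot vectors. The following self-contained sketch uses invented 3-dimensional vectors, not values from any pre-trained model:

```java
import java.util.List;
import java.util.Map;

public class EmbeddingSketch {
    // Hypothetical 3-dimensional embeddings; real models such as
    // word2vec or GloVe use hundreds of dimensions learned from text.
    static final Map<String, double[]> EMBEDDINGS = Map.of(
        "king",  new double[]{0.8, 0.1, 0.6},
        "queen", new double[]{0.7, 0.2, 0.6},
        "car",   new double[]{0.1, 0.9, 0.0});

    // One-hot encoding: a vector with a single 1 at the word's index,
    // so every pair of distinct words is orthogonal.
    public static double[] oneHot(List<String> vocab, String word) {
        double[] v = new double[vocab.size()];
        int i = vocab.indexOf(word);
        if (i >= 0) v[i] = 1.0;
        return v;
    }

    // Cosine similarity: dot product of a and b over the product of
    // their Euclidean norms.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("king", "queen", "car");
        // One-hot vectors give similarity 0 for any two distinct words ...
        System.out.println(cosine(oneHot(vocab, "king"), oneHot(vocab, "queen"))); // 0.0
        // ... while embeddings can reflect that "king" is closer to
        // "queen" than to "car".
        System.out.println(cosine(EMBEDDINGS.get("king"), EMBEDDINGS.get("queen")) >
                           cosine(EMBEDDINGS.get("king"), EMBEDDINGS.get("car"))); // true
    }
}
```

In the compositionality studies, the same cosine measure is applied between the vector of a multiword expression and the composed vectors of its parts; a low similarity suggests the expression is non-compositional.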

<h2 class="heading hide-accessible">Full description</h2>

<p>WEKA:&nbsp;<a href=""></a></p> <p>REFERENCES:</p> <p>Devlin, J. et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proc. NAACL 2019.</p> <p>Hall, M. et al. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations 11(1), pp. 10-18.</p> <p>Kudo, T. and Richardson, J. 2018. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.</p> <p>Mikolov, T. et al. 2013a. Efficient Estimation of Word Representations in Vector Space. Proc. ICLR 2013.</p> <p>Mikolov, T. et al. 2013b. Distributed Representations of Words and Phrases and their Compositionality. Proc. NIPS 2013.</p> <p>Mikolov, T. et al. 2013c. Linguistic Regularities in Continuous Space Word Representations. Proc. NAACL-HLT 2013.</p> <p>Pennington, J. et al. 2014. GloVe: Global Vectors for Word Representation. Proc. EMNLP 2014, pp. 1532-1543.</p> <p>Pickard, T. 2020. Comparing word2vec and GloVe for Automatic Measurement of MWE Compositionality. Proc. Multiword Expressions and Electronic Lexicons, pp. 95-100.</p> <p>Roberts, W. and Egg, M. 2018. A Large Automatically-Acquired All-Words List of Multiword Expressions Scored for Compositionality. Proc. LREC 2018.</p> <p>Wu, Y. et al. 2016. Google&#39;s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.</p>

<h2 class="heading">How to apply</h2>

<p>Formal applications for research degree study should be made online through the&nbsp;<a href="">University&#39;s website</a>. Please state clearly in the research information section that the research degree you wish to be considered for is <em><strong>AI-WEKA: extending WEKA with deep learning text understanding</strong></em>, and name&nbsp;<a href="">Professor Eric Atwell</a> as your proposed supervisor.</p> <p>If English is not your first language, you must provide evidence that you meet the University&#39;s minimum English language requirements (below).</p>

<h2 class="heading heading--sm">Entry requirements</h2>

Applicants to research degree programmes should normally have at least a first class or an upper second class British Bachelors Honours degree (or equivalent) in an appropriate discipline. The entry criteria for some research degrees may be higher; for example, several faculties also require a Masters degree. Applicants who are uncertain about the requirements for a particular research degree are advised to check with the relevant School or Graduate School prior to making an application.

<h2 class="heading heading--sm">English language requirements</h2>

The minimum English language entry requirement for postgraduate research study is IELTS 6.5 overall, with at least 6.5 in writing and at least 6.0 in reading, listening and speaking, or equivalent. The test must be dated within two years of the start date of the course in order to be valid. Some schools and faculties have a higher requirement.

<h2 class="heading">Funding on offer</h2>

<p><strong>Self-funded or externally sponsored students are welcome to apply.</strong></p> <p><strong>UK</strong>&nbsp;&ndash;&nbsp;The&nbsp;<a href="">Leeds Doctoral Scholarships</a>, <a href="">School of Computing Scholarship</a>, <a href="">Akroyd &amp; Brown</a>, <a href="">Frank Parkinson</a> and <a href="">Boothman, Reynolds &amp; Smithells</a> Scholarships are available to UK applicants. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p> <p><strong>Non-UK</strong>&nbsp;&ndash;&nbsp;The&nbsp;<a href="">School of Computing Scholarship</a>&nbsp;is available to support the additional academic fees of non-UK applicants. The&nbsp;<a href="">China Scholarship Council - University of Leeds Scholarship</a>&nbsp;is available to nationals of China. The&nbsp;<a href="">Leeds Marshall Scholarship</a>&nbsp;is available to support US citizens. The <a href="">Alumni Bursary</a> is available to graduates of the University of Leeds.</p>

<h2 class="heading">Contact details</h2>

<p>For further information regarding your application, please contact Doctoral College Admissions<br /> e:&nbsp;<a href=""></a>, t: +44 (0)113 343 5057.</p> <p>For further information regarding the project, please contact Professor Eric Atwell<br /> e:&nbsp;<a href=""></a></p>

<h3 class="heading heading--sm">Linked research areas</h3>