AI-Dialect: Social media corpus resources for minority language varieties and dialects

PGR-P-1039

Coronavirus information for applicants and offer holders

We hope that by the time you’re ready to start your studies with us the situation with COVID-19 will have eased. However, please be aware that we will continue to review our courses and other elements of the student experience in response to COVID-19, and we may need to adapt our provision to ensure students remain safe. For the most up-to-date information on COVID-19, regularly visit our website, which we will continue to update as the situation changes: www.leeds.ac.uk/covid19faqs

Key facts

Type of research degree: PhD
Application deadline: Sunday 3 July 2022
Project start date: Saturday 1 October 2022
Country eligibility: International (open to all nationalities, including the UK)
Funding: Non-funded
Supervisors: Professor Eric Atwell
Schools: School of Computing
Research groups/institutes: Artificial Intelligence
<h2 class="heading hide-accessible">Summary</h2>

AI text analytics resources and linguistic models are well developed only for major standard languages such as English, German and French. For this project, you will develop a toolkit to collect a corpus, or text data-set, for under-resourced minority language varieties and dialects. This is useful for linguistics research, and also for text analytics applications, for example adapting web search, advertising and digital education to end-user languages and dialects.

Linguists researching indigenous or local dialects and languages have traditionally focused on listening to informants and audio recordings of dialect speakers as sources of evidence. Few written text corpus resources exist for dialects, because most dialect speakers are taught formal writing in the standard language; for example, most Arabic text is written in Modern Standard Arabic (MSA), not in Arabic dialects. Social media such as Twitter and Facebook have enabled the general public to write informally, including in dialects, and this offers a new data source for the study and modelling of dialects and language varieties. It is possible to collect large amounts of data: millions of words of text can be scraped from Twitter, Facebook etc. (e.g. Alshutayri and Atwell 2017, 2018, 2019). Also, there is no need to transcribe audio recordings (Alshutayri et al 2016), as the data is already text.

But there are also disadvantages. Some dialect research focuses on phonetic variation; this is captured only indirectly in dialect social media text. Tweets may include non-dialect data, such as code-switching between dialect, MSA and/or English, French etc., as well as emojis, URLs and other non-language content. Social media posts may include personal data, so they need to be anonymized, and permissions need to be acquired. It is not straightforward to identify a set of tweets representative of a specific dialect. For example, one strategy is to compile a seed-list of words used only in the given dialect, then collect only tweets including one or more of these dialect words; but this leaves out many tweets by speakers of the dialect, and results in a corpus with frequencies skewed towards the seed words. Another strategy is to collect from specific locations, for example main cities in a dialect region; this can include dialect tweets lacking the seed terms. However, this also includes tweets by visitors who are not dialect speakers (for example, not all tweets posted in Riyadh are by Najdi Arabic speakers), and it leaves out dialect tweets from outside main population centres and from diaspora speakers who have left home.
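
The seed-list strategy and its frequency skew can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the seed words, the sample posts (invented, not real tweets) and the helper names `tokenize`, `filter_by_seed` and `relative_frequency` are hypothetical, and the cleaning step only strips URLs and @mentions; it does not handle emojis, punctuation, code-switching or anonymization.

```python
import re
from collections import Counter

# Toy seed list for illustration only; a real list would come from dialect experts.
SEED_WORDS = {"dinnae", "cannae"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def tokenize(post):
    """Strip URLs and @mentions, lowercase, and split on whitespace."""
    text = URL_RE.sub(" ", post)
    text = MENTION_RE.sub(" ", text)
    return text.lower().split()


def filter_by_seed(posts, seeds):
    """Return token lists for the posts that contain at least one seed word."""
    kept = []
    for post in posts:
        tokens = tokenize(post)
        if seeds.intersection(tokens):
            kept.append(tokens)
    return kept


def relative_frequency(token_lists, word):
    """Share of all tokens in token_lists that are equal to word."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Invented sample posts (not real tweets).
POSTS = [
    "I dinnae ken what he means https://example.com/a",  # contains a seed word
    "@sam we cannae go the day",                          # contains a seed word
    "it was pure dreich this morning",                    # dialect, but no seed word
    "the train to london is late again",                  # not dialect
]

all_tokens = [tokenize(p) for p in POSTS]
seed_corpus = filter_by_seed(POSTS, SEED_WORDS)

print(len(seed_corpus), "of", len(POSTS), "posts kept")
print("dinnae, all posts:  ", round(relative_frequency(all_tokens, "dinnae"), 3))
print("dinnae, seed corpus:", round(relative_frequency(seed_corpus, "dinnae"), 3))
```

In this toy run the third post is dropped even though it is written in dialect, and the seed word makes up a larger share of the filtered corpus than of the full sample, which is the skew described above.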

<h2 class="heading hide-accessible">Full description</h2>

<p>Your project will investigate these and other challenges, and address them through a methodology and software toolkit for building a repository of minority dialects. For example, you could collect an Arabic Dialect Corpus, with representative samples of each Arabic dialect; or an International Corpus of English, with representative samples of each national variety of English. For this project, you will be delivering data, software and results to one or more dialect researchers, to aid their research into under-resourced dialects in languages such as Russian (Sharoff 2018), Ukrainian (Babych 2017), Mehri (Watson 2012), Arabic (Dickins 2010), and/or English (Douglas 2009).</p>
<p>Machine Learning language models should be learnable from any language, but the training resources and linguistic models are well developed only for &ldquo;big&rdquo; languages like English, German, French etc. Deep Learning applied to a text corpus produces word-embedding and phrase-embedding models; these are widely used in Natural Language Processing, as a way to capture the meaning of a word, phrase or short text as a vector of numbers. Deep Learning usually requires a large corpus to extract or learn the embeddings; for under-resourced languages and dialects, we may not have enough data for standard Deep Learning. Adaptation or Transfer Learning may improve the models of lesser-resourced languages by taking into account the resources available for closely related languages (Rios and Sharoff 2016). Having collected a dialect corpus, you can explore methods to learn embeddings from limited data-sets, and/or transfer embedding models learnt from &ldquo;big languages&rdquo; to handle related &ldquo;small languages&rdquo; (Sharoff 2018, Adams et al 2017). Transfer Learning may be informed by linguistic knowledge of the target language and mappings from source to target, formalized into Transfer Learning representations (Yang et al 2016, Ruder et al 2017, Alosaimy and Atwell 2017), which may make use of morphological or sub-word patterns (Soricut and Och 2015, Bojanowski et al 2017). Examples include Transfer Learning from Russian to Ukrainian (Babych 2017), or from Arabic to Mehri (Watson 2012); or from a language model for standard English or standard Arabic to an under-resourced dialect or variety, such as Classical Arabic (Alosaimy and Atwell 2017), Sudanese Arabic (Dickins 2010), New Zealand English (Vine 2017) or Scottish English (Douglas 2009).</p>
<p>This will enable researchers and industry to extend Deep Learning NLP and Text Analytics methods and tools to a wider range of minority languages and dialects, for tasks such as language/dialect identification (Tulkens et al 2016), semantic tagging and parsing of texts (Bordes et al 2012), clustering or classification of texts (Wang et al 2015), learning and understanding the Quran and the Bible (Alturayeif 2017) and Lexical Computing (SketchEngine 2017).</p>
<p>REFERENCES</p>
<p>Adams, O. et al. (2017). Cross-Lingual Word Embeddings for Low-Resource Language Modeling.</p>
<p>Alosaimy, A. and Atwell, E. (2017). Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers.</p>
<p>Alshutayri, A. et al. (2016). Arabic Language WEKA-Based Dialect Classifier for Arabic Automatic Speech Recognition Transcripts. Proceedings of VarDial&rsquo;2016, pp. 204-211.</p>
<p>Alshutayri, A. and Atwell, E. (2017). Exploring Twitter as a Source of an Arabic Dialect Corpus. International Journal of Computational Linguistics (IJCL), 8(2), pp. 37-44.</p>
<p>Alshutayri, A. and Atwell, E. (2018). Arabic dialects annotation using an online game. Proceedings of ICNLSP&rsquo;2018.</p>
<p>Alshutayri, A. and Atwell, E. (2018). Creating an Arabic Dialect Text Corpus by Exploring Twitter, Facebook, and Online Newspapers. Proceedings of OSACT&rsquo;2018.</p>
<p>Alshutayri, A. and Atwell, E. (2019). A Social Media Corpus of Arabic Dialect Text. In: Stemle, E. and Wigham, C. (eds.) Computer-mediated communication: building corpora for sociolinguistic analysis.</p>
<p>Alturayeif, N. (2017). Text Mining and Similarity Measures of Quran and Bible.</p>
<p>Atwell, E. (2018). Classical and modern Arabic corpora: Genre and language change. In: Whitt, R.J. (ed.) Diachronic Corpora, Genre, and Language Change. Studies in Corpus Linguistics. John Benjamins, pp. 65-91.</p>
<p>Atwell, E. (2019). Using the Web to model Modern and Qur&rsquo;anic Arabic. In: McEnery, T., Hardie, A. and Younis, N. (eds.) Arabic Corpus Linguistics. Edinburgh University Press, pp. 100-119.</p>
<p>Babych, B. (2017). Unsupervised induction of morphological lexicon for Ukrainian. To appear in Proc CAMRL&rsquo;2017.</p>
<p>Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information.</p>
<p>Bordes, A. et al. (2012). Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing.</p>
<p>Dickins, J. (2010). Basic Sentence Structure in Sudanese Arabic.</p>
<p>Douglas, F. (2009). Scottish Newspapers, Language and Identity.</p>
<p>Rios, M. and Sharoff, S. (2016). Language adaptation for extending post-editing estimates for closely related languages.</p>
<p>Ruder, S. et al. (2017). A Survey of Cross-lingual Word Embedding Models.</p>
<p>Sharoff, S. (2018). Language adaptation experiments via cross-lingual embeddings for related languages.</p>
<p>SketchEngine. (2017). Embedding Viewer.</p>
<p>Soricut, R. and Och, F. (2015). Unsupervised morphology induction using word embeddings.</p>
<p>Tarmom, T. et al. (2020). Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study. Natural Language Engineering.</p>
<p>Tulkens, S. et al. (2016). Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource.</p>
<p>Vine, B. (2017). Archive of New Zealand English.</p>
<p>Wang, P. et al. (2015). Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification.</p>
<p>Watson, J. (2012). The structure of Mehri.</p>
<p>Yang, Z. et al. (2016). Multi-Task Cross-Lingual Sequence Tagging from Scratch.</p>

<h2 class="heading">How to apply</h2>

<p>Formal applications for research degree study should be made online through the <a href="https://www.leeds.ac.uk/research-applying/doc/applying-research-degrees">University&#39;s website</a>. Please state clearly in the research information section that the research degree you wish to be considered for is <em><strong>AI-Dialect: Social media corpus resources for minority language varieties and dialects</strong></em>, and name <a href="https://eps.leeds.ac.uk/computing/staff/33/professor-eric-atwell">Professor Eric Atwell</a> as your proposed supervisor.</p> <p>If English is not your first language, you must provide evidence that you meet the University&#39;s minimum English language requirements (below).</p>

<h2 class="heading heading--sm">Entry requirements</h2>

Applicants to research degree programmes should normally have at least a first class or an upper second class British Bachelors Honours degree (or equivalent) in an appropriate discipline. The criteria for entry for some research degrees may be higher; for example, several faculties also require a Masters degree. Applicants are advised to check with the relevant School prior to making an application. Applicants who are uncertain about the requirements for a particular research degree are advised to contact the School or Graduate School prior to making an application.

<h2 class="heading heading--sm">English language requirements</h2>

The minimum English language entry requirement for postgraduate research study is an IELTS of 6.5 overall (with at least 6.5 in writing and at least 6.0 in reading, listening and speaking) or equivalent. The test must be dated within two years of the start date of the course in order to be valid. Some schools and faculties have a higher requirement.

<h2 class="heading">Funding on offer</h2>

<p><strong>Self-funded or externally sponsored students are welcome to apply.</strong></p> <p><strong>UK</strong> &ndash; The <a href="https://phd.leeds.ac.uk/funding/209-leeds-doctoral-scholarships-2022">Leeds Doctoral Scholarships</a>, <a href="https://phd.leeds.ac.uk/funding/53-school-of-computing-scholarship">School of Computing Scholarship</a>, <a href="https://phd.leeds.ac.uk/funding/198-akroyd-and-brown-scholarship-2022">Akroyd &amp; Brown</a>, <a href="https://phd.leeds.ac.uk/funding/199-frank-parkinson-scholarship-2022">Frank Parkinson</a> and <a href="https://phd.leeds.ac.uk/funding/204-boothman-reynolds-and-smithells-scholarship-2022">Boothman, Reynolds &amp; Smithells</a> Scholarships are available to UK applicants. The <a href="https://phd.leeds.ac.uk/funding/60-alumni-bursary">Alumni Bursary</a> is available to graduates of the University of Leeds.</p> <p><strong>Non-UK</strong> &ndash; The <a href="https://phd.leeds.ac.uk/funding/53-school-of-computing-scholarship">School of Computing Scholarship</a> is available to support the additional academic fees of non-UK applicants. The <a href="https://phd.leeds.ac.uk/funding/48-china-scholarship-council-university-of-leeds-scholarships-2021">China Scholarship Council - University of Leeds Scholarship</a> is available to nationals of China. The <a href="https://phd.leeds.ac.uk/funding/73-leeds-marshall-scholarship">Leeds Marshall Scholarship</a> is available to support US citizens. The <a href="https://phd.leeds.ac.uk/funding/60-alumni-bursary">Alumni Bursary</a> is available to graduates of the University of Leeds.</p>

<h2 class="heading">Contact details</h2>

<p>For further information regarding your application, please contact Doctoral College Admissions<br /> e:&nbsp;<a href="mailto:phd@engineering.leeds.ac.uk">phd@engineering.leeds.ac.uk</a>, t: +44 (0)113 343 5057.</p> <p>For further information regarding the project, please contact Professor Eric Atwell<br /> e:&nbsp;<a href="mailto:E.S.Atwell@leeds.ac.uk">E.S.Atwell@leeds.ac.uk</a></p>

