Semi-Semantic Annotation: A guideline for the URDU.KON-TB treebank POS annotation

Qaiser ABBAS


This work elaborates the semi-semantic part-of-speech annotation guidelines for the URDU.KON-TB treebank, an annotated corpus of Urdu. A hierarchical annotation scheme was designed to label parts of speech and was then applied to the corpus. The raw corpus was collected from the Urdu Wikipedia and the Jang newspaper and then annotated with the proposed semi-semantic part-of-speech labels. It contains text on local and international news, social stories, sports, culture, finance, religion, travel, and other topics. This exercise contributed the part-of-speech annotation layer of the URDU.KON-TB treebank. The twenty-two main part-of-speech categories are divided into subcategories, which capture the morphological and semantic information encoded in the labels. This article mainly reports the annotation guidelines; it also briefly describes the development of the URDU.KON-TB treebank, including the collection of the raw corpus, the design and application of the annotation scheme, and its statistical evaluation and results. The guidelines presented here should be useful to the linguistic community for annotating sentences not only in Urdu, the national language, but also in other indigenous languages such as Punjabi, Sindhi, and Pashto.


Keywords: semi-semantic part of speech; rich information; deep learning; parsing aid; linguistically motivated annotation; humanistic annotation








Copyright (c) 2016 Qaiser Abbas, Miriam Butt

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Ljubljana University Press, Faculty of Arts
(Znanstvena založba Filozofske fakultete Univerze v Ljubljani) 

Online ISSN: 2232-3317