JaSlo: Integration of a Japanese-Slovene Bilingual Dictionary with a Corpus Search System

  • Kristina HMELJAK SANGAWA University of Ljubljana
  • Tomaž ERJAVEC Jožef Stefan Institute
Keywords: bilingual lexicography, corpus search, parallel corpus, readability level

Abstract

The paper presents a set of integrated on-line language resources for learners of Japanese, primarily those whose mother tongue is Slovene. The resources consist of the on-line Japanese-Slovene learners’ dictionary jaSlo and two corpora: a 1-million-word Japanese-Slovene parallel corpus and a 300-million-word corpus of web pages in which each word and sentence is marked with its difficulty level. The web corpus is also available as a set of five distinct corpora, each containing the sentences of one particular level. The corpora can be explored through NoSketch Engine, the open-source version of the commercial state-of-the-art corpus analysis software Sketch Engine. The dictionary can be searched on the Web, and its entries link directly to examples from the corpora, thus offering a wider picture of a) possible translations in concrete contextualised examples, and b) monolingual Japanese usage examples at different difficulty levels to support language learning.

Published
2012-12-20
How to Cite
HMELJAK SANGAWA, K., & ERJAVEC, T. (2012). JaSlo: Integration of a Japanese-Slovene Bilingual Dictionary with a Corpus Search System. Acta Linguistica Asiatica, 2(3), 125-140. https://doi.org/10.4312/ala.2.3.125-140
Section
Research Articles (Project Reports)