TVITERAŠI OR TWITTERAŠI? PRODUCING AND ANALYSING A NORMALISED DATASET OF CROATIAN AND SERBIAN TWEETS

In this paper we discuss the parallel manual normalisation of samples extracted from Croatian and Serbian Twitter corpora. We describe the datasets, outline the unified guidelines provided to annotators, and present a series of analyses of standard-to-non-standard transformations found in the Twitter data. The results show that closed part-of-speech classes are transformed more frequently than the open classes, that the most frequently transformed lemmas are auxiliary and modal verbs, interjections, particles and pronouns, that character deletions are more frequent than insertions and replacements, and that more transformations occur at the word end than in other positions. Croatian and Serbian are found to share many, but not all transformation patterns; while some of the discrepancies can be ascribed to the structural differences between the two languages, others appear to be better explained by looking at extralinguistic factors. The produced datasets and their initial analyses can be used for studying the properties of nonstandard language, as well as for developing language technologies for nonstandard data.


INTRODUCTION
Since the beginning of its wider use, computer-mediated communication (CMC) has been attracting a lot of attention in fields ranging from communication studies to natural language processing (NLP). On the one hand, CMC is seen as an important source of knowledge and opinions (Crystal 2011); on the other hand, its lexical and structural properties are a well-established research topic in linguistics and NLP. CMC occurs under special technical and social circumstances (Noblia 1998), imposing specific communicative needs and practices (Tagg 2012); as a consequence, its language often deviates from the norms of traditional text production, instantiating numerous non-standard features at all levels, from unorthodox spelling to colloquial and other out-of-vocabulary (OOV) lexis, as well as simplified syntax (see e.g. Kaufmann, Kalita 2010).
The non-standard features of CMC are particularly important for NLP, as deviations from the norm make CMC difficult to process automatically, and tools developed for standard languages perform notoriously poorly when applied to CMC data. This is evidenced by decreases in performance across the entire text processing chain, from tokenisation (Eisenstein 2013) and part-of-speech tagging (Gimpel et al. 2011) to sentence parsing (Petrov, McDonald 2012). The non-standard features of CMC have been analysed both qualitatively and quantitatively (Eisenstein 2013; Hu et al. 2013), and different strategies have been proposed for dealing with non-standardness: adapting standard tools to work on non-standard data (Gimpel et al. 2011), using pre-processing steps to tackle CMC-specific phenomena (Foster et al. 2011), and normalising CMC corpora, i.e. using a dedicated annotation level in which standard forms are assigned to non-standard words (Kaufmann, Kalita 2010; Liu et al. 2011).
In this paper we adopt the normalisation-based approach, focusing on Twitter messages (tweets) written in Croatian and Serbian. As one of the most widely used CMC platforms, Twitter has already received a lot of attention in NLP. The number of tweets published per day is counted in the hundreds of millions (Benhardus, Kalita 2013), and the content ranges from news broadcasts and official announcements by companies and institutions to personal thoughts and opinions the users share, making Twitter a rich source of data for NLP tasks related to text mining. To enable these tasks to be performed, automatic lower-level processing is a must, which in turn means that the problem of non-standardness needs to be solved. In the specific case of Twitter, an additional component influencing the structural properties of its language is that messages are constrained by the length restriction of 140 characters. Given the recent availability of basic language tools for standard Croatian and Serbian, a normalisation-based approach was deemed more cost-efficient than an adaptation of standard language tools. Additionally, normalisation gives researchers easy access to the deviations from the standard language that occur in non-standard usage.
Examples of tweets containing non-standard features in Croatian and Serbian are shown in Table 1. These features include phenomena typical of CMC in general, such as phonetic spelling of foreign words (e.g. fešn for fashion), abbreviations (e.g. zg for Zagreb), @ name mentions and emoticons, but also phenomena typical of Twitter like hashtags and some terms (e.g. fave), as well as some language-specific features, such as omission of diacritics (which occurs in both Croatian and Serbian, e.g. kauc for kauč - couch), and the use of fully language-specific dialectal and colloquial non-standard forms (e.g. the Ikavian dialectal form isprid for ispred - in front of in Croatian).

Table 1: Examples of tweets containing non-standard features (columns: Croatian, Serbian).
With the future goal of developing tools for automatic CMC normalisation, we manually normalised a sample of 4000 tweets per language. In the remainder of the paper we first describe the corpus the tweets were sampled from and the samples themselves, moving on to the procedure and the unified Croatian and Serbian guidelines used in the manual normalisation. We then present several initial analyses based on the normalisation outcomes; the analyses were performed starting from the normalised forms and looking towards forms found in the Twitter datasets. Specifically, we look at the distribution of standard -> non-standard transformations across parts of speech and lemmas, as well as the distribution of transformation subtypes (deletions vs. insertions vs. replacements), and we compare Croatian and Serbian. As very little related previous work exists for these languages, our main goals are to give an overview of the key trends, and to compare these trends in the two languages, facilitating the formulation of future specific linguistic hypotheses.

CORPUS CONSTRUCTION AND SAMPLING
The corpus we employ comprises Croatian and Serbian tweets harvested with TweetCat (Ljubešić et al. 2014b), a custom-built tool for collecting tweets written in lesser-used languages. The collection of tweets for both languages took place from 2013 to 2015, resulting in a corpus of about 25 million tokens in Croatian and 205 million tokens in Serbian, after deduplication and the filtering of foreign-language tweets and tweets without linguistically relevant content (i.e. those containing only photos, links, or emoticons).
The sample we used for the manual normalisation task contained a total of 4000 tweets per language, split into four categories with 1000 tweets each. The categories were based on automatically assigned levels of technical (T) and linguistic (L) standardness (Ljubešić et al. 2015), so that 1000 tweets belonged to each of the T1L1, T1L3, T3L1 and T3L3 combinations, with the marks being 1 = standard and 3 = very non-standard (for more detail about the annotation of standardness levels in Twitter corpora of Croatian, Serbian and Slovene see Fišer et al. 2015). These specific categories were included with the goal of sufficiently representing non-standard forms, given that it has been shown that the language of tweets is mostly very standard in Serbian (67% of tweets being annotated with L1, and 30% with L2), and in particular Croatian (73% of tweets being annotated with L1, and 21% with L2), where Twitter is frequently used for dissemination of information by news agencies and other official accounts (Fišer et al. 2015). To ensure enough content was available, only tweets over 100 characters long were included in the sample.
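The sampling scheme described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the dictionary layout of a tweet, the `category` label stored on it, and the function name `stratified_sample` are all assumptions introduced here.

```python
import random

# The four technical/linguistic standardness combinations named in the text.
CATEGORIES = ("T1L1", "T1L3", "T3L1", "T3L3")


def stratified_sample(tweets, per_category=1000, min_length=100, seed=0):
    """Draw up to `per_category` tweets from each category.

    `tweets` is assumed to be a list of dicts with a "text" string and a
    "category" string (hypothetical layout). Only tweets longer than
    `min_length` characters are eligible, mirroring the length filter
    mentioned in the text.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for category in CATEGORIES:
        pool = [
            tweet for tweet in tweets
            if tweet["category"] == category and len(tweet["text"]) > min_length
        ]
        # Never ask for more tweets than the pool holds.
        sample[category] = rng.sample(pool, min(per_category, len(pool)))
    return sample
```

Tweets from other categories (for instance L2 tweets) are simply ignored by this sketch.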
Some tweets in the initial sample were deemed irrelevant for the normalisation task and were excluded from further processing; these were messages that were unintelligible or automatically generated (e.g. news or advert lead-ins), as well as those that were (almost) completely written in a foreign language, and those that contained no linguistic material. After their removal, 3877 tweets (amounting to 89,215 tokens) remained in the Croatian sample, and 3750 tweets (91,877 tokens) in the Serbian one. Finally, due to non-one-to-one mappings (see section 3 for more detail), the token count changed during normalisation, so that the normalised sample comprises 89,542 tokens for Croatian, and 92,236 tokens for Serbian.
After manual normalisation, the normalised sample was automatically linguistically annotated; MSD (morphosyntactic description) tagging and lemmatisation were performed with the tagger and lemmatiser described in Ljubešić et al. (2016b). The accuracy of morphosyntactic tagging (773 different labels) is estimated at ~92%, while part-of-speech tagging (13 different labels) and lemmatisation reach ~98% accuracy.

NORMALISATION PROCEDURE AND GUIDELINES
The manual normalisation was performed using the web-based annotation platform WebAnno, which allows users to define their own annotation levels. In our study, three levels were defined: corrections (tokenisation corrections), sentences (sentence segmentation corrections) and normalisation (linguistic normalisation). Guidelines were developed for each of the three levels, explaining both the technical (WebAnno-related) and the content-related side of interventions. Up to four values could be entered per original token at each level.
Each tweet was normalised independently by two annotators. A curation procedure followed, in which the decisions of the different annotators were compared and cases of inter-annotator disagreement were resolved. For Croatian, the curation procedure was coordinated between the two annotators, while for Serbian the task was performed by an independent curator. The guidelines the annotators received are described in the following subsections.

General rules
The annotators were instructed to identify tweets deemed irrelevant (e.g. due to being automatically generated, see section 2) and mark them for deletion. As for the relevant tweets, overall, a minimal intervention principle was adopted and it was decided not to make corrections that would be impossible, or extremely difficult, for a machine learning algorithm to learn. Context was to be taken into account when resolving potentially problematic issues and ambiguous cases (e.g. in Croatian ko -> kao - as, like, in sreću svu širimo ko zarazu - we spread happiness as if it were a contagious disease, but ko -> tko - who in Ko je ljep? - Who is beautiful?); if an issue could not be resolved based on the context, no normalisations were to be made.

Segmentation and tokenisation
Defining tokens and sentences in CMC is less straightforward than in standard language corpora, and automatic procedures are more error-prone. For this reason, automatic tokenisation and segmentation were manually checked and corrected where needed.
Corrections at the sentence segmentation level relied on punctuation, if present, on other symbols (name mentions designated with @, emoticons/emojis, and hashtags) in case they occupied a position where punctuation would normally be found, and on the annotators' intuition if no explicit symbols were used. Annotators were instructed to only insert a sentence boundary when they were fully confident one was needed, and to pay special attention to sentence-internal use of dots (...) and punctuation sequences such as ?!?!, which can indicate pauses or surprise rather than being sentence boundary markers.
As for tokenisation, guidelines were provided for cases known to be problematic: hyphenated inflectional endings for abbreviations (e.g. BMW-u - to BMW), cases where vowel omission is marked by an apostrophe (e.g. pos'o, from posao - job), and abbreviations ending with a dot (e.g. dr. from drugi - other), which often lead to incorrect automatic splitting of a single token into two or three separate ones. An opposite case that was mentioned was that of word combinations containing hyphens, which are sometimes not separated into multiple tokens when they should be.

Linguistic normalisation
The level we focus on in this paper is normalisation. The main goal of manual normalisation was to provide training data for building tools for automatic normalisation of CMC data, but normalisation in general is also important for the end users of CMC corpora, as it enables them to perform queries based on standard forms, much along the lines of dialectal or diachronic data.
In formulating the normalisation guidelines, we tried to strike a balance between the requirements of machine learning algorithms and those of linguistic analysis. The starting point of our work was the set of guidelines developed for Slovene Twitter data within the JANES project (see Čibej et al. 2016), which were adapted for Croatian and Serbian based on the authors' intuition, consultation with the annotators and other researchers, as well as orthography and grammar manuals of the languages concerned.
Normalisation was restricted to word level, and no word order or syntactic deviations from the standard were corrected. Additional kinds of corrections that were explicitly excluded were those concerning lexical choice (e.g. colloquial words were not 'translated' into their standard equivalents; for instance, komp was not changed into kompjuter - computer), the use of punctuation, usernames and hashtags (regardless of what kind of linguistic material they contained), and ellipsis. In other words, we focused on non-standard forms that can be seen as spelling deviations, not intervening on OOV items that were not misspelt, on style, or on Twitter-specific phenomena.
Finally, due to the complexity of the rules listed in orthography manuals, we decided not to intervene when it came to capitalisation, leaving everything as is, including lower case letters at sentence beginnings.
The following normalisation rules were applied:
• Normalise Croatian/Serbian words making use of foreign letters or letter combinations: shisha -> šiša (he/she cuts hair), chak -> čak (even), kavizzu -> kavicu (coffee)
• Normalise non-standard spellings (regardless of whether they are regional forms, phonetic adaptations, or forms containing an obvious typo, and regardless of whether they are intended or non-intended):
As can be seen from the examples, several of the above rules lead to non-one-to-one mappings between the original and normalised tokens, affecting the total token count discussed in section 2.

DATA ANALYSIS
In this section we present the results of a series of analyses performed on the manually normalised Croatian and Serbian Twitter datasets. In these analyses we look at (1) original tokens, (2) normalised tokens (up to four tokens per one original token), (3) morphosyntactic descriptions automatically assigned to normalised tokens, and (4) lemmata automatically assigned to normalised tokens.
As explained in section 3.3, the normalisation guidelines we used were formulated in terms of descriptive categories, some of which are difficult or impossible to identify automatically. In the analyses we thus look at the normalisation outcomes using more readily identifiable criteria: parts of speech, specific lemmas and surface forms, Levenshtein transformation types, and the position of transformations within words. While in section 3 we dealt with normalisation, i.e. the assignment of standard language forms to non-standard ones, in all analyses the focus is on the opposite direction (standard -> non-standard forms), as our goal is to reconstruct the modifications that take place in non-standard language use compared to the standard; in this case we talk about transformations.

Analysis by part-of-speech
The analysis we dedicate most attention to is based on part-of-speech information assigned to each token in the normalised sample. We first look at part-of-speech distributions in Croatian vs. Serbian CMC, and in CMC vs. standard Croatian. In a second step, we further zoom in on CMC data and compare the distribution of transformations by part of speech in Croatian and Serbian. The results of the comparison of part-of-speech distributions in the Twitter data are shown in Table 2. Both absolute and relative frequencies are shown; the LL column contains the values of the log likelihood statistic, which indicates the degree of significance of the difference between frequencies in Croatian and Serbian data; the +/- sign indicates over/under-use in Croatian compared to Serbian, and a log likelihood value between 3.8 and 6.5 is significant at p<0.05, while a value of 6.6 or more is significant at p<0.01 (Leech et al. 2000: 17; Mair et al. 2002). We also compare the Twitter distributions to the part-of-speech distribution in a standard language dataset for Croatian, hr500k (Ljubešić et al. 2016b); given that a comparable standard dataset for Serbian was not available at the time of writing, here we only look at relative frequencies (%), without conducting statistical tests.
Moving on to the PoS distributions in the two CMC datasets vs. the hr500k standard language dataset, this comparison reveals an expected ten times higher percentage of interjections and the already discussed residuals in CMC data. Furthermore, in CMC there are half as many adjectives as in the standard data, about one-third fewer nouns and one-fourth fewer prepositions, while verbs and pronouns are more present in CMC than in the standard data. Such findings are in line with CMC being a largely informal genre, where a high frequency of verbs compared to nouns is expected (see e.g. Biber et al. 1998: 68 for English).
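For readers who want to reproduce a signed log likelihood value of the kind reported in the LL column, a minimal sketch is given below. It assumes the two-corpus formulation commonly used in corpus linguistics (observed vs. expected frequencies given the two corpus sizes); the function names are introduced here for illustration, and the source does not state which implementation was actually used.

```python
import math


def signed_log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log likelihood for one item.

    freq_a / freq_b: occurrences of the item in corpus A / corpus B.
    size_a / size_b: total number of tokens in corpus A / corpus B.
    The result is positive for over-use in A and negative for under-use.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    ll *= 2
    over_use_in_a = freq_a / size_a >= freq_b / size_b
    return ll if over_use_in_a else -ll


def significance(ll):
    """Map an LL value to the significance bands quoted in the text."""
    value = abs(ll)
    if value >= 6.6:
        return "p<0.01"
    if value >= 3.8:
        return "p<0.05"
    return "not significant"


# Transformed tokens (diacritic omissions discarded) against the normalised
# sample sizes reported in the paper: Croatian is corpus A, Serbian corpus B.
ll = signed_log_likelihood(6156, 89542, 3511, 92236)
print(significance(ll))
```

The sign convention follows the description above: a positive value marks over-use in the first corpus (here Croatian).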
Going back to the Twitter datasets, for each part of speech we also examined the percentages of forms that have been transformed; these results are given in Table 3.

Analysis by lemma and surface form
The next set of analyses focuses on the most frequent lemmata in each of the resources, as well as their comparison to a standard-language resource.The most frequently normalised lemmas and surface forms are analysed as well.
The lists of the most frequent lemmata in the two Twitter datasets and the hr500k standard Croatian dataset are displayed in Table 4. The most obvious difference between the two languages, not traceable to the difference between CMC and standard language, is the higher frequency of the already discussed conjunction da in Serbian. The most obvious difference between the non-standard and standard registers is in the pronoun ja (I, me), which accounts for more than 1% of occurrences in both CMC datasets, while it does not make it into the top 20 entries in standard Croatian. Most other lemmata are present in all three lists, with some slight differences in percentage and rank. The biggest difference in percentage can be observed in punctuation, with the full stop and comma being more frequent in standard Croatian than in non-standard Croatian and Serbian. On the other hand, the ellipsis, the exclamation mark and the question mark make it to either both or one of the lists of non-standard data, but not the standard data list. These divergences seem to point to punctuation not being underused in non-standard language, but rather being used somewhat differently, possibly due to its often expressive nature.
Table 4: The 20 most frequent lemmata in the Croatian and Serbian Twitter datasets and the standard hr500k Croatian dataset.
In Table 5 we show the lemmata that were most frequently transformed in each of the Twitter datasets. For each lemma we report the frequency, overall percentage of the transformed forms this lemma covers, as well as the percentage of all forms of that lemma that were transformed. We again disregard transformations due to diacritic omissions.
Table 5: The 20 most frequently transformed lemmata. The third numerical column describes the proportion of the lemma occurrences that were transformed.
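The three per-lemma figures described above can be computed along the following lines. This is a sketch under stated assumptions: the input is taken to be a list of (lemma, was_transformed) pairs, "frequency" is read as the number of transformed occurrences of the lemma, and the function name is hypothetical.

```python
from collections import Counter


def lemma_transformation_stats(tokens, top=20):
    """Rank lemmata by how often their forms were transformed.

    `tokens` is an iterable of (lemma, was_transformed) pairs; tokens whose
    only change is diacritic omission are assumed to be marked False.
    Each returned row is:
    (lemma, transformed count, % of all transformed tokens,
     % of this lemma's occurrences that were transformed)
    """
    total = Counter()
    transformed = Counter()
    for lemma, was_transformed in tokens:
        total[lemma] += 1
        if was_transformed:
            transformed[lemma] += 1
    all_transformed = sum(transformed.values())
    rows = []
    for lemma, freq in transformed.most_common(top):
        share_of_all = 100 * freq / all_transformed
        share_of_lemma = 100 * freq / total[lemma]
        rows.append((lemma, freq, share_of_all, share_of_lemma))
    return rows
```

A lemma can therefore rank high either because it is very frequent overall or because a large share of its occurrences is non-standard; the last column separates the two cases.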
Many lemmata are present in both lists, with some variation in rank. In Croatian the most frequently transformed lemma is the ellipsis punctuation (...), which occupies the 13th place in Serbian. The overall most frequently transformed forms come from the verb biti (be). In Croatian, biti is followed by a series of function words, while in Serbian two additional verbs make the top five as well: jebati (fuck), mostly due to the high frequency of abbreviations such as jbg (from jebi ga - fuck it), and hteti (want), mostly due to the drop of the initial h, as in oću (hoću - I want) or oće (hoće - he/she wants). The rest of the list mostly consists of function words and Twitter-specific nouns (tweet and Twitter), as well as two proper nouns in Serbian: the name of the current prime minister Aleksandar Vučić (frequently mentioned and sometimes encoded using the initials AV or the form AVučić), and the Serbian capital Belgrade (mostly shortened to Bg or Bgd).
Finally, the 20 most frequently transformed surface forms, omitting those that only lack diacritics, are given in Table 6.
Table 6: The 20 most frequently transformed surface forms in the Croatian and Serbian Twitter datasets.
While some forms are shared between the two lists - for instance jel (je li - is it), al (ali - but), bi (bih - would), ko (kao - like, also tko - who in Croatian) - others (kak for kako - how, tak for tako - like that, ak for ako - if) are specific to Croatian, while abbreviations such as fb (Facebook) and tw (Twitter), min (min. for minute) and god (god. for godina - year), or jbt (jebo te - fuck) and jbg (jebi ga - fuck it) are frequent only in Serbian.

Analysis by transformation type
We start the next analysis by calculating for each language the probability distribution of the three types of Levenshtein transformations, namely deletions, insertions and replacements (Levenshtein 1966), going from the normalised forms to the forms found in tweets.
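To make the procedure concrete, the sketch below extracts one minimal edit script from a normalised form to the form found in a tweet and turns the operation counts into a distribution. It is an illustration only: the function names are introduced here, and where several minimal edit scripts exist the backtrace picks one of them arbitrarily, so the reported positions (though not the operation counts) may differ from other implementations.

```python
from collections import Counter


def levenshtein_ops(source, target):
    """Return one minimal edit script turning `source` into `target`.

    Each operation is (kind, source_index, source_char, target_char),
    with kind in {"delete", "insert", "replace"}.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or replacement
            )
    # Walk back from the bottom-right cell to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def transformation_distribution(pairs):
    """Relative frequency of each operation kind over (normalised, original) pairs."""
    counts = Counter(
        op[0]
        for normalised, original in pairs
        for op in levenshtein_ops(normalised, original)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("delete", "insert", "replace")}
```

For example, the pair (ali, al) yields a single deletion of i, and (kauč, kauc) a single replacement of č by c.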
The results are summarised in Table 7. The numbers in the first three rows capture all transformations, and show that while deletions and insertions are significantly more frequent in Croatian than in Serbian, the opposite is true for replacements. The fact that Serbian has over 10% more replacements than Croatian can be explained by its already mentioned more pronounced tendency towards diacritic omission. In fact, the numbers in the bottom rows, obtained after we discarded the tokens in which the transformations consisted solely in the omission of diacritics, show partly reversed trends: deletions become more frequent in Serbian, and replacements in Croatian. Overall, the most frequent transformation type is character dropping, followed by replacements, roughly half of which in Croatian, and four fifths in Serbian, are due to omission of diacritics.
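Discarding tokens whose only transformation is the omission of diacritics requires a test of the following kind. The sketch is an assumption about how such a filter could look, not a description of the one used: it treats đ as reduced to d (the alternative spelling dj is deliberately not handled), and the names are introduced here.

```python
# Croatian/Serbian Latin letters with diacritics and their stripped forms.
# Mapping đ to d is an assumption; the digraph dj is not covered here.
STRIPPED = {
    "č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "d",
    "Č": "C", "Ć": "C", "Š": "S", "Ž": "Z", "Đ": "D",
}


def is_diacritic_omission_only(normalised, original):
    """True if `original` differs from `normalised` only by dropped diacritics.

    Partial omission counts as well: some diacritics may be kept as long as
    every differing character is a diacritic letter replaced by its base.
    """
    if normalised == original or len(normalised) != len(original):
        return False
    return all(
        o == n or STRIPPED.get(n) == o
        for n, o in zip(normalised, original)
    )


pairs = [("kauč", "kauc"), ("ali", "al"), ("čak", "cak")]
kept = [pair for pair in pairs if not is_diacritic_omission_only(*pair)]
print(kept)
```

Pairs for which the function returns True would be set aside before recomputing the distribution of transformation types.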

Analysis by position of transformation
In the final part of the analysis we focus on the position of transformations (deletions, insertions, replacements) inside the word. Compared to insertions, deletions are more frequently found inside the string, but there is again an emphasis on word end, largely due to final vowel deletions. The corresponding histograms for Serbian can be seen in Figure 2. These histograms show a much less pronounced trend of transformations predominantly being at the end of the string, primarily due to the more frequent omission of diacritics compared to Croatian. This is also reflected in the replacement histogram, where most transformations occur in the second half of the string, but not at its very end. Insertions again have the strongest tendency towards the end of the string, but both insertions and deletions are less biased towards the end than in Croatian.
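A position histogram of this kind can be approximated as sketched below. Note the substitution: instead of a Levenshtein backtrace, the sketch aligns the two strings with `difflib.SequenceMatcher` from the Python standard library, which usually, but not always, yields a minimal edit script. The binning by relative position in the normalised form, the number of bins and the function name are all assumptions made for illustration.

```python
import difflib
from collections import Counter


def position_histogram(pairs, bins=10):
    """Count transformations by relative position in the normalised form.

    `pairs` holds (normalised, original) strings. Every non-matching
    alignment block counts once, at the index where it starts in the
    normalised form, so a run of several inserted characters is one event.
    Bin 0 is the word beginning, the last bin the word end.
    """
    histogram = Counter()
    for normalised, original in pairs:
        if not normalised:
            continue  # relative position is undefined for an empty form
        matcher = difflib.SequenceMatcher(None, normalised, original, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            relative = i1 / len(normalised)
            # An insertion after the last character has relative == 1.0.
            histogram[min(int(relative * bins), bins - 1)] += 1
    return [histogram[b] for b in range(bins)]
```

With two bins, the pairs (ali, al), (hoću, oću) and (vodi, vodiiii) give one event in the first half of the word and two in the second half.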

CONCLUSION
In this paper we presented a sample of Croatian and Serbian tweets manually normalised by following unified annotation guidelines. The produced datasets will be highly useful both for studying the language of CMC and for developing language technologies for CMC data, especially text normalisers that will enable standard language technologies to be used in downstream processing.
We also carried out a series of analyses on the described datasets. Inspecting the overall frequency of transformations, we concluded that Serbian shows a greater tendency towards omitting diacritics, while Croatian is more susceptible to other types of non-standard forms. The distribution of parts of speech in both languages, compared to a standard Croatian dataset, revealed a lower percentage of adjectives and nouns and a higher percentage of verbs in CMC. As for transformations of different parts of speech, the most frequent transformations were those on closed part-of-speech classes. Lemma-based analyses showed the most frequently transformed lemmas to be auxiliary and modal verbs, interjections, particles and pronouns.
Focusing on Levenshtein transformations, we observed that, putting aside diacritic omissions, the most frequent transformations were deletions, the amount of insertions and replacements being similar. Deletions consisted mostly of vowel droppings, while insertions were mostly due to vowel repetitions and prolonged interjections; most replacements were due to diacritic omissions and regional variants. Finally, we found that transformations mostly occurred at word end, and very infrequently at word beginning, especially in Croatian. Insertions were found to have the most pronounced tendency towards the end, deletions coming second.
These initial analyses are intended to provide a starting point for studies of more specific linguistic phenomena, as well as extralinguistic factors such as user age. In future work we also plan to focus on a lexical analysis of CMC, not captured in our normalisation guidelines, but shown in previous work (Fišer et al. 2015) to be very relevant for Croatian and Serbian, as they both display a higher percentage of lexical than structural non-standard forms.

standard Croatian. In a second step, we further zoom in on CMC data and compare the distribution of transformations by part of speech in Croatian and Serbian. The results of the comparison of part-of-speech distributions in the Twitter data are shown in Table 2. Both absolute and relative frequencies are shown; the LL column contains the values of the log likelihood statistic, which indicates the degree of significance of the difference between frequencies in Croatian and Serbian data; the +/- sign indicates over/under-use in Croatian compared to

Figure 1: Transformations in Croatian by position.

Figure 1 shows the results for Croatian. The overall trend seen in the first histogram is that transformations mostly occur at the word end, and barely ever at word beginning. Replacements, typically being due to omissions of diacritics, as well as some dialectal transformations, occur inside the word as well, although still more frequently at word end. Insertions have the strongest tendency towards the end of the word; a closer inspection of all strings shows that most insertions are in fact expansions via repetitions of the final vowel.

Figure 2: Transformations in Serbian by position.
Comparison of part-of-speech distribution in the Croatian and Serbian Twitter datasets and the standard Croatian hr500k dataset.
The results show that the biggest difference in the distribution of parts of speech between Croatian and Serbian CMC data lies in the residuals, a part of speech that, in addition to the standard non-classifiable residuals, covers foreign words, emoticons/smileys, hashtags, @ name mentions and URLs. Looking at specific types of residuals, the biggest difference is observed for URLs, which
Moving on to the PoS distributions in the two CMC datasets vs. the hr500k
⁴ We thank the two anonymous reviewers for underlining the relevance of these variables, of which age and account status (private vs. corporate) seem to be most promising in terms of data availability. Manual inspections of the corpus content so far indicate that more very young (secondary school age) Twitter users are found in Serbia than in Croatia, while more corporate accounts are present in the Croatian sample.
This is, of course, a very tentative claim, whose further discussion we leave for future work, in which variables such as the users' age, education level and socioeconomic status, as well as the private vs. corporate account status, need to be included.⁴ Among the remaining parts of speech, a substantial structurally motivated difference is observed on conjunctions, due mostly to da (that), whose relative frequency is twice as high in Serbian as in Croatian (see Table 4, section 4.2). Da is used in complex predicates in combination with the present tense in Serbian; in Croatian, verb infinitives are normally used instead of the da + present tense construction (Ser. mogu da uradim = Cro. mogu uraditi - I can do). As for the other PoS differences, they are mostly explained by the initial difference in the frequency of residuals.⁵ To check this, we recalculated the relative frequencies and the LL values after removing residuals and interjections (another CMC-specific part of speech), obtaining the following LLs: adjectives 16.74, conjunctions 168.15, numerals 69.54, nouns 0.73, particles -2.49, pronouns -62.36, prepositions 8.97, adverbs 37.16, verbs -11.92, abbreviations 62.57, punctuation 69.32. While many of the differences remain significant, most values become smaller, indicating that no linguistic factors beyond those already mentioned are at play.

Table 3 .
The overall percentage of tokens that were transformed is quite close in the two languages: 9.34% (8360) in Croatian and 8.57% (7910) in Serbian.
However, after the transformations due to diacritic omissions are discarded, we are left with 6.87% (6156) transformed tokens in Croatian and 3.81% (3511) transformed tokens in Serbian, which shows that diacritics are omitted more often in Serbian, while Croatian has a greater tendency towards non-standard forms beyond diacritic omission. The frequencies of transformed tokens by PoS shown in Table 3 are limited to those tokens that have undergone transformations other than diacritic omissions. As above, the log likelihood statistic is reported alongside the frequencies. The highest percentage of transformed tokens is found among interjections (mostly due to vowel or syllable repetitions, as in Hahahahaha), abbreviations (mostly due to omissions of the final punctuation, as in god instead of god. for godina - year), and particles. The most frequently transformed particles with the corresponding absolute frequencies in Croatian and Serbian are jel (shortened from je li - is it, 82 vs. 73), nebi (shortened from ne bi(h) - would not, 16 vs. 7), dal (shortened from da li - would it, 12 vs. 4), nek (neka - let it, is mostly due to the non-standard ko often being used in Croatian instead of the standard tko - who (also in compounds such as ne(t)ko - somebody), and šta being used instead of što (what), where in Serbian ko and šta are the standard forms. The only two parts of speech that undergo significantly more transformations in Serbian are abbreviations and residuals, the latter possibly due to Croatian containing more URLs, hashtags and @ name mentions, which were not normalised. Among the open part-of-speech classes most transformations happen among verbs (in particular the auxiliary/copula biti - be; see Table 5 in section 4.2) and adverbs, once again much more frequently in Croatian than in Serbian, as evidenced by very high LL values; one possible reason is the frequent shortening of infinitives in Croatian (e.g. gledat for gledati - watch), which is highly atypical for Serbian. Nouns come next, with a similar percentage of transformed forms in the two languages. Adjectives are placed last and are only slightly more frequently transformed in Croatian than in Serbian, with the difference not reaching significance.
this issue through Levenshtein transformations, we focus on a lemma-based analysis.
Comparison of transformation distributions in Croatian and Serbian, with and without (-d) diacritic omission.
We next analyse the most frequent specific transformations by language. In Table 8 we show the top 10 transformations per Levenshtein transformation type, separately for Croatian and Serbian.
The 10 most frequent transformations by language and type.
As expected, the most frequent deletions in both languages are those of vowels, but with some exceptions as well. In Croatian the most frequent cases are deletions of i (as in al for ali - but, and il for ili - or), the dot (either within punctuation ..., or in abbreviations, as in npr for npr. - e.g.), the space (due to the merging of words such as jel for je li - is it, or nezz for ne znam - I don't know), a (in shortenings such as ko for kao - like and nek for neka - let it), j (due to the use of the Ikavian yat reflex, as in di for gdje - where, or uvik for uvijek - always), and e (in shortenings such as bu for bude - will be, or ajd for hajde - come on). In Serbian, the most frequent deletions are those of e (in shortenings like aj for ajde - come on, or jbg for jebi ga - fuck), a (in shortened forms such as ko for kao - like, or reko for rekao - said), i (in jel for je li - is it, al for ali - but, or msm for mislim - I think), the space (in merged words like jel for je li - is it, or ustvari for u stvari - actually), and o (in shortenings like jbt for jebote - fuck, fb for facebook and bi for bismo - we would). This analysis indicates that in Croatian deletions are more frequent on high frequency words, while Serbian shows a tendency towards shortening frequently co-occurring terms or phrases. Insertions in both languages are mostly due to interjections, and some lexical words, containing repeated syllables (e.g. hahahahaha), or repeated vowels (as in vodiiiiiiii - leads). As for replacements, while in Serbian they mostly cover the omission of diacritics and the marking of vowel omissions with an apostrophe (as in je l' for je li - is it, or ost'o for ostao - he stayed, a phenomenon virtually non-existent in Croatian), in Croatian there are three additional frequent cases: e-i (due to the use of the Ikavian yat reflex, as in vitar for vjetar - wind), o-a (in the substandard pronoun variant šta (što - what), and the southern dialectal endings of present participles like pogodia (pogodio - he hit) and falija (falio - lacked)), and m-n (transformation of the standard ending m in the southern dialect, as in san (sam - I am) or van (vam - to you),