Distribution of Letters in Oromiffa Text


by
Demessie G Yahii*
(published in the Journal of Oromo Studies nearly 30 years ago!)

Introduction

The distribution of letters in a written language has long been of interest to linguists and information scientists alike. If the distribution of the letters of the alphabet is computed for a representative sample of a written language, then some statistical regularity [1] can be observed. This is because every language has a unique set of sound (phonetic) attributes that makes it distinct from other languages, and letters or combinations thereof are merely symbols for representing these basic sound elements [2].

In a typical English text, for instance, the letter E is the most frequent vowel (about 10%) while the letter T is the dominant consonant (about 8%). In general, different languages tend to have different distributions of dominant vowel and consonant letters due to the underlying phonetic and orthographic differences. By determining the distribution of all letters in a written language, it is possible to provide a language profile that can to some extent be used to distinguish one language from another.

The distribution of letters in a language per se also has many practical applications, particularly in information technology systems for manipulating information by way of compression, encryption, transmission and the like. Although most practical applications came about recently with the use of computers, the earliest application goes back well before the development of computers. The Morse code, devised over a century ago for telegraphic transmission, was based on the statistical average of letters of the English alphabet in order to minimise the overall transmission time of a text message. To make this possible, the most frequent letters (such as E and T) were represented with shorter codes whereas longer codes were used for less frequent letters (such as Q and Z).
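The principle can be illustrated with the four letters named above. The sketch below assumes the International Morse code variant (the text does not say which variant is meant) and simply compares the number of dots and dashes per letter; the names MORSE and code_length are chosen here for illustration only.

```python
# International Morse code for the four letters named in the text.
# Assumption: the international variant; the article does not specify one.
MORSE = {"E": ".", "T": "-", "Q": "--.-", "Z": "--.."}


def code_length(letter: str) -> int:
    """Return the number of dots and dashes used to send one letter."""
    return len(MORSE[letter.upper()])


lengths = {letter: code_length(letter) for letter in MORSE}
print(lengths)  # {'E': 1, 'T': 1, 'Q': 4, 'Z': 4}
```

The frequent letters E and T take a single signal element each, while the rare letters Q and Z take four.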

This article presents the distribution of letters in Oromiffa (the Oromo language). The Latin-based Oromiffa alphabet, or Qubee, has now been in use nationwide for over two years. Over this period, various Oromiffa publications have appeared, and these have helped the orthography develop and mature over a short period of time. Oromiffa is transcribed almost phonetically, as described by T. Gamta [3]. Here, a brief overview of the basic principles is given so that the reader can easily grasp how the distribution of letters relates to the underlying phonetic transcription rules and understand the results and comments made in later sections.

Overview of Oromiffa Transcription

Oromiffa has 34 basic sounds (phonemes) comprising the 10 vowel phonemes (Table 1) and the 24 consonant phonemes (Table 2) below [4]. The 10 vowel phonemes are actually made up of five basic vowel sounds, each with a short and a long phoneme. This linguistic property coincidentally makes a perfect match with the Latin alphabet [5]: the short vowel phonemes are represented with each of the five vowel letters while the long vowel phonemes are represented by doubling the vowel letters as shown in Table 1.

Single-letter consonant symbols have the usual English sound except for C, Q and X which are used to represent different sounds in Oromiffa. The digraphs CH and SH are also as in English while DH, NY and PH represent different sounds.

Each of the consonant sounds can be weak or strong [6] and, analogous to the short and long vowels, weak consonants are represented by a single symbol while stressed consonant sounds use a double symbol. In the case of digraphs, only the first letter is doubled for stressed consonants, for instance, when the DH sound is stressed it is written as DDH.
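The doubling rule can be written down in a few lines. This is a minimal sketch, not part of the original study; the function name double_symbol is hypothetical, and the exceptions mentioned in note 6 (CH and NY) are not handled.

```python
# Oromiffa digraphs as listed in the text (Table 2).
DIGRAPHS = {"ch", "dh", "ny", "ph", "sh"}


def double_symbol(symbol: str) -> str:
    """Return the written form of a long vowel or a strong consonant.

    Single letters are doubled; for a digraph only the first letter
    is doubled, for example "dh" becomes "ddh".
    """
    symbol = symbol.lower()
    if symbol in DIGRAPHS:
        return symbol[0] + symbol
    if len(symbol) == 1:
        return symbol * 2
    raise ValueError(f"not a single letter or known digraph: {symbol!r}")


print(double_symbol("a"))   # aa
print(double_symbol("t"))   # tt
print(double_symbol("dh"))  # ddh
```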

Short   Long
a       aa
e       ee
i       ii
o       oo
u       uu

Table 1: Oromiffa vowel sounds (Total: 10)

Oromiffa has a considerable number of glottal stops (see Table 2). An apostrophe, and less commonly a hyphen, is used to represent this sound in writing. Sometimes an H, which represents the closest glottal sound, is also used in place of an apostrophe. For a reason that will become apparent later, the apostrophe is treated as a distinct symbol (say, as the 27th letter of the alphabet) in the analysis presented here.

Symbol  Comments (sounds are as in English unless stated otherwise)
b
c       glottalized palatal (never as s or k)
ch
d
dh      glottalized dental
f
g       always sounds as in green, never as in general where j is used
h
j
k
l
m
n
ny      palatal nasal (as in Spanish "señor" for "Mr")
ph      bilabial ejective
q       velar ejective
r
s
sh
t
w
x       dental ejective
y
'       glottal stop as in "a'a" to mean "no", often written as "uh-uh"

Table 2: Oromiffa consonant sounds (Total: 24)

Symbol  Remarks
jh      sounds as "su" in measure
kh      sounds as "ch" in the German proper name Bach
p       example: papayyaa, poolisii, etc
ts      sounds as in Amharic tsehay meaning sun
v       example: vayoliinii, vazeliinii, etc
z       example: zayitii, zeeroo, zaytuunaa, etc

Table 3: Foreign consonant sounds (Total: 6)

Oromiffa does not have the sounds represented by the letters P, V, and Z in English. These letters and three additional digraphs (jh, kh, and ts) representing non-Oromiffa sounds (Table 3) provide an almost complete set of symbols that not only allows foreign words to be transcribed but also facilitates the transliteration of other languages [7].

The above summarises the basic orthographic symbols used for the phonetic transcription of Oromiffa. Non-standard orthographic symbols, such as numerals and signs like $, are not considered in the analysis; their frequency is negligibly small anyway. It is also worth mentioning that Oromiffa uses the same punctuation signs as English, and these too are excluded from the analysis.

Data Collection and Results

Oromiffa texts of various articles which appeared in a cross-section of magazines [8] were first scanned onto a computer. Almost all texts contain a small proportion of numerals, abbreviations and acronyms and no regularisation of the orthography was necessary. There were some spelling and orthographic errors in the texts [9] but the effect of these on the overall distribution of letters was found to be almost negligible.
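The counting step implied by the description above can be sketched as follows. The author's actual tooling is not described, so this is only one plausible reading: it counts the 26 letters plus the apostrophe (the "27th letter"), ignores digits and punctuation, and reports percentages. The function name letter_distribution and the sample phrase (built from words quoted in the end notes) are illustrative assumptions.

```python
from collections import Counter
import string

# The 26 Latin letters plus the apostrophe, treated as the 27th symbol.
SYMBOLS = string.ascii_uppercase + "'"


def letter_distribution(text: str) -> dict[str, float]:
    """Return the percentage of each counted symbol in the text."""
    # Assumption: a typographic apostrophe is the same symbol as a plain one.
    # A closing single quotation mark would be miscounted by this sketch.
    normalised = text.upper().replace("\u2019", "'")
    counts = Counter(ch for ch in normalised if ch in SYMBOLS)
    total = sum(counts.values())
    if total == 0:
        return {symbol: 0.0 for symbol in SYMBOLS}
    return {symbol: 100.0 * counts[symbol] / total for symbol in SYMBOLS}


sample = "Finfinnee irra har'a"
dist = letter_distribution(sample)
for symbol, percent in sorted(dist.items(), key=lambda item: -item[1])[:3]:
    print(symbol, round(percent, 1))
```

A real profile needs a large and representative sample; a single phrase like the one above only shows the mechanics.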

Letter  Frequency (%)   Letter  Frequency (%)
A       23.6            N       6.5
B       2.8             O       5.1
C       1.0             P       0.1
D       3.7             Q       0.8
E       6.6             R       4.0
F       2.5             S       3.2
G       2.0             T       4.8
H *     3.4 (4.2)       U       6.4
I       9.1             V       0.0
J       1.1             W       1.0
K       3.3             X       0.2
L       2.4             Y       1.4
M       4.0             Z       0.0
                        '       0.8

Table 4: Distribution of letters in Oromiffa text

* Note: The distribution of H will increase to 4.2% if it is also used for glottal stops as proposed.

The above table summarises the average percentage of letters in Oromiffa text. The most frequent vowel letter, quite predictably, is A while the most frequent consonant is N. These contrast, respectively, with E and T in English text. It is interesting to note that the glottal stop represented by an apostrophe is more frequent than the letter X, which is at the bottom of the list. As mentioned earlier, Oromiffa does not have the sounds represented by P, V and Z. Unlike V and Z, however, P has a non-zero count because it is used with H to form the Oromiffa digraph PH.

Some Observations

Oromiffa text is dominated by vowel letters because both short and long vowels are represented explicitly for phonetic transcription. Table 4 shows that the vowel letters together account for about 50% of the text. Although this seems a bit strange and inconvenient, in practice it is very easy to learn and, most importantly, it has an advantage that outweighs other (non-phonetic) alternatives. Because words are written and read the way they sound, word spellings and pronunciations need not be memorised. This is the case because phonetic transcription is a rule-based technique for writing and reading unambiguously. One can appreciate this advantage by contrasting it with the effort required for memorising word spellings and pronunciations in the English language. Phonetic transcription eases language learning by shifting the focus to word semantics and grammar rules rather than word spellings and pronunciations, which can be worked out with simple rules without memorising them [10].
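The "about 50%" figure follows directly from the five vowel rows of Table 4, as this short check shows (the variable names are illustrative).

```python
# Vowel percentages copied from Table 4.
VOWEL_PERCENT = {"A": 23.6, "E": 6.6, "I": 9.1, "O": 5.1, "U": 6.4}

vowel_share = round(sum(VOWEL_PERCENT.values()), 1)
print(vowel_share)  # 50.8
```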

At the outset an attempt was made to determine the distribution of letters for prose text and poem text separately. This is because the latter has a rhythmic arrangement of syllables and alliterations which tend to use more vowels. As it turned out, the only significant difference was that the frequency of the letter E increases from 6.6% in prose text to about 9% in poem text. This can be explained by the fact that many lines of verse end in a double E.

Other useful information can also be extracted from the result summary with some caution. For example, the most dominant consonant sound in Oromiffa is /n/. To be certain of this, however, the contribution of N to the /ny/ sound should also be checked, and this is indeed relatively small, as bounded by the frequency of the letter Y. On the other hand, the frequency of H does not tell us much about the /h/ sound since it is also used in digraphs such as CH and DH representing other sounds. With regard to sound distribution, the result can only be used as a rough guide. For a more accurate sound distribution, the sound symbols can easily be analysed in the same way by counting the phonemes shown in Tables 1 and 2 rather than the letters of the alphabet as reported here.
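A phoneme-level count needs words split into the symbols of Tables 1 and 2 instead of single letters. The sketch below is one hedged way to do that, not the author's procedure: it reads long vowels and digraphs as single units and, as an assumption of this sketch, collapses a doubled (strong) consonant into one occurrence of the underlying phoneme. The name split_phonemes is hypothetical.

```python
DIGRAPHS = {"ch", "dh", "ny", "ph", "sh"}
VOWELS = "aeiou"


def split_phonemes(word: str) -> list[str]:
    """Split a word into vowel, consonant and glottal-stop symbols."""
    word = word.lower().replace("\u2019", "'")
    units = []
    i = 0
    while i < len(word):
        ch = word[i]
        if not (ch.isalpha() or ch == "'"):
            i += 1  # skip digits, spaces and punctuation
            continue
        # Assumption: a doubled consonant (including "ddh") counts once,
        # so the first letter of the pair is skipped.
        if ch not in VOWELS and word[i + 1:i + 2] == ch:
            i += 1
            continue
        pair = word[i:i + 2]
        if pair in DIGRAPHS or (ch in VOWELS and pair == ch * 2):
            units.append(pair)
            i += 2
        else:
            units.append(ch)
            i += 1
    return units


print(split_phonemes("sangaa"))   # ['s', 'a', 'n', 'g', 'aa']
print(split_phonemes("many'ee"))  # ['m', 'a', 'ny', "'", 'ee']
```

Feeding these units to a counter in place of single letters would give a distribution over the symbols of Tables 1 and 2. A greedy split like this cannot tell a true digraph from two adjacent consonants that happen to spell one, so its output is still an approximation.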

Another interesting observation is the distribution of letters with regard to their positions on the QWERTY keyboard layout. The first four (or six) most frequent letters accounting for over 40% (or over 50%) are almost evenly distributed between the left-hand side and the right-hand side of the keyboard which is a desirable arrangement for professional typists.
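The keyboard observation can be checked against Table 4. The sketch below assumes the conventional touch-typing split of a QWERTY keyboard, with Q W E R T A S D F G Z X C V B on the left hand and everything else on the right; that split and the names TABLE_4, LEFT_HAND and hand_split are assumptions of this example, not taken from the article.

```python
# Percentages copied from Table 4 (H without the proposed glottal-stop use).
TABLE_4 = {
    "A": 23.6, "B": 2.8, "C": 1.0, "D": 3.7, "E": 6.6, "F": 2.5, "G": 2.0,
    "H": 3.4, "I": 9.1, "J": 1.1, "K": 3.3, "L": 2.4, "M": 4.0, "N": 6.5,
    "O": 5.1, "P": 0.1, "Q": 0.8, "R": 4.0, "S": 3.2, "T": 4.8, "U": 6.4,
    "V": 0.0, "W": 1.0, "X": 0.2, "Y": 1.4, "Z": 0.0, "'": 0.8,
}

# Assumption: conventional touch-typing assignment for the left hand.
LEFT_HAND = set("QWERTASDFGZXCVB")


def hand_split(top_n: int) -> dict[str, tuple[list[str], float]]:
    """Split the top_n most frequent letters by hand, with their share."""
    ranked = sorted(TABLE_4, key=TABLE_4.get, reverse=True)[:top_n]
    left = [letter for letter in ranked if letter in LEFT_HAND]
    right = [letter for letter in ranked if letter not in LEFT_HAND]

    def share(letters: list[str]) -> float:
        return round(sum(TABLE_4[letter] for letter in letters), 1)

    return {"left": (left, share(left)), "right": (right, share(right))}


print(hand_split(4))
print(hand_split(6))
```

With these assumptions the top four letters (A, I, E, N; 45.8% in total) fall two per hand, and for the top six the left hand carries 30.2% of the text against 27.1% for the right.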

The apostrophe used as a symbol for glottal stops accounts for nearly 1% of Oromiffa text, as shown in Table 4. The fact that it is not a letter means that it is treated differently, and this makes it a bit awkward in writing. When a glottal sound is stressed, double symbols must be used if the rule of phonetic transcription is to be followed, and the resulting double apostrophe can be very confusing [11].

The problem of the apostrophe also arises in computer systems, where it is usually treated as a special symbol in certain cases. The consequence is that, unlike letters, apostrophes cannot be used arbitrarily, for instance in names for users, computer files or program variables [12]. Hence names with apostrophes can be discriminated against. It is, therefore, desirable to eliminate the apostrophe, preferably by replacing it with a letter.
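Python is used below purely as an illustration (the article names no specific programming language). Its built-in str.isidentifier() reports whether a string may be used as a variable name, and a word containing an apostrophe fails the test while the same word spelled with H passes.

```python
# "har'a" (today) is quoted in the end notes; "harha" is the same word
# with the apostrophe replaced by H.
candidates = ["har'a", "harha"]

usable = {name: name.isidentifier() for name in candidates}
print(usable)  # {"har'a": False, 'harha': True}
```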

A Recommendation for Eliminating the Apostrophe

The problem of glottal stops is not peculiar to Oromiffa. London Cockney, a dialect of English, for example, is very rich in glottal stops since most /t/ sounds are pronounced as glottal stops, as in “butter”. In an attempt to reflect this phonetic feature, Barltrop & Wolveridge [13] proposed the exclamation sign (!) for glottal stops, although this has never been put into practice. The use of (!), however, presents the same problem as the apostrophe described above, so this experience offers no solution and is of historical interest only.

An ideal solution to the problem of glottal stops is to use a letter symbol. Unfortunately, there is no spare letter that can be freely assigned and the use of double letters like CH or DH may not be that attractive either. Instead, an attempt has been made here to look more into a linguistic property of Oromiffa that is suggestive of a letter symbol for a glottal stop.

A close investigation of basic Oromiffa sounds reveals that the sound /h/ occurs only at the beginning of a word [14]. On the other hand, the glottal stop /?/ (represented by an apostrophe in writing) occurs within and not at the beginning of a word. In fact, an apostrophe representing a glottal stop is in most cases surrounded by vowel letters [15]. In this sense, the /h/ and /?/ sounds are in complementary distribution, that is, they occur in different positions within a word. This suggests that the letter H can be used for both sounds, by treating it as the /h/ sound at the beginning of a word and as a glottal stop /?/ otherwise. This is the only rule that needs to be observed and it has no pedagogic or other problems [16]. If this recommendation is put into practice, the distribution of the letter H in the above result summary (Table 4) will increase to 4.2%.
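The proposed rule can be sketched in both directions: writing (apostrophe becomes H) and reading (word-initial H is /h/, any other H is a glottal stop). One detail is an assumption of this sketch rather than something the article states: an H that directly follows C, D, P or S is taken to be part of a digraph (CH, DH, PH, SH) and is labelled separately. The function names are hypothetical.

```python
# Assumption: h after these letters belongs to a digraph (ch, dh, ph, sh).
DIGRAPH_FIRST = set("cdps")


def apostrophe_to_h(word: str) -> str:
    """Rewrite a word so that glottal stops are spelled with h."""
    return word.replace("\u2019", "'").replace("'", "h")


def read_h(word: str) -> list[str]:
    """Label every h in a word spelled according to the proposed rule."""
    word = word.lower()
    labels = []
    for i, ch in enumerate(word):
        if ch != "h":
            continue
        if i == 0:
            labels.append("/h/")      # word-initial: the /h/ sound
        elif word[i - 1] in DIGRAPH_FIRST:
            labels.append("digraph")  # part of ch, dh, ph or sh
        else:
            labels.append("/?/")      # elsewhere: glottal stop
    return labels


print(apostrophe_to_h("ta'e"))   # tahe
print(apostrophe_to_h("har'a"))  # harha
print(read_h("harha"))           # ['/h/', '/?/']

# Table 4: H at 3.4% plus the apostrophe at 0.8% gives the quoted 4.2%.
print(round(3.4 + 0.8, 1))       # 4.2
```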

Conclusion

The distribution of letters in Oromiffa text was analysed and summary results presented. This provides more insights into written Oromiffa, particularly considering its recent alphabetisation. Such investigations can provide the basis for many interesting applications in linguistics and information processing systems.

The use of the letter H has been proposed for glottal stops in addition to the /h/ sound. This is possible because in Oromiffa the /h/ sound and the glottal stop are in complementary distribution within a word. This eliminates the problems associated with the use of the apostrophe symbol. It is hoped that writers and educators alike will adopt this recommendation immediately.

End Notes

  1. This is confined to languages based on the Latin alphabet although it may also be extended to languages not using the Latin alphabet.
  2. Note, however, that the distribution of letters does not necessarily correspond to the distribution of sounds. For instance, the count of the letter D in Oromiffa comes from both the /d/ and /dh/ sounds, and it does not on its own tell us how these sounds are distributed. The statistical regularity of D in a text simply tells us the statistical regularity of all the sounds it represents, in this case /d/ and /dh/. The distribution of individual sound elements or phonemes can be analysed in the same way, but this time digraphs such as CH and DH are counted separately from C and D. A further comment may be necessary for cases where letters or combinations thereof are silent, for example, as in English, but we restrict ourselves to phonetic transcription as in Oromiffa.
  3. Gamta, Tilahun: “Qubee Afaan Oromoo: Reasons for choosing the Latin Script for developing an Oromo alphabet,” Oromo Commentary, Vol III, No 1, 1993.
  4. This figure does not account for allophones (variants of basic sounds) such as the variation of consonant sounds of /n/ as in “nama” (man) versus “sangaa” (ox), or different vowel sounds of the first and last /a/ in “nama”. See also T Gamta, Oromo-English Dictionary, 1989.
  5. In this respect the Latin alphabet serves the Oromo language better than the English language. One of the principal problems in transcribing English phonetically is that there are many more vowel sounds than there are vowel letters. One widely spoken British accent, known as Received Pronunciation, has twenty vowel sounds, and American accents and other British accents have over twenty vowel sounds. See Ladefoged, Peter: A Course in Phonetics, 2nd Ed, Harcourt Brace Jovanovich, Publishers, 1982.
  6. There are a few exceptions: CH and NY are always strong. Hence, it would not be necessary to apply the rule for strong consonant phonemes.
  7. This can be very useful while learning other languages.
  8. Magazines include Madda Walaabuu, Odaa, and Qunnamttii among other sources.
  9. Spelling mistakes such as “wa’ee” for “wayee”; and orthographic variations such as “Finfinnera” for “Finfinnee irra” and “isartatti” for “isa irratti”. These variations are and will be fast evolving into standardised forms as has been observed over the last two years.
  10. It is understandable that even with phonetic transcription, memorisation is still there, arising instinctively with repeated usage, but this is an uncompelled and voluntary process and not one forced upon the learner.
  11. This can arise in words such as Oo??a for hot. The use of the apostrophe will result either in Oo’a, which does not indicate the stress of the consonant, or in Oo’’a, which is confusing because the double apostrophe resembles a quotation mark.
  12. Such problems can arise in programming languages where object names or labels may not use the apostrophe symbol. At the user level, the Unix and to a lesser degree the DOS operating systems also have some restrictions in this respect.
  13. Barltrop & Wolveridge: The Muvver Tongue.
  14. Extensive analysis was carried out in an attempt to disprove this claim. The /h/ sounds found in positions other than the beginning of a word are those which should have been written as glottal stops or as the /y/ sound, as in “tahe” for “ta’e” and “dhiha” for “dhiya”.
  15. In a few cases the glottal stop can occur after a consonant as in “har’a” for “today” or “many’ee” for “joint”.
  16. In common with other letters such as C or X, H will of course retain its /h/ sound in non-Oromiffa words. In fact, since /h/ and /?/ are both glottal sounds, no major sound variation would be noticed even if this rule is overlooked for non-Oromiffa words. In English, the words that can create this problem (words with h’s surrounded by vowels) are very rare.

* Demessie G Yahii, PhD, is a consultant in Computer and Communication Systems and has published several articles in the area. His other areas of interest include the application of computers to linguistics and the study of Oromiffa for scientific and technical fields. Dr Yahii resides in London.
