Charly Moerth, Stephan Procházka
Vienna 2017
5.1 Morphosyntactic annotations
6.1 Translating lemmas and multi-word units
7.10 Invariable nouns and adjectives
10.1 One entry or two entries?
10.5 Constructions vs. sample sentences
12 Sources and responsibilities
The examples in the following guidelines are taken from dictionaries that are being produced as part of the VICAV programme. These are A Machine-readable Dictionary of Egyptian Arabic, A machine-readable dictionary of Rabat Arabic, A Digital Dictionary of Tunis Arabic, A digital dictionary of Damascus Arabic and A machine-readable dictionary of Modern Standard Arabic.
The VICAV dictionaries are encoded according to the Guidelines of the Text Encoding Initiative (P5). They are conceptualised as a specific type of text and are therefore encoded with text elements. Each dictionary starts with a teiHeader element which contains the metadata of the dictionary.
The lexicographic data are placed in typed div elements. Thus, our TEI dictionaries basically look like this:
<TEI version="5.0"><teiHeader>
...
</teiHeader><text><body><div type="entries"><entry>...</entry><entry>...</entry><entry>...</entry>
...
...
...
</div></body></text></TEI>
The body of the VICAV dictionaries can not only contain simple entries but also examples which are encoded in cit/quote constructs. The rationale behind keeping example sentences outside the entries is to be able to reuse them in different parts of the dictionary (See below: Examples and Creating examples).
<body><div type="entries"><entry>...</entry><entry>...</entry><entry>...</entry>
...
...
...
</div><div type="examples"><cit type="example">...</cit><cit type="example">...</cit><cit type="example">...</cit>
...
...
...
</div></body>
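As a rough illustration of this layout, the following Python sketch counts the records in each typed div. It assumes a namespace-free skeleton like the snippets above; files that declare the TEI namespace would need qualified element names. The `SKELETON` string and the `count_records` helper are hypothetical and not part of the VICAV tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free skeleton modelled on the structure shown above.
SKELETON = """<TEI version="5.0"><teiHeader/><text><body>
<div type="entries"><entry/><entry/><entry/></div>
<div type="examples"><cit type="example"/><cit type="example"/></div>
</body></text></TEI>"""


def count_records(root):
    """Map each div type to the number of its direct child records."""
    counts = {}
    for div in root.iter("div"):
        counts[div.get("type")] = len(list(div))
    return counts


counts = count_records(ET.fromstring(SKELETON))
```

Here `counts` maps `entries` to 3 and `examples` to 2, mirroring the two typed div elements.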
Character encoding is based on Unicode (UTF-8).
There are three types of entries: lemmas, multi-word units (MWUs) and examples.
Lemmas and MWUs basically have the same structure.
<entry xml:id="kitaab_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">kitāb</orth></form>
...
</entry>
<entry xml:id="fi_0"><form type="multiWordUnit"><orth xml:lang="ar-arz-x-cairo-vicavTrans">fi baḥr</orth></form>
...
</entry>
As can be seen in the examples above, entries are assigned a unique identifier. In the VICAV dictionaries, these are made up of characters restricted to the ASCII range (letters, digits and underscores, as in the examples above). Usually, the IDs are created by pressing Ctrl + I. If VLE (the editor tool) creates an ID that already exists in the database, the entry cannot be saved. In such cases, the user has to modify the ID manually, for example by increasing the number at the end of the ID.
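The manual fix for an ID collision can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `next_free_id` and the set of existing IDs are hypothetical, and VLE's own ID generation is not documented here.

```python
import re


def next_free_id(candidate, existing):
    """Increase the trailing number of an ID until it no longer collides.

    Hypothetical helper: assumes IDs are ASCII and end in a number,
    as in kitaab_001 or fi_0.
    """
    if not candidate.isascii():
        raise ValueError("IDs are expected to be ASCII only")
    match = re.fullmatch(r"(.*?)(\d+)", candidate)
    if match is None:
        raise ValueError("ID has no trailing number to increase")
    stem, digits = match.groups()
    number, width = int(digits), len(digits)
    while candidate in existing:
        number += 1
        # Keep the zero padding of the original number.
        candidate = f"{stem}{number:0{width}d}"
    return candidate
```

With `kitaab_001` already taken, the sketch yields `kitaab_002`; an unused ID is returned unchanged.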
Examples are encoded making use of a cit/quote construction.
<cit xml:id="yibqa_ustaz_001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">ḥayibʔa ustāz in šāʔ allāh.</quote>
...
</cit>
Ideally, examples should consist of complete sentences. Examples should be concise, but can also contain several sentences. If dialogical models are involved, sentences are to be separated by a dash.
...
<quote xml:lang="ar-arz-x-cairo-vicavTrans">tislam idēki. - ʔaḷḷāh yisallimak.</quote>
...
Proverbs are a subtype of example.
<cit xml:id="il_cagala_min_ish_shitaan_001" type="example" subtype="proverb"><quote xml:lang="ar-arz-x-cairo-vicavTrans">il-ʕagala min iš-šiṭān.</quote><cit type="translation" xml:lang="en"><quote>Haste makes waste.</quote></cit></cit>
There are two types of form elements: lemmas and wordforms. Nominals are usually furnished with plural forms, verbs with the third person singular present tense.
<entry xml:id="balad_0"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">balad</orth></form>
...
<form type="inflected" ana="#n_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">bilād</orth></form>
...
</entry>
The different morphological forms are encoded through labels in the ana “analytic” attribute. Examples of frequent labels are:
Value | Meaning |
#adj_f | feminine form of an adjective |
#adj_pl | plural of an adjective |
#n_constructState | construct state of a noun |
#n_pl | plural of a noun |
#n_unit | nomen unitatis |
#v_pres_sg_p3 | 3rd person singular present tense |
#v_vn | verbal noun |
An example with a verb:
...
<form type="inflected" ana="#v_pres_sg_p3"><orth xml:lang="ar-apc-x-damascus-vicavTrans">yǝnšor</orth></form>
...
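The ana labels make it straightforward to read inflected forms back out of an entry. The following Python sketch groups the inflected forms of one entry by label. It assumes namespace-free markup as in the snippets above; the `SAMPLE` string is a shortened copy of the balad entry and `inflected_forms` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the balad entry shown above.
SAMPLE = """<entry xml:id="balad_0">
<form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">balad</orth></form>
<form type="inflected" ana="#n_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">bilād</orth></form>
</entry>"""


def inflected_forms(entry):
    """Group the inflected forms of an entry by their ana label.

    Lists are used because several forms may share one label.
    """
    forms = {}
    for form in entry.findall("form"):
        if form.get("type") != "inflected":
            continue
        label = form.get("ana", "").lstrip("#")
        forms.setdefault(label, []).append(form.findtext("orth"))
    return forms


forms = inflected_forms(ET.fromstring(SAMPLE))
```

Only direct form children of the entry are inspected, so variants nested inside the lemma form are not mistaken for inflected forms.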
The status constructus of a noun can be registered like this:
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">mara</orth></form><gramGrp><gram type="pos">noun</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">mrʔ</gram></gramGrp><form type="inflected" ana="#n_constructState"><orth xml:lang="ar-arz-x-cairo-vicavTrans">mirāt</orth></form> ...
Only headwords may have variants. These are encoded as typed forms nested in the top-level form of the entry.
...
<form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">tžawwaz</orth><form type="variant"><orth xml:lang="ar-apc-x-damascus-vicavTrans">dzawwaž</orth></form></form>
...
This is the only position where the form element may have the type="variant" attribute. All other variants are simply listed but not classified. In the following example two competing morphological forms are listed.
... <form type="inflected" ana="#v_pres_sg_p3"><orth xml:lang="ar-apc-x-damascus-vicavTrans">yǝṣal</orth></form><form type="inflected" ana="#v_pres_sg_p3"><orth xml:lang="ar-apc-x-damascus-vicavTrans">yūṣal</orth></form> ...
The next example shows alternative plural forms.
... <form type="inflected" ana="#n_pl"><orth xml:lang="ar-apc-x-damascus-vicavTrans">kǝnaz</orth></form><form type="inflected" ana="#n_pl"><orth xml:lang="ar-apc-x-damascus-vicavTrans">kanzāt</orth></form> ...
Variants can be assigned usage labels indicating e.g. a particular register. The more frequent variant should precede less frequent ones.
Translations of lemmas and MWUs are given in sense elements.
<entry xml:id="bard_0"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">bard</orth></form>
...
<sense><cit type="translation" xml:lang="en"><quote>coldness</quote></cit><cit type="translation" xml:lang="de"><quote>Kälte</quote></cit></sense>
...
</entry>
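To show how such a sense can be consumed, here is a minimal Python sketch that collects the translations of a sense per target language. It assumes namespace-free markup as in the snippet above; `SENSE` copies the sense of the bard entry and `translations_by_language` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The sense of the bard entry shown above.
SENSE = """<sense>
<cit type="translation" xml:lang="en"><quote>coldness</quote></cit>
<cit type="translation" xml:lang="de"><quote>Kälte</quote></cit>
</sense>"""


def translations_by_language(sense):
    """Collect the translation equivalents of a sense, keyed by xml:lang."""
    result = {}
    for cit in sense.findall("cit"):
        if cit.get("type") == "translation":
            result.setdefault(cit.get(XML_LANG), []).append(cit.findtext("quote"))
    return result


translations = translations_by_language(ET.fromstring(SENSE))
```

The lists keep the order of the cit elements, so several equivalents in one language stay in the sequence chosen by the lexicographer.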
Semantically unrelated homophones or items with clearly differing semantics have to be documented with several sense elements. In the following example, the Egyptian lemma balad is represented with two senses.
... <sense><cit type="translation" xml:lang="en"><quote>country</quote></cit><cit type="translation" xml:lang="en"><quote>land</quote></cit> ... </sense><sense><cit type="translation" xml:lang="en"><quote>city</quote></cit><cit type="translation" xml:lang="en"><quote>town</quote></cit> ... </sense> ...
Another example is rās:
... <sense><cit type="translation" xml:lang="de"><quote>Kopf</quote></cit></sense><sense><cit type="translation" xml:lang="de"><quote>Anfang</quote></cit></sense> ...
Ambiguous translations are often made explicit by additional information narrowing down the semantic scope of the particular item.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">kibīr</orth></form><sense><cit type="translation" xml:lang="en"><quote>big</quote></cit></sense><sense><cit type="translation" xml:lang="en"><quote>old <seg type="hint">of persons</seg></quote></cit></sense> ...
Translation equivalents of examples are indicated in cit/quote constructions.
<cit xml:id="tislam_ideeki__001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">tislam idēki! - ʔaḷḷāh yisallimak.</quote><cit type="translation" xml:lang="en"><quote>Thank you! - Not at all.</quote></cit></cit>
Sometimes it is necessary to add literal translations. This is handled in an analogous manner.
<cit xml:id="tislam_ideeki__001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">tislam idēki! - ʔaḷḷāh yisallimak.</quote><cit type="translation" xml:lang="en"><quote>Thank you! - Not at all.</quote></cit><cit type="literalTranslation" xml:lang="en"><quote>May your hands be healthy! - May God keep you healthy.</quote></cit></cit>
When the translation of a term is not very common or easily understandable in the target language, it is common practice to explain the item instead of or in addition to the translation. Explanations can be understood as same-language ‛translations’. In TEI, the def “definition” element is used to encode this part of a dictionary entry.
...
<sense><def xml:lang="en">a sweet dessert made of semolina, butter, sugar and rosewater</def><def xml:lang="de">Süßigkeit aus Gries, Butter, Zucker und Rosenwasser</def><cit type="translation" xml:lang="en"><quote>Basbusa</quote></cit></sense>
...
Very often lexical items are particular to the culture of the source language and do not have adequate equivalents in a target language. In such cases, it is important not to enter definitions or explanations in the cit/quote element. Wherever possible, we have tried to furnish translations (very often transliterations) even though they might not be very common in the target language. Explanations have to go into the def element.
The above example shows such a case. In principle, def can be used to encode any information related to ‛meaning’ that does not qualify as a translation in the narrower sense. In the following example the def element simply explains what the place name stands for.
...
<sense><def xml:lang="en">the southernmost of Egypt’s western oases</def><cit type="translation" xml:lang="en"><quote>Kharga</quote></cit>
...
</sense>
...
How to write/transliterate place names and person names is an age-old problem. When several graphematic variants exist, we attempt to choose the most common one.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">šubra</orth></form> ... <sense><def xml:lang="en">a residential area in Cairo</def><cit type="translation" xml:lang="en"><quote>Shubra</quote></cit><cit type="translation" xml:lang="de"><quote>Schubra</quote></cit></sense> ...
Synonyms are encoded as pointers to other entries in the dictionary. They always have to be encoded inside sense elements.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔaṣl-an</orth></form> ... <sense><xr type="syn"><ref xml:lang="ar-arz-x-cairo-vicavTrans">fi l-ʔaṣl</ref></xr><cit type="translation" xml:lang="en"><quote>initially</quote></cit><cit type="translation" xml:lang="en"><quote>originally</quote></cit></sense> ...
The gramGrp element can accommodate a wide range of grammatical information such as word class (=pos: part-of-speech), the consonantal root and/or the verb class.
<entry xml:id="badal_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">badal</orth></form><gramGrp><gram type="pos">verb</gram><gram type="root">bdl</gram><gram type="derivedVerbClass">III</gram></gramGrp>
...
</entry>
The gramGrp element can appear in two places: when the information refers to the lemma it is put after the form[@type=lemma] element. In many cases, the grammatical information only refers to particular senses. It is then placed inside the sense element as the first item. For the second case have a look at chapters Arguments and Constructions.
The most common POS labels are listed in the following table. Most of them are self-explanatory.
Label | Explanation |
adjective | Adjective |
noun | Noun |
ordinal | Ordinal number |
particle | Particle |
pluralNoun | A plural noun that has an entry of its own. This does not necessarily mean that the singular does not exist, but that the plural displays semantic particularities. |
verb | Verb |
The labels ideally correspond to ISOcat concepts. However, there are exceptions:
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʕarabi</orth></form><gramGrp><gram type="pos">glottonym</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">ʕrb</gram></gramGrp> ...
Roots are indicated in accordance with etymology. The root of the Arabic equivalent of ‘to yawn’ is encoded like this:
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔittāwib</orth></form><gramGrp><gram type="pos">verb</gram><gram type="derivedVerbClass">I-t</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">ṯʔb</gram></gramGrp> ...
Loans of the structure CāC(a) are invariably assigned CʔC.
Word | Root |
bāṣ | bʔṣ |
ḍāma | ḍʔm |
kār | kʔr |
Other loans are reduced to their consonantal skeleton.
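The CāC(a) rule is mechanical enough to be sketched in a few lines of Python. The helper name `loan_root` is hypothetical. The sketch assumes that each consonant is a single character after Unicode NFC normalisation, and it does not cover the consonantal-skeleton rule for other loans.

```python
import re
import unicodedata

# Loans of the shape CāC(a): consonant, long ā (U+0101), consonant,
# optionally followed by a final a.
_CAC = re.compile("^(.)\u0101(.)a?$")


def loan_root(word):
    """Return the CʔC root for a loan of the shape CāC(a), else None."""
    match = _CAC.match(unicodedata.normalize("NFC", word))
    if match is None:
        return None
    first, last = match.groups()
    # U+0294 is the glottal stop used in the transcription.
    return f"{first}\u0294{last}"
```

Applied to the three words in the table above, the function returns the roots listed there; for words of any other shape it returns `None`.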
In multi-word units the single items are separated by blanks.
... <form type="multiWordUnit"><orth xml:lang="ar-apc-x-damascus-vicavTrans">rās ǝž-žabal</orth></form><gramGrp><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">rʔs ǧbl</gram></gramGrp> ...
The following table contains a list of special cases.
Word | Root |
mayy, mayye, māyya, mā ... | mwh |
sana ‘year’ | sn |
istanna | ʔny |
kam | km |
qaddēš, ʔaddēš, ... | qdr |
ʔayn, wēn, fēn, ... | ʔyn |
ʔēmta, ʔimta, ... | mty |
šī, šuwayy ... | šyʔ |
walla ‘or’ | w ʔly |
ʔillā | ʔly |
Prepositions are dealt with in the following manner:
Word | Root |
bi | b |
li | l |
ʔilā | ʔly |
ʕalā | ʕly |
maʕa | mʕ |
fī | fy |
Gender is only indicated with morphologically unmarked feminine common and proper nouns.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">šamᵊs</orth></form><gramGrp><gram type="pos">noun</gram><gram type="gender">feminine</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">šms</gram></gramGrp> ...
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">zēnab</orth></form><gramGrp><gram type="pos">properNoun</gram><gram type="gender">female</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">zynb</gram></gramGrp> ...
When verbs display special developments in the passive voice they can be treated as lemmata in their own right.
... <form type="lemma"><orth xml:lang="ar-x-DMG">šufiya</orth><orth xml:lang="ar">شفي</orth></form><gramGrp><gram type="pos">verb</gram><gram type="derivedVerbClass">I</gram><gram type="voice">passive</gram><gram type="root" xml:lang="ar">شفي</gram><gram type="root" xml:lang="ar-x-DMG">šfy</gram></gramGrp> ...
In the VICAV dictionaries, we apply a mixed system of indicators mainly making use of the labels traditionally used in Arabic linguistics. In cases not covered by this system, we use labels analogous to Woidich 2006 (Das Kairenisch-Arabische).
Traditional | Woidich | Example |
I | | katab |
II | | darris |
III | | zākir |
IV | | ʔalqa |
 | t-I | ʔitkatab |
V | | ʔitdarris |
VI | | ʔitdāra |
 | VIw | tlūṣiq |
VII | | ʔinṣaṛaf |
VIII | | ʔintaẓaṛ |
IX | | ʔiḥmaṛṛ |
X | ista-I, ista-II, ista-III | ʔistaxdim, ʔistirayyaḥ, ʔistabārik |
Quadriliteral verbs are assigned the values Iq (=CaCCaC) and IIq (=taCaCCaC).
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">tbahdal</orth></form><gramGrp><gram type="pos">verb</gram><gram type="derivedVerbClass">IIq</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">bhdl</gram></gramGrp> ...
Nouns which are only used in the plural form are encoded as pluralNouns.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ḥǝlwīyāt</orth></form><gramGrp><gram type="pos">pluralNoun</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">ḥlw</gram></gramGrp> ...
Collective nouns are usually registered with their respective singulative and the plural forms.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baṣal</orth></form><gramGrp><gram type="pos">collectiveNoun</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">bṣl</gram></gramGrp><form type="inflected" ana="#n_unit"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baṣala</orth></form><form type="inflected" ana="#n_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baṣalāt</orth></form> ...
If a collective noun has no singulative this is recorded in the following manner:
... <form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">sfinnārya</orth></form><gramGrp><gram type="pos">collectiveNoun</gram><gram type="usg">has no unit noun</gram><colloc type="countNoun" xml:lang="ar-aeb-x-tunis-vicav">kaʕba</colloc><gram type="root" xml:lang="ar-aeb-x-tunis-vicav">sfnry</gram></gramGrp><sense><cit type="translation" xml:lang="en"><quote>carrots</quote></cit> ... </sense> ...
Elatives should be registered under the respective positive forms.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">kuwayyis</orth></form><form type="inflected" ana="#adj_sg_f"><orth xml:lang="ar-arz-x-cairo-vicavTrans">kuwayyisa</orth></form><form type="inflected" ana="#adj_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">kuwayyisīn</orth></form><form type="inflected" ana="#adj_elative"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔaḥsan</orth></form> ...
In some cases it may make sense to treat a particular elative as a lexeme in its own right.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔaḥsan</orth></form><gramGrp><gram type="pos">elative</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">ḥsn</gram></gramGrp><sense><cit type="translation" xml:lang="en"><quote>better</quote></cit><ptr type="example" target="izzayyi_abuuk__001"/><ptr type="example" target="ahsan_min_balaash_001"/><ptr type="example" target="bukra_tib_a_ahsan_001"/><ptr type="example" target="ahsan_haaga_taaxud_taaks_001"/></sense> ...
Many languages have special words that are placed between a numeral and a counted noun. This word class, also referred to as classifiers, exists in spoken Arabic varieties as well, although it is not as pervasive as in other languages such as many eastern Indo-European languages or the languages of East Asia (Chinese, Korean, Japanese).
If a noun is used with a count noun before numerals this is indicated in a colloc element.
... <form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">brīk</orth><bibl>Ritt-Benmimoun 2014</bibl><bibl>Singer 1984, p.67</bibl></form><gramGrp><gram type="pos">collectiveNoun</gram><gram type="root" xml:lang="ar-aeb-x-tunis-vicav">brk</gram><colloc type="countNoun" xml:lang="ar-aeb-x-tunis-vicav">kaʕba</colloc></gramGrp> ...
An analogous example is this:
... <form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">bṣal</orth><bibl>Ritt-Benmimoun 2014</bibl></form><gramGrp><gram type="pos">collectiveNoun</gram><gram type="root" xml:lang="ar-aeb-x-tunis-vicav">bṣl</gram><colloc type="countNoun" xml:lang="ar-aeb-x-tunis-vicav">ṛāṣ</colloc></gramGrp> ...
Many Arabic dialects have nominals which do not display feminine or plural forms. These are identified with a gram element of type morphType.
...
<gramGrp><gram type="pos">adjective</gram><gram type="morphType">invariable</gram></gramGrp>
...
Adjectives or nouns which are participles derived from verbal forms are furnished with an additional gram element of type morph.
<entry xml:id="mashhur_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">mašhūr</orth></form><gramGrp><gram type="pos">adjective</gram><gram type="morph">passiveParticiple</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">šhr</gram></gramGrp>
...
</entry>
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">mǝtʕāwen</orth></form><gramGrp><gram type="pos">noun</gram><gram type="morph">activeParticiple</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">ʕwn</gram></gramGrp> ...
In many cases it is necessary to furnish information about the contexts or situations in which lexical items are used. Such information is typically to be found in sense elements.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔafandim</orth></form><sense><usg type="prag" xml:lang="en">respectful form of address to a man or a woman</usg><usg type="prag" xml:lang="de">höfliche Anrede an einen Mann oder eine Frau</usg> ... </sense> ...
When not sure whether to apply a usg or def element, ask yourself if it is possible to formulate the information using used as or used in. In the afandim example above one might consider reformulating the usage label as used as a respectful form of address ....
Another example:
...
<sense><usg type="prag" xml:lang="en">to attract the evil eye by talking about something</usg><cit type="translation" xml:lang="en"><quote>to jinx</quote></cit></sense>
...
The following code snippet furnishes a good example of a quite complex situation. The noun has two morphological forms which are mutually exclusive. The grammatical information in both senses is corroborated by examples.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">maṛa</orth></form> ... <sense><usg type="prag" xml:lang="en">often derogatory</usg><usg xml:lang="en">absolute state only</usg><cit type="translation" xml:lang="en"><quote>woman</quote></cit><ptr type="example" target="inti_ya_mara_tacaali_001"/></sense><sense><usg xml:lang="en">construct state only</usg><cit type="translation" xml:lang="en"><quote>wife</quote></cit><ptr type="example" target="miraatu_ga_lha_walad_001"/></sense> ...
Mind that the usg element can also be applied in sample sentences.
...
<cit xml:id="sharraftuuna_allaah_yisharraf_mi_daarak_001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">šaṛṛaftūna. - ʔaḷḷāh yišaṛṛaf miʔdāṛak.</quote><usg type="prag" xml:lang="en">This phrase is used upon leaving.</usg><cit type="literalTranslation" xml:lang="en"><quote>You have honoured us (i.e. visiting us). - God may honour your worth.</quote></cit></cit>
...
Another option is to use usage labels, which can indicate functional constraints.
... <form><orth xml:lang="ar-apc-x-damascus-vicavTrans">la-ḥāl-</orth></form><gramGrp><gram type="usg">only with pronominal suffixes</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">l ḥwl</gram></gramGrp> ...
The values in the xml:lang attributes have been designed in compliance with Best Current Practice 47 (BCP 47) which in turn refers to and aggregates a number of ISO standards (639-1, 639-2, ISO 15924, ISO 3166). The labels used as values in the xml:lang attributes reflect a hybrid system that indicates both linguistic variety and writing system.
Value | Explanation | Used in dictionary |
de | German | |
en | English | |
ar-aeb-x-tunis-vicav | Tunis Arabic, VICAV transcription | aeb_eng_001__v001 |
ar-arz-x-cairo-vicavTrans | Cairo Arabic, VICAV transcription | arz_eng_006 |
ar-arz-x-cairo-arabic | Cairo Arabic, Arabic script | arz_eng_006 |
ar-apc-x-damascus-vicavTrans | Damascus Arabic, VICAV transcription | apc_eng_002 |
ar-ary-x-sale-vicavTrans | Sale Arabic, VICAV transcription | ary_s_rabat_eng_002 |
ar-ary-x-vicavTrans | Moroccan Arabic, VICAV transcription | ary_eng_001 |
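The hybrid structure of these values can be made visible by splitting them at the private-use marker x. The following Python sketch only separates the parts; it does not validate a tag against BCP 47. The helper name `split_lang_tag` and the key names in the result are hypothetical.

```python
def split_lang_tag(tag):
    """Split an xml:lang value into language, variety and private-use parts.

    Hypothetical helper: everything after the singleton "x" is treated
    as private use (here: place and writing system).
    """
    head, sep, private = tag.partition("-x-")
    subtags = head.split("-")
    return {
        "language": subtags[0],
        "variety": subtags[1] if len(subtags) > 1 else None,
        "private_use": private.split("-") if sep else [],
    }
```

For `ar-arz-x-cairo-vicavTrans` this gives the language `ar`, the variety `arz` and the private-use parts `cairo` and `vicavTrans`; plain values such as `de` have neither variety nor private-use parts.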
Etymologies are encoded by means of the etym element. According to our schema, these also have to be top-level elements which are placed after the gramGrp element.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʔaṣanṣēr</orth></form><gramGrp>...</gramGrp><etym>loanword<lang>French</lang><mentioned>ascenseur</mentioned></etym> ...
The applied system is supposed to retain flat hierarchies.
Feminine forms of substantives are encoded as separate entries.
<entry xml:id="gaar_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">gāṛ</orth></form> ... <sense><cit type="translation" xml:lang="en"><quote>neighbour</quote></cit></sense></entry><entry xml:id="gaara_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">gāṛa</orth></form> ... <sense><cit type="translation" xml:lang="en"><quote>neighbour (female)</quote></cit></sense></entry>
Homonyms with diverging morphological forms are also encoded as separate entries.
<entry xml:id="beet_000"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">bēt</orth></form><gramGrp><gram type="pos">noun</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">byt</gram></gramGrp><form type="inflected" ana="#n_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">biyūt</orth></form><sense><cit type="translation" xml:lang="en"><quote>house, home</quote></cit></sense></entry><entry xml:id="beet_001"><form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">bēt</orth></form> ... <form type="inflected" ana="#n_pl"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʔabyāt</orth></form><sense><cit type="translation" xml:lang="en"><quote>verse (in poem)</quote></cit></sense></entry>
In Arabic, there is often no clear delineation between nouns and adjectives. The issue is particularly tricky with nominals derived by means of the Nisba suffix. In the VICAV dictionaries, these are treated in four categories: adjectives, masculine nouns, feminine nouns and the special case of glottonyms. By adjectives, we understand nominals that are mostly used attributively and usually display masculine and feminine forms:
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">sūri</orth></form><gramGrp><gram type="pos">adjective</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">swr</gram></gramGrp><form type="inflected" ana="#n_f"><orth xml:lang="ar-apc-x-damascus-vicavTrans">sūriyye</orth></form><form type="inflected" ana="#n_pl"><orth xml:lang="ar-apc-x-damascus-vicavTrans">sūriyyīn</orth></form><sense><cit type="translation" xml:lang="en"><quote>Syrian</quote></cit> ... </sense> ...
Cases such as the following are categorised as nouns:
<form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">šāmi</orth></form><gramGrp><gram type="pos">noun</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">šʔm</gram></gramGrp><form type="inflected" ana="#n_pl"><orth xml:lang="ar-apc-x-damascus-vicavTrans">šwām</orth></form><sense><cit type="translation" xml:lang="en"><quote>Damascene</quote></cit><cit type="translation" xml:lang="de"><quote>Damaszener</quote></cit><cit type="translation" xml:lang="es"><quote>damasceno, de Damasco</quote></cit></sense> ...
The respective feminine form looks like this:
<form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">šāmiyye</orth></form><gramGrp><gram type="pos">noun</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">šʔm</gram></gramGrp><form type="inflected" ana="#n_pl"><orth xml:lang="ar-apc-x-damascus-vicavTrans">šāmiyyāt</orth></form><sense><cit type="translation" xml:lang="en"><quote>Damascene woman</quote></cit><cit type="translation" xml:lang="de"><quote>Damaszenerin</quote></cit><cit type="translation" xml:lang="es"><quote>damascena, de Damasco</quote></cit></sense> ...
Ethnonyms, demonyms and similar proper nouns are also placed in separate entries:
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʕarabi</orth></form><gramGrp><gram type="pos">glottonym</gram><gram type="root" xml:lang="ar-apc-x-damascus-vicavTrans">ʕrb</gram></gramGrp><sense><usg type="dom">linguistics</usg><cit type="translation" xml:lang="en"><quote>Arabic</quote></cit></sense> ...
Diptosy is relevant to the MSA dictionary only. It is indicated either on the lemma form or on plural forms. In the first case the information is stored in the gramGrp element.
... <form type="lemma"><orth xml:lang="ar-x-DMG">ˀazraq</orth><orth xml:lang="ar">ازرق</orth></form><gramGrp><gram type="pos">adjective</gram><gram type="morphType">diptotic</gram> ... </gramGrp> ...
For plural forms, we make use of a special n_diptPl value.
...
<form type="inflected" ana="#n_diptPl"><orth xml:lang="ar-x-DMG">rasāˀil</orth><orth xml:lang="ar">رسائل</orth></form>
...
The relation between a lexical item and its dependents is also encoded making use of the gramGrp element. In contrast to the cases dealt with before, this gramGrp element is placed not at the top level of the entry but inside sense elements.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">dawwaṛ</orth></form><sense><gramGrp><gram type="arguments" xml:lang="ar-apc-x-damascus-vicavTrans">ʕala</gram></gramGrp><cit type="translation" xml:lang="en"><quote>to look for, to search</quote></cit></sense> ...
Multi-word units on the level of independent dictionary entries have to be distinguished from constructions that correlate with particular senses. These are encoded in form elements. Consider the following example:
...
<sense><form type="construction"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baʔa + li- +
<seg type="constrPart">pronSuffix | noun</seg> +
<seg type="constrPart">timeExpression</seg></orth></form>
...
</sense>
...
Information regarding the translation of the particular construction is represented as in other senses:
...
<sense><form type="construction"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baʔa + li- +
<seg type="constrPart">pronSuffix | noun</seg> +
<seg type="constrPart">timeExpression</seg></orth></form><cit type="translation" xml:lang="en"><quote>since, for</quote></cit></sense>
...
One sense can have several instantiations of a construction.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʔalᵊb</orth></form> ... <sense><cit type="translation" xml:lang="en"><quote>heart</quote></cit> ... </sense><sense><form type="construction"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʔalb + <seg type="constrPart">pronSuffix</seg> + ṭayyeb </orth></form><form type="construction"><orth xml:lang="ar-apc-x-damascus-vicavTrans">ʔlūb + <seg type="constrPart">pronSuffix</seg> + ṭayybīn </orth></form><cit type="translation" xml:lang="en"><quote>to be kind-hearted</quote></cit> ... </sense> ...
An analogous case is the following one:
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʕand</orth></form> ... <sense><cit type="translation" xml:lang="en"><quote>with, by, next to</quote></cit></sense><sense><form type="construction"><orth xml:lang="ar-arz-x-cairo-vicavTrans">ʕand + <seg type="constrPart">pronSuffix</seg></orth></form><cit type="translation" xml:lang="en"><quote>to have</quote></cit></sense> ...
It is not always easy to decide whether to put information into a construction or a sample sentence. By construction, we understand strings of words with variable elements. They can be conceived as patterns with particular slots holding variables. Sample sentences would then be particular instantiations of such a pattern.
<entry xml:id="tbaarek_001"><form type="lemma"><orth xml:lang="ar-ary-x-sale-vicavTrans">tbāṛek</orth><orth xml:lang="ar-ary-x-sale-vicavTrans">تبارك</orth></form><gramGrp><gram type="root" xml:lang="ar-ary-x-sale-vicavTrans">brk</gram></gramGrp><sense><form type="construction"><orth xml:lang="ar-ary-x-sale-vicavTrans">tbāṛek + aḷḷāh + ʕla +
<seg type="constrPart">pronSuffix | noun</seg></orth></form><cit type="translation" xml:lang="en"><quote>how nice is ...! how wonderful ...!</quote></cit><cit type="translation" xml:lang="de"><quote>-</quote></cit></sense></entry>
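To make the reading of constructions as patterns with slots concrete, the following Python sketch flattens a construction into a one-line pattern in which every constrPart segment appears as a bracketed slot. It assumes namespace-free markup; `SAMPLE` copies the baʔa construction shown earlier, and `construction_pattern` and the angle-bracket notation are hypothetical.

```python
import xml.etree.ElementTree as ET

# The baʔa construction shown earlier in this section.
SAMPLE = """<form type="construction"><orth xml:lang="ar-arz-x-cairo-vicavTrans">baʔa + li- +
<seg type="constrPart">pronSuffix | noun</seg> +
<seg type="constrPart">timeExpression</seg></orth></form>"""


def construction_pattern(form):
    """Flatten a construction into one line, marking slots as <...>."""
    orth = form.find("orth")
    parts = [orth.text or ""]
    for seg in orth.findall("seg"):
        parts.append(f"<{seg.text}>")   # variable slot
        parts.append(seg.tail or "")    # fixed material after the slot
    # Collapse the line breaks and runs of blanks inside orth.
    return " ".join("".join(parts).split())


pattern = construction_pattern(ET.fromstring(SAMPLE))
```

A sample sentence would then be one particular filling of the bracketed slots, which is why it belongs in an example record rather than in the construction.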
Information concerning the level of formality is tagged using the usg element with a type="reg" attribute. This may occur in three positions of an entry.
Information qualifies ... | Tag is placed in ... | Scope |
lemma | gramGrp | The information refers to the lexical item as a whole. |
other word forms | form | The information refers to a particular form only. |
sense | sense | The information refers to a particular sense only. |
Some words are usually used on formal occasions.
... <form type="lemma"><orth xml:lang="ar-arz-x-cairo-arabic">ثقافة</orth></form><gramGrp><gram type="pos">noun</gram><gram type="root" xml:lang="ar-arz-x-cairo-vicavTrans">ṯqf</gram><usg type="reg">formal</usg></gramGrp> ...
Many nouns have several plural forms. This apparent overabundance can often be explained by varying degrees of formality.
... <form type="inflected" ana="#v_pp_m"><usg type="reg">informal</usg><orth xml:lang="ar-arz-x-cairo-vicavTrans">mittihim</orth></form><form type="inflected" ana="#v_pp_m"><usg type="reg">formal</usg><orth xml:lang="ar-arz-x-cairo-vicavTrans">muttaham</orth></form> ...
Basic semantic classifications are stored in ‘domain’ labels which are placed inside sense elements. They are put right at the beginning of the sense elements. Senses can be assigned multiple such labels.
... <form type="lemma"><orth xml:lang="ar-apc-x-damascus-vicavTrans">fūl</orth></form> ... <sense><usg type="dom">food</usg><usg type="dom">plants</usg><cit type="translation" xml:lang="en"><quote>beans</quote></cit></sense> ...
As we have seen before, examples are separate records. They are not encoded with the lemmas. Examples are always linked to particular senses. They are referenced through ptr “pointer” elements which are put at the end of the respective sense element.
...
<sense><cit type="translation" xml:lang="en"><quote>to be, to become</quote></cit>
...
<ptr type="example" target="yibqa_ustaz_001"/></sense>
...
The example referenced in the above ptr element looks like this:
<cit xml:id="yibqa_ustaz_001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">ḥayibʔa ustāz in šāʔ allāh.</quote><cit type="translation" xml:lang="en"><quote>He will become a professor (hopefully).</quote></cit></cit>
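How such a pointer is followed can be sketched in Python. The snippet assumes namespace-free markup and targets written without a leading #, as in the examples above. The entry ID `baqa_001` in `SAMPLE` and the helper `resolve_examples` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature dictionary body; the entry ID is invented.
SAMPLE = """<body>
<div type="entries"><entry xml:id="baqa_001"><sense>
<cit type="translation" xml:lang="en"><quote>to be, to become</quote></cit>
<ptr type="example" target="yibqa_ustaz_001"/></sense></entry></div>
<div type="examples"><cit xml:id="yibqa_ustaz_001" type="example">
<quote xml:lang="ar-arz-x-cairo-vicavTrans">ḥayibʔa ustāz in šāʔ allāh.</quote>
<cit type="translation" xml:lang="en"><quote>He will become a professor (hopefully).</quote></cit></cit></div>
</body>"""


def resolve_examples(root):
    """Return (target, quoted text) pairs for all example pointers.

    The text is None when no example with the given ID exists.
    """
    index = {cit.get(XML_ID): cit
             for cit in root.iter("cit") if cit.get("type") == "example"}
    resolved = []
    for ptr in root.iter("ptr"):
        if ptr.get("type") != "example":
            continue
        target = ptr.get("target", "").lstrip("#")
        cit = index.get(target)
        resolved.append((target, cit.findtext("quote") if cit is not None else None))
    return resolved


resolved = resolve_examples(ET.fromstring(SAMPLE))
```

A pair whose second element is `None` would indicate a dangling pointer, which makes the same function usable as a simple consistency check.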
To add such a link follow these steps:
1. | Go to the example entry. | The focus has to be in the editor. |
2. | Copy the ID to the clipboard | ... by pushing F11. Make sure that this key is defined in your list of key assignments. |
3. | Go to the entry into which you want to insert the link. | |
4. | Move to the insert position in the appropriate sense element. | ptr should be at the end of a sense element. |
5. | Insert the pointer | ... by pushing Ctrl + V. |
Some of our dictionaries contain bibliographic references concerning the source of particular entry components. This type of information is typically encoded in bibl elements.
... <bibl><author>Ritt-Benmimoun</author><date>2012/2013</date></bibl> ... <bibl><author>Singer</author><date>1958</date><biblScope unit="page">56</biblScope></bibl> ...
During the production stage, abbreviated versions are permissible.
...
<bibl>Singer 1958, p.56</bibl>
...
Ideally, the sources should be resolved in the header.
bibl may be embedded in form elements. A form element can contain several bibl elements.
...
<form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">markaz</orth><bibl>Ritt-Benmimoun 2014</bibl><bibl>Singer 1958, p.34</bibl></form>
...
Furthermore, the bibl element can be placed inside cit elements which are used to encode usage examples.
<cit xml:id="limcit_cineeh_001" type="example"><quote xml:lang="ar-arz-x-cairo-vicavTrans">limʕit ʕinēh.</quote>
...
<bibl>4/82</bibl></cit>
The third option is the sense element. As bibl cannot be placed directly in sense, it has to be wrapped in an xr element:
...
<sense><cit type="translation" xml:lang="en"><quote>to call to prayer</quote></cit><cit type="translation" xml:lang="de"><quote>zum Gebet rufen</quote></cit><xr><bibl>8/10</bibl></xr><xr><bibl>19/205</bibl></xr></sense>
...
By convention, all bibl elements are placed at the end of the containing element.
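This ordering convention lends itself to an automatic check. The Python sketch below tests whether all bibliographic references of an element come last, treating an xr that wraps a bibl like a bibl. It assumes namespace-free markup. `GOOD` copies the markaz form shown above, while `BAD` and the helper `bibl_comes_last` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The markaz form shown above.
GOOD = """<form type="lemma"><orth xml:lang="ar-aeb-x-tunis-vicav">markaz</orth><bibl>Ritt-Benmimoun 2014</bibl><bibl>Singer 1958, p.34</bibl></form>"""

# Hypothetical counter-example: a reference precedes the orth element.
BAD = """<form type="lemma"><bibl>Singer 1958, p.34</bibl><orth xml:lang="ar-aeb-x-tunis-vicav">markaz</orth></form>"""


def bibl_comes_last(element):
    """Check that no other child follows a bibl (or an xr wrapping one)."""
    seen_bibl = False
    for child in element:
        is_bibl = child.tag == "bibl" or (
            child.tag == "xr" and child.find("bibl") is not None)
        if is_bibl:
            seen_bibl = True
        elif seen_bibl:
            return False
    return True


good_ok = bibl_comes_last(ET.fromstring(GOOD))
bad_ok = bibl_comes_last(ET.fromstring(BAD))
```

The check passes for the markaz form and fails for the reordered counter-example.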
The Guidelines of the Text Encoding Initiative are currently available in eight languages.