Automatic Completion Of Computational Linguistic Resources

Published Date: 02 Nov 2017

Author1 (1), Author2 (2), Author3 (3)

(1) INSTITUTE_1, address 1

(2) INSTITUTE_2, address 2

(3) INSTITUTE_3, address 3

Author1@mail1, Author2@mail2, Author3@mail3

Abstract

This article describes and implements a new methodology for studying problems in computational derivational morphology, based on the algorithmization of certain linguistic mechanisms, e.g., affix substitution, derivatives projection, derivational constraints and formal derivational rules. The established mechanisms allowed the elaboration of algorithms and corresponding programs, which in turn generated a significant number of derivatives with different affixes.

Keywords: computational linguistic resources, derivational algorithm, affix, automatic derivative generation, generative derivational mechanisms.

Introduction

Linguistic resources are the main support for the development of automatic tools for linguistic information processing. One of the most important aspects is the study of the problems related to automating the creation of linguistic resources. The need to enrich lexical resources is satisfied not only by borrowing words from other languages, but also by exclusively internal processes. The most important ways of word formation are inflection, derivation and compounding. In this article we study derivation.

The particularities of derivational morphology mechanisms help to extend lexical resources without any semantic information. The approaches and mechanisms presented in the paper have been studied on examples from the Romanian language, whose affixes were inherited from different sources, namely Latin, Slavic, Greek, and others. That is why the majority of cases can be applied to other languages. Moreover, there are similar processing mechanisms for different languages spoken in Europe, namely English, French, Spanish, Russian and Romanian. The results of the investigation can be used in different domains of natural language processing, for example stemming, language detection, machine translation and information retrieval.

The purpose of the research presented in this article is to study the mechanisms and elaborate algorithms for the automatic generation of derived words for resource completion.

The paper is structured in the following way. First we present the stages in the procedural completion of computational linguistic resources. Then the derivational particularities of the Romanian language are described, covering the collection of Romanian affixes, the relation between the affixes and the parts of speech of the derivatives, the conclusions on establishing the correspondence "inflection group – derivation group", the automatic derivative recognition algorithm, the extension of the linguistic database for studying the derivational process, and the features of consonantal and vocalic alternations in the process of derivation. A separate section is dedicated to multilingual approaches to derivative generation, which consist of a method for establishing candidate words for derivation, affix substitution, formal models, derivatives' projection and derivational constraints. As derivation is an overgenerating mechanism, a method of validating the generated derivatives is presented. Finally, we present an algorithm of automatic lexical derivation together with the results of the completion of computational linguistic resources.

Procedural completion of the computational linguistic resources

This study aims to exploit existing resources in such a way that it is possible to generate the lexical derivational families of the Romanian language. Comparing the intentions of this work with the Italian model (Carota, 2006) and its derivational morphology, we observe a reversal of priorities: in the Italian model the thesaurus is organized so that the derivative families already present in the resource can be extracted.

The subject of the research is the procedural method, for which it is necessary to establish rules so that the derivatives can be obtained in an algorithmic way from root/stem (Boian et al., 1994).

Considering the productive properties of the derivation process, the lexicon can be completed by automated means (Boian et al., 2011a). Schematically this process is a cycle (Figure 1), which can be applied several times starting from the existing lemmas of the lexicon (Cojocaru et al., 2009). After a limited number of cycles it may happen that no new words are produced; the lexicon is then completely "saturated" in terms of derivation.

Figure 1 – The scheme of procedural completion of the computational linguistic resources.
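The cycle can be read as a fixed-point iteration. The following is a minimal sketch under simplifying assumptions: `generate` and `validate` are hypothetical callbacks standing in for the derivation and validation steps, and the toy rule at the end is invented purely for illustration.

```python
from typing import Callable, Iterable, Set


def saturate(lexicon: Iterable[str],
             generate: Callable[[str], Iterable[str]],
             validate: Callable[[str], bool],
             max_cycles: int = 10) -> Set[str]:
    """Repeat generate-and-validate cycles until no new word is added."""
    known = set(lexicon)
    frontier = set(known)          # words not yet used as derivation bases
    for _ in range(max_cycles):
        new_words = {w for base in frontier for w in generate(base)
                     if w not in known and validate(w)}
        if not new_words:          # nothing new: the lexicon is saturated
            break
        known |= new_words
        frontier = new_words       # only new words can yield further derivatives
    return known


# Toy rule (hypothetical): prefix "re" once to any word that does not start with it.
toy = saturate({"face", "citi"},
               generate=lambda w: [] if w.startswith("re") else ["re" + w],
               validate=lambda w: True)
```

With this toy rule the loop stops after the second cycle, because the words added in the first cycle produce nothing new.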

Initially the problems of automatic inflection were studied and solved, because inflection behaves more regularly than derivation. Inflection makes it possible to generate valid words, whereas derivation creates new words whose semantic correctness, in general, cannot be guaranteed. Although derivation shares some specific features with inflection, the problem of automatic derivation is considerably more complex.

In this article we tried to find word classes for which this formalization is possible. It is then necessary to identify these word classes, to establish their characteristics and to formulate derivation rules. The automatic derivation process therefore requires preliminary experiments that allow the mechanisms governing the behaviour of Romanian affixes to be deduced. We work with the 3 Romanian computational resources that are most reliable for our purpose: DMLR (Morphological Dictionary of the Romanian Language, in electronic version), RRTLN (Reusable Resources of Natural Language Technology) and eDCD (Dictionary of derivative words in electronic version, adapted to the needs of studying mechanisms and elaborating algorithms for the automatic generation of derived words). DMLR is a significant resource for Romanian and is a morphological dictionary (Lombard and Gâdei, 1981). It contains about 30000 words belonging to various parts of speech, such as nouns, adjectives and verbs, which are divided into classes according to the way their inflected forms are built. An example of an entry in DMLR is:

echilibra V201

where (a) echilibra is the word base that means (to) equilibrate, and V201 denotes inflection class, in this case: the verb group 201 (Cojocaru, 1997).

RRTLN [1] contains a database of morphological linguistic information and a set of programs that manage the database (Boian et al., 2005). Thus, the thesaurus contains not just parts of speech, but also information about the categories and the possible morphological analyses of syntactic functions. RRTLN has about 100000 lemmas and about 1000000 inflected forms. It should be mentioned that a word can have several entries for different parts of speech, because of different meanings, e.g., the word "bun" means good as an adjective, approving as an adverb and property as a noun.

eDCD contains only the list of derivatives and their constituent morphemes, without information about the part of speech of the derivatives and their morphemes, although the vast majority are nouns, verbs and adjectives. eDCD was obtained by scanning the paper version, applying OCR and correcting the result against the original entries. eDCD allows the detection of the morphemes of a derivative together with their type (prefix, root and suffix) (Petic, 2009). For easier processing of the lexicon entries, a regular expression was developed that represents the following derivative structure:

derivative = (+morpheme)*.morpheme(-morpheme)*

where +morpheme represents a prefix, .morpheme is a stem, and -morpheme is a suffix. Examples of entries in the lexicon are:

antistatal=+anti.stat-al

reprogramabil=+re.programa-bil
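Entries of this shape can be split mechanically. The sketch below is one possible reading of the structure given above, not the expression used by the authors; the function name `parse_entry` and the returned field names are hypothetical, and the sketch assumes that the markers `+`, `.` and `-` never occur inside a morpheme.

```python
import re
from typing import Dict, List

# One morpheme: a type marker (+ prefix, . stem, - suffix) followed by letters.
MORPHEME = re.compile(r"([+.\-])([^+.\-]+)")
# Whole segmentation: any prefixes, exactly one stem, any suffixes.
STRUCTURE = re.compile(r"(\+[^+.\-]+)*\.[^+.\-]+(-[^+.\-]+)*")


def parse_entry(line: str) -> Dict[str, object]:
    """Split 'word=+prefix.stem-suffix' into its typed morphemes."""
    word, segmentation = line.strip().split("=", 1)
    if not STRUCTURE.fullmatch(segmentation):
        raise ValueError(f"unexpected segmentation: {segmentation!r}")
    parts: Dict[str, List[str]] = {"+": [], ".": [], "-": []}
    for marker, text in MORPHEME.findall(segmentation):
        parts[marker].append(text)
    return {"word": word, "prefixes": parts["+"],
            "stem": parts["."][0], "suffixes": parts["-"]}


entry = parse_entry("antistatal=+anti.stat-al")
```

Here `entry["prefixes"]` is `["anti"]`, `entry["stem"]` is `"stat"` and `entry["suffixes"]` is `["al"]`.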

Thus the basis for generating new derivatives is an existing lexicon. The lexicon should contain not only graphical representation of the words, but also their parts of speech, inflection class and constituent morphemes. In the brief description above we see that the information in each of these three computational linguistic resources is different. Therefore, this article will present several studies that will use several resources simultaneously.

Derivational particularities of Romanian language

As stated above, the automatic generation of derivatives for inflective natural languages presents enormous difficulties, caused by the impossibility of completely formalizing the semantic aspects of the derivational process. Having analysed the approaches applied to the derivational morphology of other languages, we concluded that solving the problem of automatic derivation for Romanian requires the following research and elaborations:

The establishing of quantitative and qualitative features of the derivatives;

The processing of several lexicons in order to adapt them to our aim;

The elaboration of the algorithm of automatic recognition of the derivatives.

Collection of Romanian affixes

A word that is formed by adding a prefix or suffix is called a derivative (Carstairs-McCarthy, 2010). Any morpheme that is outside the root of the word is called an affix. Depending on their position relative to the root, affixes are divided into two categories:

Placed before the root (prefixes);

Attached at the end of the root (suffixes).

The prefixes with the most derivatives (in descending order of frequency) are: ne-, re-, în-, des-, pre-, anti-, auto-, sub-, dez-, supra-, de- and îm-. These 12 prefixes, out of 42, form 88.2% of all derivatives with prefixes recorded in eDCD (Petic, 2010b).

The suffixes with the most derivatives (in descending order of frequency) are: -re, -tor, -toare, -eală, -ie, -ătoare, -iza, -oasă, -ar, -ător, -ească, -os, -aş, -esc, -tură, -iţă, -ist, -uţă, -el, -i, -ui, -ătură, -eşte, -ism, -a, -ărie, -ică, -ime, -itate, -ioară, -işor, -işoară, -ic, -uleţ, -că, -ean, -iş, -easă, -bil, -uţ, -at, -oaică, -uşor, -an, -oi, -uliţ, -iu, -enie, -istă, -al, and -ea. These 51 suffixes, out of the 433 recorded in eDCD, form 87.7% of all derivatives with suffixes. The other suffixes have an insignificant number of derivatives.

The set of derivatives with a common root and meaning is a lexical family. Based on the structure of eDCD entries and the particularities of the derivatives, an algorithm of polynomial complexity O(n² log n), where n is the number of derivatives in eDCD, was elaborated and implemented in order to extract lexical families. The largest lexical families are those of the roots bun (eng. good, 32 derivatives with the prefixes: stră-, îm-, ne- and în-; and the suffixes: -el, -ețe, -ătate, -ic, -uță, -icea, -icel, -icică, -ișoară, -ișor, -iță, -uț, -re, -i, -ariță, -atic, -eală, -ească, -esc, -ește, -toare, -tor, and -ie); alb (eng. white, 25 derivatives with the prefix: în-; and with the suffixes: -eață, -ei, -eț, -eală, -icioasă, -icios, -iliță, -re, -i, -ime, -ineață, -ineț, -ior, -ișor, -itoare, -itor, -ie, -itură, -iță, -ui, -uie, -uleț, -uș, and -uț); șarpe (eng. snake, 22 derivatives without any prefixes and with the suffixes: -ar, -aș, -ărie, -ească, -esc, -ește, -ișor, -oaică, -oaie, -oi, -ui, -eală, -re, -toare, -tor, -tură, -urel, and -ușor); roată (eng. wheel, 22 derivatives without any prefixes and with the suffixes: -ar, -easă, -ie, -it, -iță, -aș, -at, -ată, -i, -cică, -re, -tor, -toare, -tură, -ilă, -at, -iș, -ocoală, and -ocol); om (eng. human, 20 derivatives with the prefixes: ne- and supra-; and with the suffixes: -ime, -oasă, -os, -oi, -uleț, -ușor, -ească, -esc, -ește, and -ie) (Petic, 2010b). At the other extreme, there are over 3000 roots with a single derivative. Seven prefixes, namely a-, arhe-, para-, dis-, i-, im- and întru-, were found that are not attached directly to roots, but only to stems. There are also several suffixes that are not attached directly to roots.
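Grouping derivatives into families can be sketched as follows. This is not the O(n² log n) algorithm mentioned above; it is a simpler hash-based grouping that assumes every derivative has already been segmented and that the root alone identifies the family (a shared meaning is taken for granted). The function names and the sample pairs are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def lexical_families(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (derivative, root) pairs by root; each group is one family."""
    families: Dict[str, List[str]] = defaultdict(list)
    for derivative, root in entries:
        families[root].append(derivative)
    return dict(families)


def largest_families(families: Dict[str, List[str]],
                     top: int = 3) -> List[Tuple[str, int]]:
    """Return the roots with the most derivatives, largest first."""
    ranked = sorted(families.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(root, len(words)) for root, words in ranked[:top]]


# Illustrative pairs only, not an extract from eDCD.
sample = [("bunic", "bun"), ("bunătate", "bun"), ("nebun", "bun"),
          ("albi", "alb"), ("albeață", "alb"), ("omenie", "om")]
ranking = largest_families(lexical_families(sample), top=2)
```

Roots that end up with a single derivative correspond to the one-member families mentioned above.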

The relation between the affixes and the parts of speech of the derivatives

In order to elaborate formal generative models of derivatives with implicit morphological features, it was necessary to study how the part of speech of a derivative depends on its affixes. That is why we examined 3 models of derivation:

derivative=prefix+root/stem,

derivative=root/stem+suffix,

derivative=prefix+root/stem+suffix.

Besides this classification, it was observed that some affixes can and some cannot change the part of speech of a word in the process of derivation. We studied verbs, nouns, adjectives and adverbs as parts of speech of the stems and the derivatives. Other parts of speech were not studied because they are not frequent in the process of derivation.

eDCD and RRTLN served as sources for this research. A special program was developed to extract the derivatives whose part of speech does or does not differ from that of the root/stem. The program is based on an iterative algorithm of O(n·m) complexity, where n is the number of derivatives in eDCD and m is the number of derivatives in RRTLN.
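The comparison of parts of speech can be sketched as below. The sketch makes two assumptions that the text does not state: a pair counts as "unchanged" when stem and derivative share at least one part of speech, and pairs with a word missing from the part-of-speech table are skipped. It also uses a dictionary lookup rather than the iterative O(n·m) scan described above.

```python
from typing import Dict, Iterable, Set, Tuple


def pos_change_rate(pairs: Iterable[Tuple[str, str]],
                    pos_of: Dict[str, Set[str]]) -> float:
    """Percentage of (stem, derivative) pairs whose parts of speech differ."""
    changed = total = 0
    for stem, derivative in pairs:
        if stem not in pos_of or derivative not in pos_of:
            continue                      # no part-of-speech information
        total += 1
        if not (pos_of[stem] & pos_of[derivative]):
            changed += 1                  # no part of speech in common
    return 100.0 * changed / total if total else 0.0


# Toy data: one suffixed and one prefixed derivative of the verb "citi".
toy_pos = {"citi": {"verb"}, "cititor": {"noun"}, "reciti": {"verb"}}
rate = pos_change_rate([("citi", "cititor"), ("citi", "reciti")], toy_pos)
```

In the toy data the suffixed derivative changes its part of speech and the prefixed one does not, so `rate` is 50.0.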

| The model | Number of derivatives | % of changing |
| --- | --- | --- |
| derivative=prefix+root/stem | 730 | 15.6 |
| derivative=root/stem+suffix | 8076 | 63.1 |
| derivative=prefix+root/stem+suffix | 487 | 94.1 |
| TOTAL/AVERAGE | 9293 | 61.0 |

Table 1 – The dependence of the affixes to part of speech of derivatives.

The results obtained (Table 1) show that the second model is the most frequent. The part of speech changes most often in the third model and least often in the first model, i.e., in prefixation. Overall, only 39% of roots/stems keep their part of speech in the process of derivation.

Establishing the correspondence "inflection group – derivation group"

In order to process Romanian derivatives more efficiently, we studied the possibility of grouping derivatives in correspondence with inflection groups. We first analysed the grouping of derivatives with prefixes, then of derivatives with suffixes.

The idea was inspired by the Serbian system (Duško and Krstev, 2005), where a correspondence "inflection group – derivation group" was established. The results of this study can be included in the list of lexical constraints for the process of automatic lexical derivation.

The derivatives were extracted separately for every affix and compared with the inflection groups. Using the inflection groups from DMLR and the derivatives from eDCD, an attempt was made to detect derivation groups. DMLR already contains many derivatives.

First, the inflection groups of the roots that correspond to derivatives with prefixes and without suffixes were established. For every prefix, the most frequent inflection group of the derivatives' roots was determined. Second, the inflection groups of the roots that correspond to prefixed derivatives that had first been derived with suffixes were extracted.

In order to decide to which roots from the morphological dictionary a given affix can be attached, special programs were developed that extracted the derivatives separately for every suffix and prefix and then compared them with the inflection groups.

Considering the computational linguistic resources we work with, we concluded that it is impossible to establish derivation groups based only on inflection groups. Nevertheless, the study illustrates that it is possible to limit the number of inflection groups that correspond to derivation with a given prefix or suffix, by eliminating the inflection groups that do not occur with that affix.

Automatic derivative recognition

In the sections above we processed the derivatives from eDCD, a resource that contains not only the list of derivatives but also their constituent morphemes. The situation is more complicated when the segmentation of a derivative is not available. That is why it is important to have a mechanism for recognizing derivatives.

In the process of derivative recognition it is possible to discover correct words that are not attested in dictionaries. In addition, derivative recognition amounts to affix detection, and since an affix belongs to a concrete language, it can also support language detection.

The source for automatic derivative recognition is a lexicon that contains not only the graphic representation of Romanian words but also their part of speech. This lexicon consists of approximately 100000 base words, and a word can have several entries for different parts of speech. Besides the lexicon, a set of prefixes with their phonological forms and a set of suffixes were used.

Since not all words end (begin) with the same suffixes (prefixes), algorithms were elaborated to enable the automatic extraction of derivatives from the lexicon. They rely on the following observation. Let $x, y \in \Sigma^+$, where $\Sigma^+$ is the set of all possible roots. If $y = xv$ then $v$ is a suffix of $y$; if $y = ux$ then $u$ is a prefix of $y$. Here both $y$ and $x$ must be valid words of the Romanian language, and $u$ and $v$ must be strings attested as Romanian affixes. Consonant and/or vowel alternations were neglected in the derivative extraction algorithm, so not all derivatives can be detected exactly (Petic, 2010b).
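The two conditions y = xv and y = ux translate directly into a split-and-lookup procedure. The sketch below is a minimal illustration with a hypothetical function name and toy data; like the algorithm described above, it ignores consonant and vowel alternations.

```python
from typing import List, Set, Tuple


def recognize(word: str, lexicon: Set[str],
              prefixes: Set[str], suffixes: Set[str]) -> List[Tuple[str, str, str]]:
    """Return (prefix, base, suffix) splits where the base is a lexicon word."""
    found = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in prefixes and tail in lexicon:     # y = ux, u is a prefix
            found.append((head, tail, ""))
        if head in lexicon and tail in suffixes:     # y = xv, v is a suffix
            found.append(("", head, tail))
    return found


# Toy lexicon and affix sets, for illustration only.
lexicon = {"citi", "face", "bun"}
prefixes = {"re", "ne"}
suffixes = {"tor"}
splits = recognize("cititor", lexicon, prefixes, suffixes)
```

A word such as adunător would not be recognized from aduna by this sketch, because the stem-final vowel is elided; this is the kind of case that the neglect of alternations leaves undetected.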

More precisely, the following word formation scheme expresses the particularities of prefixation:

$[\,\text{prefix}\ [\text{stem}]_{p}\,]_{p}$

where p represents the part of speech of both the stem and the derivative. Note that in this scheme prefixation does not change the part of speech of the derivative. In suffixation there are cases where the part of speech changes (for example, (a) citi → cititor in Romanian, (to) read → reader in English), as shown in the following word formation scheme:

$[\,[\text{stem}]_{p_1}\ \text{suffix}\,]_{p_2}$

The algorithm for automatic derivative recognition was elaborated considering the peculiarities of Romanian affixes and derivatives. The program based on this algorithm was developed and tested on 300 words from the RRTLN database; 76% of the automatically recognized derivatives were correct (Boian et al., 2011b).

RRTLN database extension for derivational process studying

Existing computational linguistic resources are the main element in the development of an automatic derivative generator. In this case, lexicons are not simply lists of words: they also contain morphological information and references to the existing prefixes and suffixes, with their descriptions (Boian et al., 2011a).

As there is no universal algorithm for segmenting a derivative into morphemes, we decided to study the structure of the RRTLN database in order to complete it with lists of prefixes, suffixes, stems/roots and the relations among them.

First, the words that are present in eDCD were selected. A special program was elaborated, using an iterative algorithm of O(n·log m) complexity, where n is the number of words in eDCD and m is the number of words in the RRTLN database, which identified over 13000 derivatives among the words of RRTLN. Over 2000 derivatives from eDCD are not present in RRTLN.

As concluded before, in order to generate new words it is important to know the part of speech of the derivative and of its root/stem. After preprocessing the existing computational linguistic resources, an extension to the RRTLN database was developed (using MySQL as the database management system), based on the derivatives from eDCD. Separate tables were added to RRTLN and filled with information identified automatically with the help of specially developed programs and eDCD, which contributed to the study of the automatic generation of derivatives (Boian et al., 2011a). As a result we created 4 tables that contain:

A list of 41 prefixes;

A list of 420 suffixes;

A list of 22045 roots/stems/derivatives;

Relations among affixes and roots/stems that form 15297 derivatives.

As can be seen, the newly created tables lack information about the parts of speech. This information can be extracted from the already existing RRTLN tables, where a word can have several entries, and therefore several parts of speech, depending on its meaning. The table of relations between affixes and roots/stems has reserved fields for 3 prefixes and 4 suffixes, which is sufficient because eDCD has derivatives with at most 2 prefixes (for example, dez/ră/suci, eng. untwist, pre/în/noi, eng. restore) and 3 suffixes (for example, loc/al/iza/re, eng. localization) (Boian et al., 2011a).

With this structure elaborated and attached to RRTLN, we can define functions and queries that permit:

The extraction of the derivatives by a prefix, root/stem or suffix;

The extraction of lexical families for a root/stem;

The establishing of the part of speech of the derivative and of the root/stem;

The determination of the alternations that took place in the process of derivation, etc.
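The four tables and one of the queries listed above can be sketched as follows. All table and column names are hypothetical, and Python's built-in `sqlite3` module is used here as a lightweight stand-in for the MySQL database mentioned in the text; only the field counts (3 prefix slots, 4 suffix slots) come from the description above. The sample rows are toy data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE prefix (id INTEGER PRIMARY KEY, form TEXT UNIQUE);
CREATE TABLE suffix (id INTEGER PRIMARY KEY, form TEXT UNIQUE);
CREATE TABLE word   (id INTEGER PRIMARY KEY, form TEXT UNIQUE);
CREATE TABLE derivation (
    derivative_id INTEGER REFERENCES word(id),
    base_id       INTEGER REFERENCES word(id),
    prefix1 INTEGER REFERENCES prefix(id),
    prefix2 INTEGER REFERENCES prefix(id),
    prefix3 INTEGER REFERENCES prefix(id),
    suffix1 INTEGER REFERENCES suffix(id),
    suffix2 INTEGER REFERENCES suffix(id),
    suffix3 INTEGER REFERENCES suffix(id),
    suffix4 INTEGER REFERENCES suffix(id)
);
"""


def build_demo() -> sqlite3.Connection:
    """Create the hypothetical schema in memory and load three toy words."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO word(id, form) VALUES (?, ?)",
                   [(1, "programa"), (2, "reprogramabil"), (3, "reprograma")])
    db.execute("INSERT INTO prefix(id, form) VALUES (1, 're')")
    db.execute("INSERT INTO suffix(id, form) VALUES (1, 'bil')")
    db.executemany(
        "INSERT INTO derivation(derivative_id, base_id, prefix1, suffix1) "
        "VALUES (?, ?, ?, ?)",
        [(2, 1, 1, 1), (3, 1, 1, None)])
    return db


def family_of(db: sqlite3.Connection, root: str) -> list:
    """Extract the lexical family recorded for a root/stem."""
    rows = db.execute(
        "SELECT d.form FROM derivation r "
        "JOIN word d ON d.id = r.derivative_id "
        "JOIN word b ON b.id = r.base_id "
        "WHERE b.form = ? ORDER BY d.form", (root,))
    return [form for (form,) in rows]


family = family_of(build_demo(), "programa")
```

Extraction by prefix or suffix would follow the same pattern, joining on the corresponding affix columns instead of the base word.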

In conclusion, the structure and content of the extended RRTLN can serve as the basis for adding an option to generate derivatives, which is discussed in the following sections.

The consonantal and vocalic alternations

The problem of derivation consists not only in detecting the derivational rules for each affix, but also in examining the concrete modifications that can appear at the level of the root/stem or the affixes. Having processed eDCD with an iterative algorithm of linear complexity, we established that no modifications occur in derivation with the following frequent prefixes: ne-, re-, pre-, anti-, auto-, supra-, and de-.

| The model | Number of derivatives with modifications | Number of derivatives without modifications |
| --- | --- | --- |
| derivative=prefix+root/stem | 224 | 1134 |
| derivative=root/stem+suffix | 6381 | 6809 |
| derivative=prefix+root/stem+suffix | 191 | 632 |
| TOTAL | 6796 | 8575 |

Table 2 – The dependence of the affixes to modifications in the process of derivation.

Possible modifications are so varied that it is difficult to describe all of them, but we can try to classify them (Petic, 2010a).

Let L be the alphabet of a natural language, V the set of vowels, C the set of consonants, r a root/stem, p a prefix and s a suffix; square brackets [ ] indicate that the enclosed morpheme need not be present in the word structure. From the analysis of the process of derivation we conclude that the following modification rules apply:

$[p]r = [p]r'v \to [p]r's$, where $v \in V$; for example, aduna → adun(a)ător corresponds to r = aduna, r′ = adun, v = a, s = ător.

$[p]rs = [p]r\,l\,s' \to [p]rs'$, where $l \in L$; for example, bîntui → bîntui(e)ală corresponds to r = bîntui, s = eală, l = e, s′ = ală.

$[p]rs = [p]r\,l_1 l_2\,s' \to [p]rs'$, where $l_1, l_2 \in L$; for example, încleia → încleia(ea)lă corresponds to r = încleia, l1 = e, l2 = a, s = eală, s′ = lă.
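The three rules above amount to eliding letters at the boundary between stem and suffix. A minimal sketch follows; the function and parameter names are hypothetical, and the caller is assumed to know which rule applies to a given stem and suffix.

```python
VOWELS = set("aăâeiîou")


def attach_suffix(stem: str, suffix: str,
                  drop_stem_vowel: bool = False,
                  drop_suffix_letters: int = 0) -> str:
    """Attach a suffix, optionally eliding letters at the morpheme boundary.

    drop_stem_vowel     -- rule 1: remove the stem-final vowel v (r = r'v)
    drop_suffix_letters -- rules 2 and 3: remove the first 1 or 2 letters of s
    """
    if drop_stem_vowel:
        if not stem or stem[-1] not in VOWELS:
            raise ValueError("rule 1 needs a stem ending in a vowel")
        stem = stem[:-1]
    if not 0 <= drop_suffix_letters <= 2:
        raise ValueError("at most two suffix letters can be dropped")
    return stem + suffix[drop_suffix_letters:]


rule1 = attach_suffix("aduna", "ător", drop_stem_vowel=True)
rule2 = attach_suffix("bîntui", "eală", drop_suffix_letters=1)
rule3 = attach_suffix("încleia", "eală", drop_suffix_letters=2)
```

The three calls reproduce the examples given with the rules: adunător, bîntuială and încleială.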

Besides the modifications mentioned above, there is a group of modifications that consist of the regular replacement of one letter by another (for example pară – perişor) or of one group of letters by another group of letters (for example mustaţă – mustăcios) in the process of derivation. They are called consonantal/vocalic alternations.

In the general case derivation can be described by a set of rules of the form:

$\alpha_1\beta_1\alpha_2\beta_2 \dots \alpha_n\beta_n \to [p]\,\alpha_1\beta'_1\alpha_2\beta'_2 \dots \alpha_n\beta'_n\,[s]$, (1)

where $|\alpha_i| \ge 0$, $|\beta_i| \ge 0$, $|\beta'_i| \ge 0$, $i = 1, \dots, n$, and in the case of consonantal/vocalic alternations one of the following three relations holds: $|\beta_i| = |\beta'_i|$, $|\beta_i| > |\beta'_i|$, $|\beta_i| < |\beta'_i|$ for $i = 1, \dots, n$.

Examples: atrage → atrăgător, the alternation a-ă, corresponds to the case $|\beta_i| = |\beta'_i|$; săgeată → săgetaș, the alternation ea-e, corresponds to the case $|\beta_i| > |\beta'_i|$; and deştept → deşteaptă, the alternation e-ea, corresponds to the case $|\beta_i| < |\beta'_i|$.

Thus, we can classify the alternations in the following classes:

vocalic – 83%;

consonantal – 10%;

mixed – 7%.

Counting the alternations, we found that the most frequent are a-ă (60%), ea-e (8%), oa-o (6%), t-ţ (3%) and d-z (3%); the rest occur in insignificant numbers. So only 5 types of alternations account for 80% of all alternations.
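The three classes can be told apart by looking at the letters involved in an alternation. The sketch below assumes the Romanian vowel letters a, ă, â, e, i, î, o, u and treats every other letter as a consonant; the function name is hypothetical. The last lines simply re-add the five percentages quoted above.

```python
VOWELS = set("aăâeiîou")


def classify_alternation(old: str, new: str) -> str:
    """Label an alternation old-new as vocalic, consonantal or mixed."""
    letters = set(old + new)
    if not letters:
        raise ValueError("empty alternation")
    if letters <= VOWELS:
        return "vocalic"                 # only vowels are involved
    if letters.isdisjoint(VOWELS):
        return "consonantal"             # only consonants are involved
    return "mixed"


# Shares of the five most frequent alternations, as quoted in the text.
top_five = {"a-ă": 60, "ea-e": 8, "oa-o": 6, "t-ţ": 3, "d-z": 3}
coverage = sum(top_five.values())
```

Applied to the five most frequent alternations, the first three are vocalic and the last two consonantal.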

All the established alternations are attested within certain contexts (a context is a string, part of a word, that triggers the alternation), which makes it possible to find the derivatives that have vocalic or consonantal alternations.

Multilingual approaches of derivatives generation

Establishing candidate words for derivation

To develop algorithms for the automatic generation of derivatives it is necessary to determine whether a word is a candidate for derivation. At this stage we verify whether a sequence of characters is a correct word of the Romanian language and whether other derivatives could be generated from it. A common feature of systems built for different languages is the use of computational linguistic resources as the starting point for the automatic generation of words (Carota, 2006). However, in the automatic derivation algorithm the computational linguistic resources are used not to extract derived words but to generate derivatives. The resources also contribute to validating the automatically generated derived words. Thus the initial sequence of characters is first looked up in RRTLN. If it is not found there, it is verified using Internet resources (Petic et al., 2011).

Once the set of candidate words is fixed, the derivation models are applied. What distinguishes the presented approaches from those for other languages is the lack of semantic information in the computational resources we worked with. The most important derivation patterns that do not require semantic information are: affix substitution, derivatives projection, formal models of derivation and derivational constraints.

Affixes substitution

The idea is inspired by Serbian derivational morphology (Duško and Krstev, 2005), where the generated derivatives have predictable meanings. In our case, suffix substitution changes the gender, e.g., muncitor ↔ muncitoare (eng. worker), while prefix substitution changes the meaning, e.g., antebelic ↔ postbelic (eng. pre-war – post-war).

Affix substitution is not specific only to Romanian and Serbian derivational morphology; it also occurs in other European languages, e.g., Spanish (amortizar-amortizable, eng. to amortize-redeemable), French (revoir-prévoir, eng. revise-foresee), Russian (прочитать-дочитать, eng. read – read till the end), etc.

In the general case of suffix substitution, let $x_1$ be a word of the form $x_1 = \omega\alpha_1$ with the suffix $\alpha_1$. After the substitution $\alpha_1 \to \alpha_2$ we obtain the word $x_2 = \omega\alpha_2$, e.g., corigenţă-corigent (eng. the failed - second examination). In the case of prefix substitution, let $x_1$ be a word of the form $x_1 = \alpha_1\omega$, where $\alpha_1$ is a prefix. After the substitution $\alpha_1 \to \alpha_2$ we obtain the word $x_2 = \alpha_2\omega$, where $x_2$ is the obtained derivative, e.g., închide-deschide (eng. to close – to open) (Petic, 2011).

Based on the above, a new and original algorithm of O(n·log m) complexity was developed, where n is the number of words in the lexicon and m is the number of affix pairs for substitution. The algorithm examines the words in the lexicon and substitutes the affixes in the cases that fall under the rules given above.
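Affix substitution can be sketched as follows. This is a simplified illustration, not the O(n·log m) algorithm itself: it tries every affix pair on every word, and the function name and the affix pairs in the example are assumptions. The output is a list of candidates that still need validation.

```python
from typing import Dict, Iterable, List, Tuple


def substitute_affixes(lexicon: Iterable[str],
                       suffix_pairs: Dict[str, str],
                       prefix_pairs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source word, candidate) pairs produced by affix substitution."""
    candidates = []
    for word in lexicon:
        for old, new in suffix_pairs.items():
            # x1 = w + old  ->  x2 = w + new, with a non-empty remainder w
            if word.endswith(old) and len(word) > len(old):
                candidates.append((word, word[:-len(old)] + new))
        for old, new in prefix_pairs.items():
            # x1 = old + w  ->  x2 = new + w, with a non-empty remainder w
            if word.startswith(old) and len(word) > len(old):
                candidates.append((word, new + word[len(old):]))
    return candidates


pairs = substitute_affixes(["muncitor", "antebelic"],
                           suffix_pairs={"tor": "toare"},
                           prefix_pairs={"ante": "post"})
```

The two toy words yield muncitoare and postbelic respectively.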

Formal models

Formal models of derivation rules are the basis for generating derivative words with a high degree of accuracy. A similar approach to derivational morphology exists for French (Fiammetta and Dal, 2000). While the French system works with only 3 suffixes (-able, -ité, -is(er)) for which rules have been found, the Romanian derivational morphology presented here works with 3 prefixes (ne-, re-, in-/im-) and 2 suffixes (-re, -iza).

Rules for prefixes:

re-: $[\omega]_{inf} \to [\text{re}\,[\omega]_{inf}]_{inf}$

ne-: $[\omega' b]_{adj} \to [\text{ne}\,[\omega' b]_{adj}]_{adj}$, where $b \in$ {-tor, -bil, -os, -at, -it, -ut, -ind, -înd}

in-/im- = g: $[\omega' b]_{adj} \to [g\,[\omega' b]_{adj}]_{adj}$, where $b \in$ {-bil, -ent, -ant}

Rules for suffixes:

-re: $[\omega]_{inf} \to [[\omega]_{inf}\,\text{re}]_{subst}$

-iza: $[\omega' b a]_{adj} \to [[\omega' b]_{adj}\,\text{iza}]_{inf}$

The linear algorithm examines the words in the lexicon and attaches the affixes to those which correspond to the above-mentioned rules.
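Three of the rules (re-, ne- and -re) are simple enough to sketch in a single pass over the lexicon. The category labels "inf", "adj" and "subst" follow the rules above; the function name and the representation of the lexicon as a word-to-category dictionary are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# Endings b that license the prefix ne- on an adjective.
NE_ENDINGS = ("tor", "bil", "os", "at", "it", "ut", "ind", "înd")


def apply_formal_rules(lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (derivative, category) pairs licensed by the re-, -re and ne- rules."""
    out = []
    for word, cat in lexicon.items():
        if cat == "inf":
            out.append(("re" + word, "inf"))      # re- on an infinitive
            out.append((word + "re", "subst"))    # -re turns it into a noun
        elif cat == "adj" and word.endswith(NE_ENDINGS):
            out.append(("ne" + word, "adj"))      # ne- on a licensed adjective
    return out


derived = apply_formal_rules({"citi": "inf", "cunoscut": "adj", "bun": "adj"})
```

The adjective bun is skipped because it does not end in one of the licensed endings, while cunoscut (ending -ut) receives ne-.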

Derivatives’ projection

The projection of derivatives is a method of forming prefixed words from the suffixed words of the same root. According to Spanish researchers, the Spanish verb amortizar can be derived with the prefix des-, giving desamortizar. The word amortizar can also be derived with the suffixes -cion and -able. So the derivative with the prefix des- can itself be derived with the suffixes -cion and -able. The hypothesis is that derivatives can inherit/project the suffixed derivatives of the stem whose prefixation has already been realized (Santana, 2004). This method is not specific to Spanish and can also be applied to other languages; e.g., in English from the root read we can form the derivatives readable and unread, which is why it is possible to form the derivative unreadable.

Generalising the above, we conclude that the mechanism can be presented formally for Romanian derivational morphology. Let $\omega$ be a Romanian word, $\alpha$ a prefix and $\beta$ a suffix. Then the following relations hold (Petic, 2011):

$(\omega \to \alpha\omega) \wedge (\omega \to \omega\beta) \Rightarrow (\omega \to \alpha\omega\beta)$,

for example, (a lucra → a prelucra) ∧ (a lucra → lucr(a)ător) ⇒ (a lucra → prelucr(a)ător);

$(\omega \to \alpha\omega) \wedge (\omega \to \alpha\omega\beta) \Rightarrow (\omega \to \omega\beta)$,

for example, (a capitula → a recapitula) ∧ (a capitula → recapitulaţie) ⇒ (a capitula → capitulaţie);

$(\omega \to \alpha\omega\beta) \wedge (\omega \to \omega\beta) \Rightarrow (\omega \to \alpha\omega)$,

for example, (a centraliza → descentralizator) ∧ (a centraliza → centralizator) ⇒ (a centraliza → descentraliza).

By examining the words in the lexicon and checking them against the relations above, a new and original algorithm was developed that generates derivatives by affix projection.
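The first of the three relations can be sketched as follows. The sketch assumes that the attested prefixes and the attested suffixed derivatives of each base are already available as dictionaries (the names `prefixed` and `suffixed` are hypothetical). Because the suffixed forms are stored fully spelled out, boundary modifications such as lucra → lucrător carry over to the projected form.

```python
from typing import Dict, Set


def project(prefixed: Dict[str, Set[str]],
            suffixed: Dict[str, Set[str]]) -> Set[str]:
    """From base -> prefix+base and base -> base+suffix, propose prefix+base+suffix.

    prefixed maps a base to the prefixes attested with it;
    suffixed maps a base to its attested suffixed derivatives.
    """
    proposals = set()
    for base, prefixes in prefixed.items():
        for derived in suffixed.get(base, ()):
            for prefix in prefixes:
                proposals.add(prefix + derived)
    return proposals


romanian = project({"lucra": {"pre"}}, {"lucra": {"lucrător"}})
english = project({"read": {"un"}}, {"read": {"readable"}})
```

The other two relations could be handled analogously by stripping an attested prefix instead of adding one.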

Derivational constraints

Where there is no clear model according to which derivatives could be generated, some preconditions, called derivational constraints, come into play. The most common derivational constraints are: parts of speech, inflection classes, affixes, the changes that take place in the case of derivation and the letters preceding/succeeding prefixes/suffixes. So, derivational constraints are schemes with several parameters that narrow down the classes of roots and affixes that can combine to form derivatives, e.g., functions of the form:

f: {wrd, pos, mod, sla, fgw, mvca} → derivative

where wrd is the word to be derived, pos the part of speech of wrd, mod the model of derivation, sla the set of letters to which the affix is attached, fgw the inflection group of wrd, and mvca the modifications and vocalic or consonantal alternations (Petic, 2011).

Examining the words in the lexicon and verifying them in correspondence with relations above, a new and original algorithm that generates derivatives by derivatives constraints has been developed.

The automatic derivation of words with the prefix des- and the suffixes -bil and -ime can serve as examples of generating derivatives by derivational constraints:

f: {a spinteca, verb, des<verb>, ...s..., V14, double consonant avoiding} → de(s)spinteca.

f: {a programa, verb, <verb>bil-itate, ...a..., V201, ...} → programabilitate.

f: {crud, adjectiv, <adjectiv>ime, ..., A3, consonantal alternation d - z} → cru(d)zime.

Therefore, derivational constraints necessary for the automatic generation process do not depend only on the type of the affix, but also on the value of the prefix or suffix. Moreover, each language has its own peculiarities in the derivation of words.
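A constraint function of this kind can be sketched in Python as follows. The sketch is a simplification under stated assumptions: it covers only the parameters wrd, pos, mod and mvca, it leaves out sla and fgw, it takes the verb without the infinitive marker a, and it writes the model for cruzime as <adjectiv>ime. The names derive, avoid_double_consonant and alternate_d_z are hypothetical.

```python
VOWELS = set("aeiouăâî")

def avoid_double_consonant(prefix, stem, suffix):
    """de(s)spinteca: keep one consonant where prefix and stem would double it."""
    if prefix and stem and prefix[-1] == stem[0] and stem[0] not in VOWELS:
        return stem[1:]
    return stem

def alternate_d_z(prefix, stem, suffix):
    """cru(d)zime: stem-final d becomes z before a suffix starting with i."""
    if stem.endswith("d") and suffix.startswith("i"):
        return stem[:-1] + "z"
    return stem

def derive(wrd, pos, mod, mvca=()):
    """Build a derivative from wrd according to the model mod.

    mod is a pattern such as 'des<verb>' or '<adjectiv>ime'; the slot
    names the part of speech the model accepts. mvca is a sequence of
    alternation rules applied to the stem. Returns None when the part
    of speech does not match the model.
    """
    slot = "<" + pos + ">"
    prefix, found, suffix = mod.partition(slot)
    if not found:
        return None
    suffix = suffix.replace("-", "")  # 'bil-itate' is written with a morpheme hyphen
    stem = wrd
    for rule in mvca:
        stem = rule(prefix, stem, suffix)
    return prefix + stem + suffix

print(derive("spinteca", "verb", "des<verb>", [avoid_double_consonant]))  # despinteca
print(derive("programa", "verb", "<verb>bil-itate"))                       # programabilitate
print(derive("crud", "adjectiv", "<adjectiv>ime", [alternate_d_z]))        # cruzime
```

Passing the alternation rules as data keeps the generator itself independent of any particular affix, which matches the observation that the constraints depend on the value of the affix and not only on its type.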

Validation of the generated derivatives

Automatic derivation is an overgenerating mechanism, which is why the generated words need validation. One validation method is to check manually that every newly generated derivative conforms to semantic and morphological rules. Even when the validation is performed by a domain specialist, manual work has disadvantages: it takes considerable time and is prone to mistakes. This method of validation is therefore inefficient (Cojocaru et al., 2009).

Another method of validation is to look the derivatives up in existing electronic documents, of which there are different types.

Validating words against existing corpora, which consist of verified documents, seems to be the best solution. For this to be a complete solution to new word validation, the corpus must be representative, with a large number of words from different domains.

On the other hand, documents on the Internet are not verified, which is why they are not credible. To make the search more precise, the Internet search, performed with the Google.com search engine, should be restricted to documents written in a specified language. In addition, it is necessary to ensure the following: the possibility of excluding word segmentation, and the part of speech of the derivatives.

This validation tool divides the generated derivatives into three categories. The first contains words that are not found by the Google.com search engine. The second consists of derivatives whose frequency is below a limit n, in our case n=1000. Derivatives that are more frequent than the limit n are placed in the third group. The classification assumes that words occurring more often than the frequency limit n are surely valid. Those in the second group may be valid but should be verified by specialists in linguistics. Derivatives that are not found at all are considered invalid (Petic et al., 2011). The classification is therefore a mixed method of validation.
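The three-way classification can be sketched in Python as follows. The hit counts are assumed to have been collected beforehand, so no search engine is queried here. The function name classify, the treatment of a frequency exactly equal to n as belonging to the second group, and the sample counts are all assumptions made for illustration.

```python
def classify(frequencies, n=1000):
    """Split candidate derivatives into three groups by frequency.

    frequencies: dict mapping each candidate word to its hit count.
    Returns (not_found, needs_manual_check, valid).
    """
    not_found, needs_manual_check, valid = [], [], []
    for word, freq in frequencies.items():
        if freq == 0:
            not_found.append(word)            # first group: considered invalid
        elif freq <= n:
            needs_manual_check.append(word)   # second group: left to a linguist
        else:
            valid.append(word)                # third group: accepted as valid
    return not_found, needs_manual_check, valid

# Invented hit counts, for illustration only
sample = {"cruzime": 25000, "albime": 340, "verzime": 0}
print(classify(sample))  # (['verzime'], ['albime'], ['cruzime'])
```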

To illustrate the validation technique, derivation with the suffix -ime was chosen. Applying the algorithm for automatic generation of derivatives with the suffix -ime produced a list of 2841 possible derivatives. Checking the frequency of these words showed that only 120 of them have a nonzero frequency, which constitutes 4.22%. Table 3 presents the results of the validation process.

Value of n | Number of generated derivatives | Number of valid derivatives | Accuracy of validation
0 | 2721 | |
1-100 | | | 48%
101-1000 | | | 41%
1001-10000 | | | 25%
10001-... | | | 93%
TOTAL 1-... | 120 | | 57%

Table 3. Statistics concerning validated derivatives with the suffix -ime.

The data in Table 3 support the idea that there is no universal validation method based on Web documents. Such a method does, however, filter out a significant number of derivatives, while the rest need manual verification. In the case of the suffix -ime, 38 valid derivatives were registered out of 84, which constitutes 45%, but for frequencies greater than 1000 the accuracy is 83%. Accuracy is thus better at higher frequencies, although this does not guarantee a perfect result, because Web documents are not very credible from a linguistic point of view (Petic, 2012).

Algorithm of automatic lexical derivation

Drawing on the information above and on the features of the derivation process, we describe the algorithm for automatic generation of Romanian lexical families using the following notation:

cvt – the word from which the lexical family is generated;
DRRTLN – the set of words in the lexicon RRTLN;
DeDCD – the set of derivatives of the word cvt existing in eDCD;
DSA – the set of derivatives formed by affix substitution (applied to the words in DeDCD);
DPD – the set of derivatives formed by derivative projection (applied to the words in DSA ∪ DeDCD);
DCL – the set of derivatives formed by derivational constraints (applied to the words in DPD ∪ DSA ∪ DeDCD);
DRD – the set of derivatives formed by formal derivation rules (applied to the words in DCL ∪ DPD ∪ DSA ∪ DeDCD);
Dgen = DCL ∪ DPD ∪ DSA ∪ DRD – the set of words obtained by automatic generation;
DNEVAL – the set of words considered invalid because they are not found in Internet documents;
DSEMVAL – the set of words that require manual validation;
DVAL – the final set of words that are frequent enough in Internet documents to be considered valid, which represents the lexical family of the word cvt.

Given this notation, the corresponding algorithm (Petic, 2011) can be written in a conventional language:

Input: cvt
Output: DVAL

1. if ({cvt} ∩ DRRTLN ≠ ∅) then goto 3
   else goto 2;
2. if (cvt in Internet)
   then read_part_of_speech(cvt); goto 3;
   else DVAL := {}; goto 7;
3. Generating sets of words
   3.1. cvt ⇒ DeDCD;
   3.2. DeDCD ⇒ DSA;
   3.3. DSA ⇒ DPD;
   3.4. DPD ⇒ DCL;
   3.5. DCL ⇒ DRD;
4. Dgen := DCL ∪ DPD ∪ DSA ∪ DRD
5. Automatic validation
   5.1. DNEVAL := nonval(Dgen);
   5.2. DSEMVAL := semval(Dgen);
   5.3. DVAL := val(Dgen);
6. DVAL := DVAL + manualval(DSEMVAL)
7. Write(DVAL);
endalgorithm
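The control flow of the algorithm can be sketched in Python as follows. This is an illustrative skeleton rather than the authors' program: the four generators (affix substitution, derivative projection, derivational constraints, formal rules), the eDCD lookup, the Internet check, the frequency source and the manual validator are passed in as hypothetical callables, the part-of-speech reading of step 2 is omitted, and words already present in the pool are excluded from Dgen by assumption.

```python
def lexical_family(cvt, lexicon, in_internet, lookup_edcd, generators,
                   frequency, manual_ok, n=1000):
    """Return DVAL, the validated lexical family of the word cvt.

    generators: callables in the order DSA, DPD, DCL, DRD; each receives
    the pool of all derivatives gathered so far and returns candidates.
    frequency: callable giving the number of Internet hits for a word.
    manual_ok: callable standing in for validation by a linguist.
    """
    # Steps 1-2: the word must be in the lexicon or attested on the Internet
    if cvt not in lexicon and not in_internet(cvt):
        return set()

    # Step 3: each generator works on everything produced before it
    pool = set(lookup_edcd(cvt))          # DeDCD
    d_gen = set()
    for generate in generators:
        produced = set(generate(pool)) - pool
        d_gen |= produced                 # step 4: Dgen is the union of new words
        pool |= produced

    # Step 5: automatic validation by frequency
    freq = {word: frequency(word) for word in d_gen}
    d_val = {w for w, f in freq.items() if f > n}
    d_semval = {w for w, f in freq.items() if 0 < f <= n}
    # Words with frequency 0 form DNEVAL and are dropped

    # Step 6: manual validation of the uncertain group
    return d_val | {w for w in d_semval if manual_ok(w)}
```

Since each generator makes one pass over a finite pool, the cost of the generation phase is dominated by the generators themselves, while the manual check in step 6 is the only part that depends on human effort.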

The algorithm has at most polynomial complexity and imposes no special requirements on the computing system. Both the time and the memory needed to generate derivatives are insignificant compared with the process of manual validation, whose cost cannot be established in advance. Counting all generated derivatives that were validated gives 11191 derivatives with prefixes and suffixes, which represents 87.27% of those generated (Table 4).

Affix type | Generated | Validated | % accuracy of validation
Prefixes | 10093 | 8839 | 87.57
Suffixes | 2730 | 2352 | 86.15
Total | 12823 | 11191 | 87.27

Table 4. Final results regarding the number of automatically generated derivatives.

Table 4 shows a much larger number of derivatives obtained by prefixation than by suffixation. This can be explained by the extremely productive prefix ne-. It is also worth noting that the accuracy percentages for derivatives generated by prefixation and by suffixation are similar.

Computational linguistic resources completion

Applying the proposed methods and running the programs developed from them generated 11191 new words that are considered valid. Inflecting these derivatives produced 123106 word forms. In this way we obtained a lexicon that is saturated from the point of view of derivation, compared with the lexicons that were processed.

Category | RRTLN initial | RRTLN derivationally saturated
Lemma words | 97803 | 108994
Derivatives | 15692 | 26883
Inflected words | 1075876 | 1198982

Table 5. Statistics on the completion of the RRTLN database.

Thus, following the scheme presented in Figure 1, the volume of RRTLN resources grew by 11.4%. RRTLN now consists of 108994 lemma words, of which 26883 are derivatives, forming 1198982 inflected words (Table 5).
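The figures in Table 5 can be checked with a few lines of Python. The sketch only recombines numbers stated in the text; the variable names are hypothetical.

```python
# Initial RRTLN counts (Table 5)
initial = {"lemmas": 97803, "derivatives": 15692, "inflected": 1075876}

new_derivatives = 11191   # validated derivatives (Table 4)
new_inflected = 123106    # word forms obtained by inflecting them

# Every new derivative is also a new lemma
saturated = {
    "lemmas": initial["lemmas"] + new_derivatives,
    "derivatives": initial["derivatives"] + new_derivatives,
    "inflected": initial["inflected"] + new_inflected,
}
print(saturated)
# {'lemmas': 108994, 'derivatives': 26883, 'inflected': 1198982}

growth = new_derivatives / initial["lemmas"] * 100
print(f"{growth:.1f}%")  # 11.4%
```

The same ratio, about 11.4%, is obtained when the growth is measured on inflected words (123106 out of 1075876).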

Conclusion and perspectives

Studies of the derivation process lead us to conclude that no effective algorithm for automatic derivation can be proposed in general, but that certain models of derivation can be identified for which such algorithms can be constructed.

Validation of new derivatives is one of the steps in automatic derivation that raises many questions. Since it is difficult to set a criterion for validating words by means of the Internet, it is important to use a digital dictionary of derivatives, which allows the morphemes of each derivative to be established together with their type (prefix, root and suffix).

Acknowledgments

This work was carried out as part of project ref. nr. 12.819.18.09A, supported by the Supreme Council for Science and Technological Development of the Republic of Moldova.
