Mining Functional Dependencies In Database


Dhanyamol Antony

Dept. of Computer Science

SCT College of Engineering

Trivandrum, India

[email protected]

Rejimoan R

Dept. of Computer Science

SCT College of Engineering

Trivandrum, India

[email protected]

ABSTRACT— The determination of functional dependencies is an important part of designing databases in the relational model and of database normalization and denormalization. It matters in knowledge discovery, database semantics analysis, database design, and data quality assessment. When the semantics of the data are unknown or the relationships in the data are difficult to define, the problem we face is how to mine all the full functional dependencies hidden in the data. Because of the importance of dependency discovery, this paper reviews different methods for discovering functional dependencies in relational databases, giving an overview of four top-down search approaches for mining functional dependencies.

KEYWORDS: Functional dependency, Rough set, Formal Concept Analysis.

Introduction

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them based on the functional dependencies (FDs) that exist among the prime and non-prime attributes. It is used for creating a good database design with minimal redundancy, maximal consistency, and a minimum of anomalies. The process of normalization was first formalized by E. F. Codd. It is carried out by conducting a series of tests on a database to check whether it satisfies certain conditions. Initially three normal forms were introduced, and up to eight normal forms have since been proposed; in practice, however, databases are normalized only up to third normal form. Except for 1NF, the normal forms rely on functional dependencies among the attributes.

The definition of a functional dependency (FD) states that, given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R (written X -> Y), if and only if each X value is associated with at most one Y value. That is, given a tuple and the values of the attributes in X, one can uniquely determine the corresponding value of the Y attribute. It is customary to call X the determinant set and Y the dependent attribute.

Several properties have been derived for functional dependencies, and the most important among them are Armstrong's axioms, which are very helpful for database normalization.

Subset Property (Axiom of Reflexivity): If Y is a subset of X, then X -> Y

Augmentation (Axiom of Augmentation): If X -> Y, then XZ -> YZ

Transitivity (Axiom of Transitivity): If X -> Y and Y -> Z, then X -> Z

By the repeated application of these rules, all functional dependencies can be identified.

Discovery of Functional Dependencies

Several different approaches have been proposed to determine the functional dependencies among attributes. Four of them are described here:

Rough set approach

Formal Concept analysis

FD_Mine

TANE

Rough set approach

Rough set theory is a mathematical approach to imprecision, vagueness, and uncertainty, proposed by Zdzisław Pawlak in 1982. Rough sets are useful for domains where the data collected about the domain objects are imprecise and/or incomplete. The indiscernibility relation [1] is the mathematical basis of rough set theory, and its two basic operations are the upper and lower approximations. The indiscernibility relation is defined as

I(B) = {(x, y) ∈ U × U : fa(x) = fa(y) for all a ∈ B} (1)

where U is a finite set of objects called the universe and A is a finite set of attributes. A set of values Va is associated with every attribute a ∈ A, and each attribute a determines a function fa : U → Va.

Each subset of attributes determines a classification of all objects into classes having the same description in terms of these attributes.

The lower and upper approximation operations on a rough set are defined as

B_*(X) = {x ∈ U : B(x) ⊆ X} (2)

B^*(X) = {x ∈ U : B(x) ∩ X ≠ ∅} (3)

where B_*(X) and B^*(X) are called the B-lower and the B-upper approximation of X respectively, and B(x) denotes the equivalence class of x under I(B).

Dependency can now be defined using rough set concepts as follows. Let B and C be subsets of A. B is said to fully depend on C if and only if I(C) ⊆ I(B), that is, the partition generated by C is finer than that generated by B. For example, to find out whether the dependency C → B holds, compare the indiscernibility relations I(B) and I(C) and check whether I(C) is finer than I(B); intuitively, the finer partition I(C) has at least as many equivalence classes as I(B). The degree of dependency can be calculated using the following formula:

k = |POS_C(B)| / |U| (4)

where POS_C(B) = ∪ {C_*(X) : X ∈ U/I(B)} (5)

is the positive region of B with respect to C, i.e. the union of the C-lower approximations of the equivalence classes of I(B). If k = 1, it means C → B.

The functional dependency mining algorithm based on rough sets can be summarized in two steps:

Generate hypotheses about possible functional dependencies from the data sample.

Verify these hypotheses against the data sample, row by row.

In the first step the aim is to minimize the number of hypotheses. A top-down structured search is used to generate the hypotheses layer by layer: different combinations of attributes are selected as the possible left-hand side (lhs) of an FD, called Ls, and the layer number equals the number of attributes in the combination. Suppose B' is a proper subset of B, and B* is a subset of B that has only a single attribute. The hypothesised right-hand sides for B, written B(R), should be taken from A \ (B ∪ B'(R)), i.e. attributes already determined by a proper subset B' are excluded. Thus partial functional dependencies can be avoided to some extent.

To achieve the second goal, each hypothesis is validated row by row: if a row exists that violates the hypothesis, the hypothesis is not valid. When the relation has very many tuples this checking workload becomes heavy, so the comparison among tuples is simplified to a comparison among equivalence classes, using the indiscernibility relation. If it is found that |I(B)| = |U|, then B is a candidate key and determines all the other attributes, i.e. B → A \ B. Every table in a database contains at least one candidate key, and the classification makes candidate keys appear rapidly, which speeds up the mining process.

The rough set approach is thus a top-down structured search that discovers functional dependencies from hypotheses needing verification. It reduces the number of unnecessary hypotheses by removing partial dependencies, avoids testing redundant dependencies by pre-analysing already discovered ones, and improves verification efficiency by comparing the attributes' classification capacity.

Formal Concept Analysis

Formal Concept Analysis (FCA) can be used to mine functional dependencies in a relational database table. It explores the conceptual knowledge contained in a database by analysing the formal conceptual structure of the data [2]. The novelty of this method is that it uses an inverted index file to optimize the construction of the formal context of functional dependencies.

Formal Concept Analysis, introduced by [3], gives a mathematical formalization of the notion of a concept. It has proved to be a valuable tool for representing the knowledge contained in a database, for instance logical implications in datasets, and Hereth (2002) presents the relationship between FCA and functional dependencies. In this approach a relational database is defined as a tuple D := (dom, N), with dom being the domain of the database and N being the set of named data tables in the database. A data table is any element T ∈ ∪i∈N0 P(dom^i). The arity of T is the i ∈ N0 such that T ∈ P(dom^i), written arity(T). For a tuple t ∈ T we write t[j], with 1 ≤ j ≤ arity(T), to denote the j-th value of the tuple.

In database implementations the named perspective is used: the database schema is composed of the table names together with their attribute names. As an example, consider the relational schema of a university database, in which students are divided into groups, there can be many groups in one specialization, and students receive marks in different disciplines.

Specialization [SpecID, SpecName, Language]

Groups [GroupID, SpecID]

Students [StudID, GroupID, StudName, Email]

Disciplines [DiscID, DName, CreditNr]

Marks [StudID, DiscID, Mark]

In this example N is composed of the named data tables Specialization, Groups, Students, Disciplines, and Marks, and dom is the set of all attribute values of the tables. For each table the arity is the number of its attributes, so the Groups table has arity 2 and Students has arity 4. Let t be a tuple (row) of the Students table; then t[1] is the value of tuple t for StudID, t[2] is the value of GroupID in the corresponding row, and so on.

The first step in identifying the functional dependencies is to construct a new table structure with some significant tuples; that is, varied tuples with different styles of data are selected, in order to obtain as many conclusions as possible. To define the functional dependencies in a formal context, the following three definitions from [2], [6] are used.

Definition 1. Let T be a data table and X, Y ⊆ N. Then T fulfils the functional dependency X → Y if, for all tuples s, t ∈ T, x(s) = x(t) implies that also y(s) = y(t), where x and y denote the projections onto the attributes of X and Y.

Definition 2 (Power Context Family). A power context family K := (Kk)k∈N0 is a family of formal contexts Kk := (Gk, Mk, Ik) such that Gk ⊆ (G0)^k for k = 1, 2, …. The formal contexts Kk with k ≥ 1 are called relational contexts. The power context family K is said to be limited of type n ∈ N if K = (K0, K1, …, Kn); otherwise it is called unlimited.

Definition 3. The power context family K(D) resulting from the canonical database translation of the relational database D = (dom, N) is constructed in the following way: set K0 := (dom, ∅, ∅) and, for k ≥ 1, let Gk be the set of all k-ary tuples and Mk ⊆ N the set of all named data tables of arity k. The relation Ik is defined by (g, m) ∈ Ik :⇔ g ∈ m.

The construction of the formal context can be optimized by building an inverted index file for the values of every attribute. With inverted indexes, the number of rows in the data file of the formal context produced by this method is half of the corresponding number produced by Hereth's method in [6]. Thus the time needed to build the concept lattice for the functional dependencies is reduced and useless dependencies are eliminated.

An inverted index (or inverted file) is an index data structure with a sequence of (key, pointer) pairs where each pointer points to a record in a database which contains the key value in some particular field. For databases in which the records may be searched based on more than one field, multiple indices may be created.

A concept lattice is then built for the context of functional dependencies. The top of the lattice corresponds to tuple pairs whose corresponding attributes share no common values, so the generation of these pairs can be omitted. The bottom of the lattice consists of pairs of the form (t, t), where t is a tuple of the table; these have all attributes in common and can be omitted too, as they do not change the implications in the context. Finally, the functional dependencies can be generated from the resulting context.

FD_Mine

FD_Mine identifies equivalences among the attributes of database tables by exploiting Armstrong's axioms. These equivalences can be used to reduce the size of the dataset and the number of functional dependencies to be checked: the search skips FDs that are logically implied by already discovered FDs. FD_Mine uses a level-wise search, where the results from level k are used to explore level k+1. Initially, all FDs X -> Y in which X and Y are single attributes are identified at level 1 and stored in the FD_SET of the first level, termed F1; the set of candidates considered at this level is denoted L1. F1 and L1 are used to generate the candidates XiXj of L2. At level 2, all FDs of the form XiXj -> Y are identified and stored in the FD_SET F2. F1, F2, L1, and L2 are then used to generate the candidates of L3, and so on, until no candidates remain, i.e. Lk = ∅ for some k ≤ n-1.

The FD_Mine algorithm proceeds as follows. All candidates at level k are generated first, and the partitions of each candidate at level k are calculated. Then, for each candidate X ∈ Ck−1 at level k−1, all FDs of the form X → vi are checked, and the equivalence candidates for this level are generated. Finally, the pruning rules are applied to delete candidates from Ck, i.e. the discovered equivalences are used for pruning. FD_Mine prunes only redundant FDs that can be inferred from Fk using Armstrong's axioms.

There are four pruning rules that can be used when finding the FDs satisfied by a relation r(U) for candidates X and Y, where X ≠ ∅, Y ≠ ∅, and X, Y ⊂ U.

Pruning rule 1: If X ↔ Y is satisfied by r(U), then candidate Y can be deleted.

Pruning rule 2: If Y+ ⊆ X, then candidate X can be deleted.

Pruning rule 3: Given the closures X+ and Y+, when attempting to determine whether the FDs XY → vi, where vi ∈ U − XY, are satisfied by r(U), only the FDs XY → vi with vi ∈ U − X+Y+ need to be checked in r(U).

Pruning rule 4: If X → vi is satisfied by r(U) for all vi ∈ U − X, then candidate X can be deleted.

These four pruning rules are used to delete candidates at level k − 1 before generating the candidates at level k. The process continues until no candidates remain, yielding the functional dependencies that hold in the relation.

TANE

TANE [5] is an efficient technique for identifying functional dependencies in databases. It is based on partitioning the set of rows with respect to their attribute values, which makes testing the validity of functional dependencies fast even for a large number of tuples. The aim of this approach is to discover all minimal non-trivial FDs that hold in a relation r. This is done by considering sets of tuples that agree on some set of attributes: the dependency X → A is said to hold if all tuples that agree on X also agree on A, so the method checks whether the tuples agree on the right-hand side (rhs) whenever they agree on the left-hand side (lhs). Equivalence classes and partitions are the key tools here.

Two tuples t and u are equivalent with respect to a given set X of attributes if t[A] = u[A] for all A in X. Any attribute set X thus partitions the tuples of the relation into equivalence classes. The equivalence class of a tuple t ∈ r with respect to a given set X ⊆ R is denoted [t]X, i.e. [t]X = {u ∈ r | t[A] = u[A] for all A ∈ X}. The set πX = {[t]X | t ∈ r} of equivalence classes is a partition of r under X: πX is a collection of disjoint sets (equivalence classes) of tuples, such that each set has a unique value combination for the attribute set X and the union of the sets equals the relation r. The rank |π| of a partition π is the number of equivalence classes in π.

Partition refinement gives a direct view of functional dependencies. A partition π refines another partition π′ if every equivalence class of π is a subset of some equivalence class of π′; a functional dependency X → A holds if and only if πX refines π{A} [5]. This indicates that, in order to test whether X → A holds, it suffices to check whether |πX| = |πX∪{A}|: if πX refines π{A}, then πX∪{A} equals πX, and since πX∪{A} always refines πX, πX∪{A} cannot have the same number of equivalence classes as πX unless the two partitions are equal.

TANE starts the discovery of functional dependencies from singleton sets of attributes and works its way to larger attribute sets, checking only dependencies of the form X \ {A} → A, where A ∈ X. This ensures that only non-trivial dependencies are considered, and the small-to-large direction of the search guarantees that only minimal dependencies are output, which helps prune the search space efficiently. A level-wise search algorithm is used to find the minimal functional dependencies, reducing the computation on each level by using the results of the previous levels. To test whether a dependency X \ {A} → A is minimal, one must check whether Y \ {A} → A holds for some proper subset Y of X; this information is stored in the set C(Y) of rhs candidates of Y. For example, consider the set X = {A, B, C} where {C} → A is a valid dependency. Since {C} → A holds, A ∉ C({A, C}) = C(X \ {B}), which tells us that {B, C} → A is not minimal.

TANE also exploits the fact that if C(X) = ∅, then C(Y) = ∅ for all supersets Y of X; no dependency of the form Y \ {A} → A can then be minimal, so the set Y need not be processed at all. The breadth-first search in the set containment lattice can use this information effectively. Improved rhs+ candidates C+(X), which prune the search space even more effectively, are also used:

C+(X) = {A ∈ R | ∀B ∈ X : X \ {A, B} → B does not hold}.

Note that A may equal B in this definition. The rhs+ candidates can be used to test the minimality of a dependency, as the following lemma from [5] shows.

Lemma 1. Let A ∈ X and let X \ {A} → A be a valid dependency. The dependency X \ {A} → A is minimal if and only if, for all B ∈ X, we have A ∈ C+(X \ {B}).

The rhs+ candidates have advantages over the plain rhs candidates; in particular, a B may be encountered for which A ∉ C+(X \ {B}), so the checking can stop earlier, saving some time.

FIGURE 1. A pruned set containment lattice for { A, B, C, D}.

The figure above shows the pruned set containment lattice for {A, B, C, D}. Only the bold parts are accessed by the algorithm, since the element B is deleted.

The method can be summarized in the following way [5]. To find all the minimal non-trivial functional dependencies, the set containment lattice is searched in a level-wise manner. A level Ll is the collection of attribute sets of size l such that the sets in Ll can potentially be used to construct dependencies. The search starts with L1 = {{A} | A ∈ R} and computes L2 from L1, L3 from L2, and so on. The dependencies at level Ll are calculated, the search space is pruned by deleting sets from Ll, and then the candidates of level Ll+1 are generated. This procedure continues until all FDs have been explored.

Conclusion

In this paper we reviewed methods for finding the functional dependencies that exist in a relation. The dependency discovery problem has a search space exponential in the number of attributes involved in the data. In most cases, however, data contain FDs with only one or a few attributes on the lhs. All of the above-mentioned methods are based on a top-down approach.

The rough set approach discovers functional dependencies from hypotheses that are then verified; it reduces the number of unnecessary hypotheses by removing partial dependencies and avoids testing redundant dependencies by pre-analysing existing ones. The FCA method can construct the context of functional dependencies from any type of database table, without introducing the tuples by hand, and it is also capable of proposing the structure of the database tables. FD_Mine and TANE identify the FDs in a level-wise manner, and both use the partition method. TANE works best when the discovered dependencies are small, since the level-wise method starts its search from small dependencies. Both TANE and FD_Mine use the essential pruning rules, but FD_Mine additionally exploits symmetric FDs (equivalences).

References

[1] Ying Qu and Xiao-bing Fu: "Rough Set Based Algorithm of Discovering Functional Dependencies for Relational Database" (2008)

[2] Katalin Tunde Janosi Rancz and Viorica Varga: "A Method for Mining Functional Dependencies in Relational Database Design Using FCA" (2008)

[3] Ganter, B. and Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, Berlin-Heidelberg-New York (1999)

[4] Wille, R.: Restructuring lattice theory: an approach based on hierarchies of concepts. In: I. Rival (ed.): Ordered Sets. Reidel, Dordrecht-Boston (1982) 445

[5] Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka and Hannu Toivonen: "TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies" (1998)

[6] Hereth, J.: Relational Scaling and Databases. Proceedings of the 10th International Conference on Conceptual Structures: Integration and Interfaces, LNCS 2393, Springer Verlag (2002) 62-76

[7] Garrett Birkhoff: Lattice Theory. American Mathematical Society Colloquium Publications, volume 25


