DISTRIBUTED QUERY PROCESSING, Characteristics of Query Processors




Distributed query processing is the procedure of answering queries (which means mainly read operations on large data sets) in a distributed environment where data is managed at multiple sites in a computer network.

A query in a DDBMS may require data from more than one site. The transmission of this data entails communication costs. If some of the query operations can be executed at the site of the data, they may be performed in parallel.

When the DDBMS supports fragmentation transparency, the user performs queries only on the global relations.

In distributed database, a query over global relations (called a global query) has to be expressed in terms of the queries over fragments called fragment query) of the relations using join and union operation.

Characteristics of Query Processors

Languages: The input language to the query processor can be based on relational calculus or relational algebra. With object DBMS, the language is based on object calculus which is merely an extension of relational calculus. Thus, decomposition in object algebra is also needed.

In a distributed context the output language is generally some internal form of relational algebra augmented with communication primitives. The operations of the output language are implemented directly in the system.

Types of Optimization: Query optimization aims at choosing the best point in the solution space of all possible execution strategies. An immediate method for query optimization is to search the solution space, exhaustively predict the cost of each strategy, and select the strategy with minimum cost. This method is effective in selecting the best strategy; it may incur a significant processing cost for the optimization itself.

Optimization Timing: A query may be optimized at different, times relative to the actual time of query execution. Optimization can be done statically before executing the query. Static query optimisation is done at query compilation time. Thus the cost of optimisation may be amortized over multiple query executions. Therefore, this timing is appropriate for use with the exhaustive search method.

Statistics: The effectiveness of query optimisation relies on statistics on the database. Dynamic query optimisation requires statistics in order to choose which operations should be done first. Static query optimisation is even more demanding since the size of intermediate relations must also be estimated based on statistical information. In a distributed database, statistics for query optimization typically bear on fragments, and include fragment cardinality and size as well as the size and number of distinct values of each attribute. To minimize the probability of error, more detailed statistics such as histograms of attribute values are sometimes used at the expense of higher management cost.

Decision Sites: Most systems use the centralised decision approach, in which a single site generates the strategy. The centralised approach is simpler but requires knowledge of the entire distributed database, while the distributed approach requires only local information. Hybrid approaches where one site makes the major decisions and other sites can make local decisions are also frequent.

Network Topology: The network topology exploited by the distributed query processor. With wide area networks, the cost functions to be minimized can be restricted to the data communication cost, which is considered to be the dominant factor. This assumption greatly simplifies distributed query optimisation, which can be divided into two separate problems. Selection of the global execution strategy, based on intersite communication, and selection of each local execution strategy, based on a centralised query processing algorithm.

Use of Semi joins: The semi join operation has the important property of reducing the size of the operand relation. When the main cost component considered by the query processor is communication, a semi join is particularly useful for improving the Processing of distributed join operations as it reduces the size of data exchanged between sites. However, using Semi joins may result in an increase in the number of messages and in the local processing time.