CS5481 Data Engineering Assignment 2
Deadline: 25‐OCT‐2019 (Friday), noon (no late submission will be accepted)
Points to note:
Different books may have slightly different descriptions of concepts, algorithms and
terminologies. To ensure fair assessment and uniformity in marking, you must follow the
convention used in the lecture slides or our textbook (Database System Concepts). Other
conventions will not be accepted.
Students are expected to generalize the concepts they have learnt during the lecture in
order to finish the assignment.
You must show the steps clearly. The marker will not give you marks if he cannot
understand your work.
This is an individual assignment. You must work on your own. Check
http://www6.cityu.edu.hk/ah/plagiarism.htm for “The Problem of Plagiarism”.
Submit the file to Canvas on or before the deadline. The file type must be either .docx file or .pdf file.
Use your student ID(s) to name the file, such as 5xxxxxxx.docx or 5xxxxxxx.pdf.
Query Processing and Optimization
Consider the following database schema.
Category (category-no, category-name, …)
Reader (reader-id, gender, age, grade-no, …)
Book (book-id, c-no, …)
Loan-record (r-id, b-id, loan-date, …)
The attributes r-id and b-id in Loan-record are foreign keys referencing reader-id in Reader
and book-id in Book, respectively. The attribute c-no in Book is a foreign key referencing
category-no in Category.
Suppose ONLY the following information and statistics are available and assume that all
distributions are uniform and independent of each other.
Number of categories (nCategory): 280
Number of readers (nReader): 50,000
Number of books (nBook): 4,500
Number of loan records (nLoan-record): 60,000
Blocking factor of Category (fCategory): 8 Blocking factor of Reader (fReader): 12
Blocking factor of Book (fBook): 9 Blocking factor of Loan-record (fLoan-record): 10
Proportion of category names containing ‘science’: 1/7
Number of female readers: 24,950
Range of age: from 12 to 72
Values of grade-no: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Category, Reader and Book are sequential files in sorted order of their primary key
and Loan-record is a sequential file in sorted order of loan-date. Records do not span
across blocks.
CS5481 Data Engineering Assignment 2
Consider the following SQL query that retrieves all information about female readers whose
age is less than 30 with grade number above 6 and have loan records of any non-sciencerelated books. select * from Category c, Reader r, Book b, Loan-record l where c.category-name not like ’%science%’
and c.category-no=b.c-no and b.book-id=l.b-id
and r.reader-id=l.r-id and r.gender=’female’
and r.age<30 and r.grade-no>6;
(a) Write an unoptimized relational-algebra expression of the above SQL query that might
be initially generated from an SQL translator.
(b) Estimate the number of records the query returns. Explain your estimation.
(c) Assume that only 30 memory blocks are available and consider the cost of disk block
transfers (only). If left-deep join orders are considered, suggest the best evaluation plan for
the query. Particularly, you need to do the followings.
(i) Transform the unoptimized relational-algebra expression in Part (a) it into an
equivalent optimized relational-algebra expression for producing your suggested
evaluation plan. Show all the steps and state clearly the rule number of the equivalence
rule you used in each step.
(ii) Draw the fully annotated evaluation plan (including exactly what algorithm is used
for each operation and how the execution of the operations is coordinated).
(iii) Indicate in the evaluation plan the memory allocation (number of memory blocks
allocated to each input and output (bb), but the total cannot exceed the available
memory at one time).
(iv) Find the cost of each operation and the total cost of the whole plan in number of
disk block transfers.
(v) State clearly any justifiable assumptions you have made in your estimation.
(d) If any join orders can be considered, is it possible to have a better evaluation plan for the
query? If yes, you need to follow the steps in Part (c) to explain your answer.
(e) If a B+-tree index can be created on one attribute for one of the base relations, is it
possible to have a better evaluation plan for the query? If yes, which attribute and which
relation will you choose? Again, if yes, you need to follow the steps in Part (c) to explain
your answer.