University of Virginia Computer Science CS216: Program and Data Representation, Spring 2006
06 February 2007
Problem Set 1: Sequence Alignment
Out: 18 January. Due: Monday, 30 January (11am), extended due to the book shortage.
Collaboration Policy - Read Carefully
For this assignment, you should do the first two parts (questions 1-6) on your own, and then meet with your assigned partner. You will receive an email shortly after you submit your registration survey that will identify your partner for this assignment.
When you meet with your partner, you should first discuss your answers to the first two parts to arrive at a consensus best answer for each question. The consensus answer is the only answer you will turn in. Then, you should work as a team on the final part (questions 7-10). When you are working as a team, both partners should be actively involved all the time, and you should take turns driving (that is, typing at the keyboard). Both teammates can (optionally) provide their own answers to question 11.
You may consult any outside resources you wish, including books, papers, web sites, and people, for information on Python programming. You should not, however, conduct web searches or look at reference material on sequence alignment or related problems. The point of this assignment is to get you thinking on your own and with your partner about good solutions to this problem; it would defeat that purpose if you spent your time looking up well-known good solutions instead of thinking for yourself.
You are strongly encouraged to take advantage of the staffed lab hours (which will be posted on the CS216 web site).
Purpose
These questions ask you to prove properties of the factorial function, n!, defined by this recurrence relation:
2. Prove that $n! \in \Omega(2^n)$.
3. Prove that $n^\alpha \in o(c^n)$ for any $\alpha > 0$ and $c > 1$. (This is Theorem 3 on page 22 of the textbook.)
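Before attempting the proofs in questions 2 and 3, it can help to see the claimed growth rates numerically. The sketch below is an illustration only, not a proof: it tabulates the ratio n!/2^n, which should grow without bound, and the ratio n^α/c^n, which should shrink toward zero. The function names and the sample values α = 3 and c = 1.5 are arbitrary choices made for this illustration.

```python
import math

def factorial_ratio(n):
    """Return n! / 2**n as a float."""
    return math.factorial(n) / 2.0 ** n

def poly_ratio(n, alpha, c):
    """Return n**alpha / c**n as a float."""
    return float(n) ** alpha / float(c) ** n

# Sample values only; a table like this suggests a trend but proves nothing.
for n in (4, 10, 20, 40, 100):
    print("%4d  %12.4e  %12.4e" % (n, factorial_ratio(n), poly_ratio(n, 3, 1.5)))
```

A proof still has to produce the constants required by the definitions of Ω and o; the table only shows which way the ratios move.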
It is up to you to determine the experiments to do. We have provided a Python class, Timer (in Timer.py), that you may find useful. It provides methods start and stop for starting and stopping a timer, and elapsed, which returns the elapsed time between the start and stop calls (in seconds).
For example, this code measures the time it takes to perform 100000 list append operations:
    import Timer

    timer = Timer.Timer()
    alist = []
    timer.start()
    for i in range(100000):
        alist.append(i)
    timer.stop()
    print("Time: %2.6f" % (timer.elapsed()))
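If Timer.py is not at hand, a stand-in with the same three methods is easy to write. The class below is a hypothetical sketch, not the course's implementation: it assumes the timer simply records clock readings from Python 3's time.perf_counter, and it raises an error if elapsed is called before both start and stop.

```python
import time

class Timer:
    """Hypothetical stand-in for the Timer class in the course's Timer.py."""

    def __init__(self):
        self._start = None
        self._stop = None

    def start(self):
        # Reset any previous stop reading, then record the start time.
        self._stop = None
        self._start = time.perf_counter()

    def stop(self):
        self._stop = time.perf_counter()

    def elapsed(self):
        # Seconds between the most recent start() and stop() calls.
        if self._start is None or self._stop is None:
            raise ValueError("call start() and stop() before elapsed()")
        return self._stop - self._start

timer = Timer()
alist = []
timer.start()
for i in range(100000):
    alist.append(i)
timer.stop()
print("Time: %2.6f" % timer.elapsed())
```

Because this class is defined in the same file, it is created with Timer() rather than Timer.Timer().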
5. Determine the asymptotic complexity of the list insert operation. The time of an insert operation may be affected by two properties of the input: the size of the list, n and the location index where the new value is inserted, l. Express your answer using order notation, as precisely as possible. Your answer should include an explanation of the experiments you did, as well as the Python code and the results of your experiments.
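One possible shape for such an experiment is sketched below. It is a starting point under stated assumptions, not a prescribed method: the function name time_inserts, the list sizes, and the number of trials are all made up for illustration, and the timing uses Python 3's time.perf_counter instead of the provided Timer class. Note that the list grows by one element per trial, so the number of trials should stay small relative to n.

```python
import time

def time_inserts(n, index, trials=200):
    """Average seconds per insert at position `index` in a list of about n items."""
    lst = list(range(n))
    start = time.perf_counter()
    for _ in range(trials):
        lst.insert(index, 0)
    return (time.perf_counter() - start) / trials

# Vary the list size n and the insertion location l independently.
results = {}
for n in (1000, 10000, 100000):
    for label, index in (("front", 0), ("middle", n // 2), ("end", n)):
        results[(n, label)] = time_inserts(n, index)
        print("n=%7d  l=%-6s  %.3e s/insert" % (n, label, results[(n, label)]))
```

Single runs are noisy, so repeating each measurement and comparing how the times change as n and l change is more informative than any one number.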
6. Briefly speculate on how Python implements lists and the list operations. Can you determine if Python lists are implemented using a continuous representation or a linked representation? To answer this well, you may need to experiment with other list operations such as access (lst[i]) and selection (lst[1:] or lst[a:b]).
Sequence alignment is an important component of many genome analyses. For example, it is used to determine which sequences are likely to have come from a common ancestor and to construct phylogenetic trees that explain the most likely evolutionary relationships among a group of genes, species, or organisms. As genomes evolve, they mutate. Sequences can be altered by substitution (one base is replaced by another base), insertions (some new bases are inserted in the sequence), or deletions (some bases are deleted from the sequence).
To identify a good sequence alignment, we need a way of measuring how well an arrangement is aligned. This is done using a goodness metric that takes as input the two aligned sequences and calculates a score that indicates how well aligned they are. The goodness metric we will use is:
The values of c and g are constants chosen to reflect the relative likelihood of a point mutation and of an insertion or deletion. For these examples, we will use values of c = 10 and g = 2. Note that the values selected will affect which alignment is best.
Example. Consider the catish sequences:
    catcatggaa
    |||
    catagcatgg

    cat--catggaa
    |||  |||||
    catagcatgg--
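The scoring of these two alignments can be sketched in Python. The metric used here is an assumption inferred from how the bestAlignment code in Align.py uses c and g: each position where the two bases match adds c, and each gap character subtracts g. The name goodness_score and the use of '-' as the gap character are likewise assumptions for this illustration; the goodnessScore procedure in Align.py may differ.

```python
GAP = '-'  # assumed gap character, matching the alignments shown above

def goodness_score(U, V, c, g):
    """Score two aligned, equal-length sequences.

    Assumed metric: +c per matching position, -g per gap character.
    """
    if len(U) != len(V):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(U, V) if a == b and a != GAP)
    gaps = U.count(GAP) + V.count(GAP)
    return c * matches - g * gaps

# With c = 10 and g = 2:
print(goodness_score("catcatggaa", "catagcatgg", 10, 2))      # 3 matches, 0 gaps -> 30
print(goodness_score("cat--catggaa", "catagcatgg--", 10, 2))  # 8 matches, 4 gaps -> 72
```

Under this assumed metric the second alignment scores higher, even though it pays for four gaps, because it lines up eight matching bases instead of three.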
We can do this by recursively trying all three possibilities at each position:
Hence, we can find the best alignment of sequences U and V using this Python code (found in Align.py):
    def bestAlignment(U, V, c, g):
        if len(U) == 0 or len(V) == 0:
            # one sequence is exhausted: pad the shorter one with gaps
            while len(U) < len(V): U = U + GAP
            while len(V) < len(U): V = V + GAP
            return U, V
        else:
            # try with no gap
            (U0, V0) = bestAlignment(U[1:], V[1:], c, g)
            scoreNoGap = goodnessScore(U0, V0, c, g)
            if U[0] == V[0]: scoreNoGap += c
            # try inserting a gap in U (no match for V[0])
            (U1, V1) = bestAlignment(U, V[1:], c, g)
            scoreGapU = goodnessScore(U1, V1, c, g) - g
            # try inserting a gap in V (no match for U[0])
            (U2, V2) = bestAlignment(U[1:], V, c, g)
            scoreGapV = goodnessScore(U2, V2, c, g) - g
            if scoreNoGap >= scoreGapU and scoreNoGap >= scoreGapV:
                return U[0] + U0, V[0] + V0
            elif scoreGapU >= scoreGapV:
                return GAP + U1, V[0] + V1
            else:
                return U[0] + U2, GAP + V2

The procedure bestAlignment in Align.py implements this algorithm.
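To try the procedure without Align.py, the sketch below pairs the bestAlignment code above with stand-ins for the two names it relies on. Both stand-ins are assumptions: GAP is taken to be '-', and goodnessScore is taken to add c for each matching position and subtract g for each gap character. The input sequences are made up and deliberately short, since the running time grows quickly with length.

```python
GAP = '-'  # assumption: the gap character used by Align.py

def goodnessScore(U, V, c, g):
    # Assumed metric: +c per matching base, -g per gap character.
    matches = sum(1 for a, b in zip(U, V) if a == b and a != GAP)
    return c * matches - g * (U.count(GAP) + V.count(GAP))

def bestAlignment(U, V, c, g):
    if len(U) == 0 or len(V) == 0:
        # one sequence is exhausted: pad the shorter one with gaps
        while len(U) < len(V): U = U + GAP
        while len(V) < len(U): V = V + GAP
        return U, V
    # try with no gap
    (U0, V0) = bestAlignment(U[1:], V[1:], c, g)
    scoreNoGap = goodnessScore(U0, V0, c, g)
    if U[0] == V[0]: scoreNoGap += c
    # try inserting a gap in U (no match for V[0])
    (U1, V1) = bestAlignment(U, V[1:], c, g)
    scoreGapU = goodnessScore(U1, V1, c, g) - g
    # try inserting a gap in V (no match for U[0])
    (U2, V2) = bestAlignment(U[1:], V, c, g)
    scoreGapV = goodnessScore(U2, V2, c, g) - g
    if scoreNoGap >= scoreGapU and scoreNoGap >= scoreGapV:
        return U[0] + U0, V[0] + V0
    elif scoreGapU >= scoreGapV:
        return GAP + U1, V[0] + V1
    else:
        return U[0] + U2, GAP + V2

# Short, made-up inputs.
A, B = bestAlignment("catcat", "catgcat", 10, 2)
print(A)
print(B)
print(goodnessScore(A, B, 10, 2))
```

Removing the gap characters from each returned string should give back the original inputs, which is a quick sanity check on any alignment procedure.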
8. Determine analytically the asymptotic time of the bestAlignment procedure. Include a convincing argument supporting your answer and be sure to clearly define all variables you use in your answer.
9. A typical gene is a few thousand bases long. Predict how long it would take your bestAlignment procedure to align two 1000-base pair human and mouse genes.
10. Suggest approaches for improving the performance of bestAlignment and predict how they would affect your answers to 8 and 9. (You do not need to implement them, although we will be especially impressed if you do.)
David Evans, evans@cs.virginia.edu