[an error occurred while processing this directive]
Problem Set 2 Phylogeny |
Out: 30 January Due: 8 February (11am) |
Collaboration Policy - Read Carefully
For this assignment, you should do the first two parts (questions 1-5) on your own, and then meet with your assigned partner. Assigned partners will be emailed to the course list on Monday, January 30.
When you meet with your partner, you should first discuss your answers to the first two parts to arrive at a consensus best answer for each question. The consensus answer is the only answer you will turn in. Then, you should work as a team on the final part (questions 6-10). When you are working as a team, both partners should be actively involved all the time and you should take turns driving (who is typing at the keyboard).
You may consult any outside resources including books, papers, web sites and people, you wish for information on Python programming. Unlike Problem Set 1, you should feel free to conduct web searches or look at reference material on Sequence Alignment, Phylogeny, and related problems as you wish. You are also encouraged to discuss these problems with students in the class, including (but not limited to) your assigned partner.
You are strongly encouraged to take advantage of the staffed lab hours (which will be posted on the CS216 web site).
Purpose
LinkedList.py | ContinuousList.py | |||
Operation | Running Time | Memory | Running Time | Memory |
---|---|---|---|---|
length(self) | Θ(n) | Θ(1) | Θ(1) | Θ(1) |
__init__(self) | ||||
access(self,index) | ||||
append(self,value) | ||||
__str__(self) |
We can define a mutable abstract datatype with operations as follows (note that the first three operations are identical to those for the immutable list datatype from Lecture 3:
Consider our alignment code from PS1, excerpted below:
def bestAlignment (U, V, c, g): if len(U) == 0 or len(V) == 0: ... else: # try three possibilities: (U0, V0) = bestAlignment (U[1:], V[1:], c, g) ... (U1, V1) = bestAlignment (U, V[1:], c, g) ... (U2, V2) = bestAlignment (U[1:], V, c, g) ... # pick the best oneAlthough this is a clear way of finding the best alignment, as discussed in Lecture 4 it is very inefficient. So inefficient, that we cannot find alignments for non-trivial strings.
The modified code is found in DynAlign.py and shown below. The key changes are bolded:
def bestAlignment (U, V, c, g): def memoBestAlignment (U, V, c, g): def makeKey (U, V): return U + "%" + V if memo.has_key(makeKey (U,V)): res = memo[makeKey (U,V)] return res[0], res[1] if len(U) == 0 or len(V) == 0: while len(U) < len(V): U = U + GAP while len(V) < len(U): V = V + GAP resU = U resV = V else: # try with no gap (U0, V0) = memoBestAlignment (U[1:], V[1:], c, g) scoreNoGap = goodnessScore (U0, V0, c, g) if U[0] == V[0]: scoreNoGap += c # try inserting a gap in U (no match for V[0]) (U1, V1) = memoBestAlignment (U, V[1:], c, g) scoreGapU = goodnessScore (U1, V1, c, g) - g # try inserting a gap in V (no match for U[0]) (U2, V2) = memoBestAlignment (U[1:], V, c, g) scoreGapV = goodnessScore (U2, V2, c, g) - g if scoreNoGap >= scoreGapU and scoreNoGap >= scoreGapV: resU = U[0] + U0 resV = V[0] + V0 elif scoreGapU >= scoreGapV: resU = GAP + U1 resV = V[0] + V1 else: resU = U[0] + U2 resV = GAP + V2 memo[makeKey(U,V)] = [resU, resV] return resU, resV memo = {} return memoBestAlignment (U, V, c, g)
The Tree of Life project is developing a phylogeny for organisms on Earth. If you are unsure of your place in the univere, try staring from Life on Earth and walking down the tree to find Homo sapiens.
The way biologists (or linguists) determine evolutionary relationships is to look for similarities and differences between species (or languages). This is done by identifying a set of features that describe properties of a species or language. For species, the features might be phenotypic properties (e.g., do organisms have wings or gills?) or genotypic properties (the DNA sequence). Genotypic properties are likely to produce more accurate results, since small changes in genomes can produce large phenotypic changes. Note that this is a historical study. It can rarely provide definitive proof of a particular relationship, but a preponderance of evidence can make one explanation appear to be the most likely.
If two species have similar genomes, it is likely they evolved from a relatively recent comon ancestor. Biologists measure the similarity of genomes based on the number and likelihood of different kinds of mutations — base pairs may be inserted, deleted, duplicated, moved, or substituted. The number of mutations necessary to match two genomes gives an indication of the likelihood that the species evolved from a common ancestor. For this assignment we will assume a very simple model: the only mutation is a substitution or a single base pair and all substitutions are equally likely.
One measure of which tree is the most likely to represent the actual evolution of a set of species is parsimony. The parsimony principle is that if there are two possible explanations for an observed phenomenon, the simpler explanation is most likely to be correct. In producing phylogenetic trees, parsimony means we should look for the tree that requires the fewest possible total number of mutations. The goodness scores of the best possible alignments of two nucleotide sequences are one way of measuring how related they are. So, our goal is to construct a tree that maximizes the total goodness score of all connected pairs.
For example, consider the set of species described by the genomes below (of course, these are not their real genomes!):
Species | Sequence |
---|---|
Cat | catcat |
Dog | gggggg |
Feline | cccccc |
Tiger | cccaat |
The goodness scores of the possible pairs (using the c=10, g=2 goodness metric from PS1) :
Species | Cat | Dog | Feline | Tiger |
---|---|---|---|---|
Cat | - | 0 | 20 | 36 |
Dog | 0 | - | 0 | 0 |
Feline | 20 | 0 | - | 30 |
Tiger | 36 | 0 | 30 | - |
Note that our goodness score metric is symmetric (that is goodness(a,b) = goodness(b,a)).
Our goal is to find likely evolutionary relationships among the species by maximizing the sum of the goodness scores of all direct relationships. For example, consider the tree:
Cat / \ / \ Tiger Dog / / FelineThe total goodness score is goodness(Cat, Tiger) + goodness (Cat, Dog) + goodness (Dog, Feline) = 36.
This is a less likely phylogeny than,
Feline / \ / \ Dog Tiger / / Catwhich has a total goodness score of 66. Other trees have the same score, but no tree has a higher score.
For the remaining questions on this assignment, and most of Problem Set 3, you will explore algorithms and data structures in the context of finding phylogenetic trees. Note that we have greatly simplified the actual problem of determining biological evolutionary relationships. In fact, many species evolved from common ancestors which are now extinct. So, a more realistic phlogeny program would need to insert additional nodes to find a likely tree.
Cat Tiger Dog Felineand
Feline Dog Tiger CatNote that we do not need to distinquish between the left and right child when a tree has only one child.
def children(self): if not self.__left == None: yield self.__left if not self.__right == None: yield self.__rightIt will yield the left child (if there is one) the first iteration through the loop, and the right child (if there is one) the second iteration. When the generator exits, there are no more values to yield and the calling loop terminates. A client uses it like this,
childsum = 0 for child in tree.children(): childsum += child.getValue ()The generator defined below yields all possible two-part partitions of the input list:
def allPossiblePartitions (items): if len(items) == 1: yield [items[0]], [] yield [], [items[0]] else: for left, right in allPossiblePartitions (items[1:]): lplus = left[:] lplus.insert (0, items[0]) yield lplus, right rplus = right[:] rplus.insert (0, items[0]) yield left, rplus
for p1, p2 in allPossiblePartitions (s): print p1, p2Use n to represent the number of elements in s. You may assume print is O(1).
findTree ({'feline':'cccccc', 'cat':'catcat', \ 'tiger':'cccaat', 'dog':'gggggg'})should produce the all trees with maximal goodness score (66), including the tree above. (The number of trees is 60, if we count isomorphic trees where the trees would be identical if the left and right children are swapped. A better solution would remove these isomorphically equivalent trees (leaving 11 distinct trees), since there is no different meaning associated with the left and right children. It is acceptable for a "green star" level solution to this question to include trees that are isomorphically equivalent in your output.)
CS216: Program and Data Representation University of Virginia |
David Evans evans@cs.virginia.edu Using these Materials |