The discovery of phylogenetic trees as an example of the software engineering process.
DNA sequencing is a marvelous thing that, at it’s best, allows us to discover the entire genome of various lifeforms. A genome is a (very long) sequence of nuecleotides, commonly written as A, C, T, and G. I’m going to use these to try to outline how software development is more than just talking to a computer.
Suppose a biologist says to you, “DNA sequence is the ultimate answer for deciding what is most similar. Two plants may look similar, but look at their DNA, and you can see exactly how different they are.” (source). And you think to yourself, “Self, DNA sequences are millions of nucleotides long. Humans aren’t going to want to compare them by hand. They’re going to want computers to do it for them.” So you have a general problem in mind: given two DNA sequences, state how different they are.
But we aren’t done defining the problem yet,
not by a long shot.
A problem isn’t defined
until I can tell you if a particular example is a solution or not.
Some are easy: ACT
and ACT
have 0 differences.
But past that it gets messy;
how different from ACT
are
AAT
, AGT
, AT
, ACTT
,
TGA
, etc?
Two things seem pretty clear. First, we aren’t going to be able to answer these questions by ourselves; we’re going to need to ask a domain expert or possibly have someone design an experiment to give us the answer. Second, there had better be some generalizations possible or else we’re never going to finish defining the problem.
So we work with the biologist and come up with a decent set of changes that can happen to DNA and the relative likelihood of each. You can pretty well guarantee this process will have two problems. First, not everyone will agree on the set of operations; it’ll just be a pretty good guess, hopefully good enough to compare species similarity correctly. Second, there will be more and more complicated rules than you know how to handle. Some parts will be nice and clean, like “a nucleotide can randomly be replaced with a different nucleotide with probability X”; but some will be quite difficult to define fully, like “mutations in non- or low-functional areas of the sequence are more likely to propagate to later generations than are changes that control vital cellular functions.”
These problems are pretty typical. Tasks that people are interested in having solved almost always contain approximations, guesswork, and poorly-defined and/or really complicated pieces. So we’ll work with the biologist trying to make generalizations and approximations that are sufficient to make the problem crisp enough to program without going so far afield that it loses applicability to the original task.
The last DNA matching algorithm I saw only included individual nucleotides appearing, disappearing, or mutating, and had very general costs for each. I don’t know if the current algorithms are more complicated, but that one was hard enough given how long the sequences are…
Now that you have a well-defined problem you are left with two options. Either you can try to map your problem onto a problem some algorithms researcher solved earlier, or you can try to design your own algorithm from scratch.
So you start by searching around for string differencing algorithms and decided that the Levenshtein distance algorithm looks pretty good. But you also realize you don’t have time to run it to conclusion so you tweak it so it’ll pop out it’s best guess so far each time it gets closer to the real answer. There’re a few other unexpected obstacles when you find out about a few requirements they forgot to tell you about how they want things stored and how some of the genomes are really unknowns not A, C, T, or G, but finally you get it all designed out.
You write the program. It doesn’t work. You try to fix it. It kind of works. You think about starting over. Then your friend points out something obvious. You feel a bit embarrassed, but now it works great. Hooray!
Except it is even slower than you expected. You’re getting pretty bad approximations, and it’d take months of computing per DNA pairing to get good ones…. You try to tell the biologist it isn’t quite what they wanted, but they are delighted despite your objections because your output is so much better than what they had before.
Then comes the big part of the project. Someone discovers that for one (and only one) ridiculously-long genome (which seems just like every other genome to you) the computer crashes. You spend way too long working on it before you discover it is actually a problem with the way the input file was specified and not with the genome matching part of the program at all. You add some code that checks for that kind of mistake in the future.
The biologist asks if you can tweak the rules ever so slightly and at first you say “sure” but on closer inspection you realize that that change will mean Levenshtein distance no longer works because that algorithm relies on the one thing they want changed. You try to explain that even though it sounds small it is going to take a lot of effort, and the biologist is a bit upset but accepts the cost. So you start describing the solution anew.
This kind of thing continues until no one care to run the program anymore.
Looking for comments…