Genomics is not source-code

A lot of the squee I've heard about the sequencing of the human genome, and the ever-dropping cost of completely sequencing a single genome, has been in the nature of "we're figuring out how nature programs biology!" This is true to a point, but the reality of programming new life or new functions of life is far, far in the future. Yes, we've already created artificial life, but it wasn't done with full understanding of the source code we used; we took the chunks of code that govern the functions we wanted and fitted them together.

Biology is in some ways like computers in that there is a (presumably) deterministic process that governs the rules of how it works. It exists in a fundamentally chaotic environment, which makes extracting that determinism pretty hard. But we're sure there is a causal chain for most anything, if only we look hard enough. For computers we know it all end to end (we wrote the things, so we should), but we are only now getting to the levels of complexity where these systems can mimic non-deterministic behavior. Even then, if we dig down into the failure analysis we can isolate the root and associated causes of the failure chain. We want to do that with biology.

We are far, far away from doing that.

Biology up until the genomics 'revolution' has been in large part describing the function of things. Our ability to stick probes in places has improved over time, which in turn has increased our understanding of how biology interacts with the environment at large. We've even done large scale changes to organisms to see how they behave under faulty conditions, just so we can better figure out how they work. Classic reverse-engineering, in other words. You'd think having access to the source code would make it go much easier. But... not really.

Let's take an example: a hand-held GPS unit. This relatively simple device should be easy to reverse engineer. It has a simple function: provide a precise location. It has some ancillary functions, such as providing accurate time and giving a map of the surroundings. Ok.

After detailed analysis of this device we can derive many things:

  • It uses radio waves of a specific wavelength set to receive signals.
  • Those signals are broadcast by a constellation of satellites, and it has to receive signals from no fewer than three of them before it can work out a position.
  • The time provided is very stable, though if it doesn't receive signals from the satellites it will drift at a mostly predictable rate.
  • Which bits receive the satellite signal, since it doesn't work if they're removed.
  • Where the maps are stored, since removing that bit leaves it without any.
  • A whole variety of ways to electrically break the gizmo.
  • How it seems to work electrically.

Additionally, we can infer a few more things:

  • The probable orbits of the satellites themselves.
  • The math used to generate position.
  • The existence of an authoritative time-source.

Nifty stuff. What does the equivalent of 'genomics' give us? It gives us the raw machine code that runs the device itself. Keep in mind that we also don't know what each instruction does, and don't yet have high confidence in our ability to discriminate between instructions. And most importantly, we don't know the features of the instruction-set architecture. There is a LOT more work to do before we can make the top-level functional analysis meet up with the bottom-level instructional analysis. Once we do join up, we should be able to understand how it fundamentally works.
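To make the "we have the bytes but not the instruction set" problem concrete, here's a toy sketch. Everything in it is made up for illustration: the byte string, the instruction widths, and the opcode-length table are all hypothetical, not anything a real GPS unit (or genome) uses. It just shows that the same raw bytes carve up into different "instructions" depending on which architecture you guess.

```python
# Toy illustration only: the byte string and every "ISA" guess below are
# hypothetical. The point is that identical raw bytes split into different
# instruction streams depending on the decoding rules we assume.

RAW = bytes([0x01, 0x02, 0x03, 0x01, 0x02, 0x03])


def decode_fixed(raw, width):
    """Split raw bytes into fixed-width instruction words (hex strings)."""
    return [raw[i:i + width].hex() for i in range(0, len(raw), width)]


def decode_variable(raw, lengths):
    """Split raw bytes assuming the first byte of each instruction sets its length.

    `lengths` maps an opcode byte to a total instruction length in bytes.
    Opcodes missing from the table are assumed to be one byte long.
    """
    out, i = [], 0
    while i < len(raw):
        n = lengths.get(raw[i], 1)
        out.append(raw[i:i + n].hex())
        i += n
    return out


# Three equally plausible guesses about the architecture.
guess_a = decode_fixed(RAW, 2)                          # 2-byte instructions
guess_b = decode_fixed(RAW, 3)                          # 3-byte instructions
guess_c = decode_variable(RAW, {0x01: 1, 0x02: 2})      # variable-length

for name, guess in [("a", guess_a), ("b", guess_b), ("c", guess_c)]:
    print(name, guess)
```

All three guesses consume every byte without complaint, and nothing in the bytes themselves says which one is right. That has to come from tying the decoding back to observed behavior.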

But in the meantime we have to reverse-engineer the ISA, the processor architecture itself, the signal processing algorithms (which may be very different from what we inferred with the functional analysis), how the device tolerates transient variability in the environment, how it uses data storage, and other such interesting things. There is a LOT of work ahead.

Biology is a lot harder, in no small part because it has built up over billions of years and the same kinds of problems have been solved any number of ways. What's more, there is enough error tolerance in the system that you have to do a lot of correlational work before you can identify what's signal and what's noise. Environment also plays a key role, which is most vexing as environment is fundamentally chaotic and cannot be 100% controlled for.

We're learning that a significant part of our genome is dedicated to surviving faulty instructions in our genetic code, and we hadn't realized those safeguards were there before. We're learning ever more interesting ways that faults can change the effect of code. We're learning that the mechanics we had presumed existed for code implementation are in fact wrong in small but significant ways. The work continues.

We may have the machine-code of life, but it is not broken down into handy functions like CreateRetina(). Something like that would be source-code, and is far more useful to us systemizing hominids. We may get there, but we're not even close yet.
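For the programmers in the audience, here's a rough sketch of the gap I mean, using Python's own compiled bytecode as a stand-in for "machine code". The create_retina function and everything inside it is a hypothetical placeholder, not a claim about how a retina is actually built.

```python
# Hypothetical "source code" for a biological function. The name, the layers,
# and the logic are placeholders invented for this illustration.
def create_retina(cell_count):
    """Human-readable: the name and structure tell you what it is for."""
    layers = ["photoreceptors", "bipolar cells", "ganglion cells"]
    return {layer: cell_count for layer in layers}


# The compiled instruction stream for the same function: just bytes.
raw = create_retina.__code__.co_code

print(type(raw).__name__, len(raw), "bytes")
print(raw.hex())
```

The hex dump is the part genomics hands us. In Python the function name and the strings are kept in separate tables alongside the bytes, so we can look them up; in the analogy, those tables are exactly what we don't have.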