The hunt for genes is more like that for Timbuctu than for El Dorado. The mappers soon found that genes are oases of sense in a desert of nonsense. At one time, it seemed scarcely worth sifting the sands between the genetic cities, but, in the end, the complete map was made mainly on the grounds that it was worth while as one never knows what might turn up. It reaffirmed one of the most misunderstood facts in science: that it is possible to solve most problems by throwing money at them.
The assault on the physical map is best compared to surveying a country with a six-inch ruler, starting at one end and driving on to the opposite frontier. Twenty and more years ago, when the job began, one person could do about five thousand DNA bases a year. Now, it is routine to do thousands of times as many. Much of the intellectual effort of the job has moved from the simple accumulation of information to understanding it. Computer wizardry has played as important a part in the gene map as has biochemical machinery.
Once a segment of DNA has been sequenced, the local maps – the town plans – must be put in the right order. One way to build up a larger chart is to make a series of overlapping sequences of short pieces of DNA. The approach is a little like putting pages ripped out of a street guide back together by looking at the overlaps at the edge of each page in an attempt to find streets which run into each other. Sophisticated programs look for superimposed segments, long or short, and reassemble the torn fragments of DNA. That is much harder than it seems. An alphabet of just four letters and – like the map of an American city – many repeats of the same pattern of streets give plenty of chances for confusion. There are some short cuts. One trick, useful in the early days, was to jump several pages in the guide in the hope of missing out particularly tedious parts of the neighbourhood, but for completion even the dullest parts of town must be charted.
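The page-matching idea can be sketched in a few lines of Python. What follows is a toy illustration only, not the software the mappers used: the function names `overlap` and `assemble`, the `min_len` threshold and the example fragments are all invented for the purpose. It simply joins, again and again, the two fragments whose edges share the longest run of letters.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Overlaps shorter than `min_len` letters are ignored (returns 0).
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_len=3):
    """Greedily merge the pair of fragments with the largest overlap.

    A hypothetical, simplified assembler: it stops when no two pieces
    share an edge of at least `min_len` letters.
    """
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining pages share an edge
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


# Three made-up "pages", given out of order
print(assemble(["TACGGA", "ATGCGT", "CGTTAC"]))
```

With only four letters to play with, short overlaps turn up by chance, and a repeated stretch can match in several places at once. A greedy sketch like this one will happily join the wrong pages, which is one reason the real task is so much harder than it seems.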
New and powerful computers have made it possible, in principle at least, to make a whole genetic atlas at once, rather than piecing it together page by page. The ‘random shotgun’ approach lives up to its name. It blasts copies of the genome into thousands of segments, again and again, and, like a taxidermist rebuilding a single pheasant from the casual slaughter of many by a blind man with a twelve-bore, reconstitutes the whole thing from scratch. A giant program puts all the shattered pieces together, until at last they look like a map (or a game-bird). That approach worked well in fruit-flies, whose genome was sequenced before that of our own, but flies have a tenth as many DNA letters and far less repetition of easily-confused short sequences than we do. The less audacious ‘clone by clone’ approach takes tiny fragments (each about a twenty-thousandth of the whole of human DNA) and sequences them one by one. Then, it reassembles short segments of genes and, in time, re-forms the whole atlas. The approach, plodding as it may be, has worked well with humans and was used by the publicly-funded mappers to publish each clone as it appeared and to help thwart the privatised plan to sequence (and patent) the whole of our DNA at one fell swoop.
The physical map does not look at all like the linkage maps which emerged from family studies. The central difficulty is one of scale. A few tens of thousands of functional genes fit into three thousand million DNA letters. As most genes use only the information coded into several thousand bases there seems to be far more DNA than is needed. Mapping shows that just one part in twenty represents part of a gene. Our genome has an extraordinary and quite unexpected structure.
A geographical analogy may help. Imagine the journey along the whole of your own DNA as a trip from Land’s End to John o’Groat’s via London; about a thousand miles altogether. To fit all the DNA letters into a road map on this scale, there have to be fifty DNA bases per inch, or about three million per mile. The journey passes through twenty-three counties of different sizes. These administrative divisions, conveniently enough, are the same in number as the twenty-three chromosomes into which human DNA is packaged. With the exception of some short segments a few hundred yards long which, for various technical reasons, have proved recalcitrant, the whole lot has been mapped out with an accuracy of one part in fifty thousand – an inch in a mile (which is as good as or better than the maps sold by the Ordnance Survey).
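The scale of this imaginary map can be checked with a little arithmetic. The sketch below uses only the round figures given in the text (three thousand million letters, a thousand miles); the variable names and the helper `bases_in` are invented here for illustration.

```python
GENOME_BASES = 3_000_000_000   # "three thousand million DNA letters"
ROUTE_MILES = 1_000            # Land's End to John o'Groat's via London
INCHES_PER_MILE = 63_360       # 1,760 yards of 36 inches each

bases_per_mile = GENOME_BASES / ROUTE_MILES
bases_per_inch = bases_per_mile / INCHES_PER_MILE


def bases_in(feet):
    """Rough number of DNA letters covered by `feet` of the imaginary road."""
    return feet * 12 * bases_per_inch


print(f"{bases_per_mile:,.0f} bases per mile")
print(f"{bases_per_inch:.1f} bases per inch")
```

The figure per inch comes out a little under fifty, which is why "fifty per inch" and "three million per mile" are both fair round numbers for the same scale.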
The scenery for most of the trip is tedious. Like much of modern Britain it seems to be unproductive. About a third of the whole distance is covered by repeats of the same message. Fifty miles, more or less, is filled with words of five, six or more letters, repeated next to each other. Many are palindromes. They read the same backwards as forwards, like the obituary of Ferdinand de Lesseps – ‘A man, a plan, a canal: Panama!’ Some of these ‘tandem repeats’ are scattered in blocks all over the genome. The position and length of each block vary from person to person. The famous ‘genetic fingerprints’, the unique inherited signature used in forensic work, depend on variation in the number and position of such segments. Other repeated sequences involve just the two letters, C and A, multiplied thousands of times while yet more are remnants of ancient viruses. Large sections of the genome are given over to long and complicated messages that seem to say nothing.
It is dangerous to dismiss all this DNA as useless because we do not understand what it says. The Chinese term ‘Shi’ can – apparently – have seventy-three different meanings depending on how it is pronounced. It is possible to construct a sentence such as ‘The master is fond of licking lion spittle’ just by using ‘Shi’ again and again. This would seem like empty repetition to those who cannot speak Chinese.
Much of the inherited landscape is littered with the corpses of abandoned genes, sometimes the same one again and again. The DNA sequences of these ‘pseudogenes’ look rather like that of their functional relatives, but are riddled with decay and no longer make anything. At some time in their history a crucial part of the machinery was damaged. Since then they have been rusting. Oddly enough, the same pseudogenes may turn up at several points along the journey.
After many miles of dull and repetitive DNA terrain, we begin to see places where some product is made. These are the functional genes. They, too, have some surprises in their structure. Each can be recognised by the order of the letters in the DNA alphabet, which start to read in words of three letters written in the genetic code, as a hint that it could produce a protein. In most cases there are few clues about what its product does, although its structure can be deduced (and its shape inferred) from the order of its DNA letters.
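How a stretch of letters might hint that it could produce a protein can be shown with a small sketch. It relies on details of the standard genetic code that the passage does not spell out: a protein-coding message conventionally opens with the three-letter word ATG and closes with one of TAA, TAG or TGA. The function name `open_reading_frames`, the `min_codons` cut-off and the test sequence are invented for illustration, and the sketch reads only one strand and ignores the interruptions found in real human genes.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(dna, min_codons=3):
    """Return (start, end) positions of runs of three-letter words that
    begin with ATG and end at the first stop word in the same frame.

    Only the given strand is scanned; `min_codons` is the least number
    of words (start included, stop excluded) worth reporting.
    """
    dna = dna.upper()
    found = []
    for frame in range(3):  # the message may start at offset 0, 1 or 2
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    found.append((start, pos + 3))
                start = None
    return found


# A made-up sequence with one short candidate message inside it
sequence = "CCATGAAATTTGGGTAACC"
for start, end in open_reading_frames(sequence):
    print(start, end, sequence[start:end])
```

A long uninterrupted run of such words is unlikely to arise by chance in the desert, so it serves as a signpost that a gene may be near, although it says little about what the product does.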
Most genes are arranged in groups that make related products, with about a thousand of these ‘gene families’ altogether. One is involved in the manufacture of the red pigment of the blood. Most of the DNA in the bone-marrow cells which produce the red cells of the blood is switched off. One small group of genes is hard at work. As a result they are better known than any other. Much of human molecular biology grew from research on this particular genetic industrial centre, the globin genes.
They have two factories. One is halfway along the genetic road to John o’Groat’s – in Leeds. It makes one part of the protein involved in carrying oxygen. The beta-globin industrial estate contains about half a dozen sections of DNA that code for related things. That responsible for part of adult haemoglobin (and involved, when it goes wrong, in sickle-cell disease) is quite small: about three feet long on this map’s scale. A few feet away is another one which makes a globin found in the embryo. Close to that is the decayed hulk of some equipment which stopped working years ago. The beta-globin factory covers about a hundred feet altogether, most of which seems to be unused space between functional genes. It co-operates with a sister estate, the alpha-globin unit, a long way away (near London, on this mythical map), which produces a related protein. When joined together, the two products make the red blood pigment itself. Most genes are arranged in families, either close together or scattered all over the genome.
The map of ourselves shows that genes are of very different size, from about five hundred letters long to more than two million. One makes the largest known protein, titin, a molecular shock-absorber; a long, pleated structure found in muscles, in blood cells and in chromosomes. Whatever the size of its product, titin is by no means the largest gene. Most human genes have their functional segments interrupted by