“Abundant bioactivity” of random DNA sequences?

This blog was written for the Nature Ecology and Evolution Community where it is posted here.

Probing the claims of a recent study

Readers of this blog will be aware of the recent Nature Ecology and Evolution paper entitled “Random sequences are an abundant source of bioactive RNAs or peptides”. Rafik Neme, the first author, posted an engaging Behind the Paper blog here.

On a quick look, I thought the study might be the beginnings of the solution to the mystery of orphan genes. (I posted about orphan genes here a few months ago.) The paper appears to demonstrate that an unexpectedly high percentage of random 150 base-pair DNA sequences are functional when expressed in E. coli. If true, this would suggest that de novo gene evolution could occur easily from junk DNA.

But on closer reading, I have concluded that this is not what the paper shows. I had to re-evaluate my initial understanding of the terms “abundant” and “bioactive” in the paper’s title.


Let’s start with the term “abundant”. In the paper’s abstract this is fleshed out as:

“Contrary to expectations, we find that random sequences with bioactivity are not rare. In our experiments we find that up to 25% of the evaluated clones enhance the growth rate of their cells and up to 52% inhibit growth.”

This is a staggeringly high percentage of bioactive random sequences. Unless, of course, there is some kind of ascertainment bias that means that the “evaluated clones” were a non-random subset of all the random sequences. Are they?

The number of “evaluated” cloned random sequences per experiment was between 499 and 1061 (Table 1 of the paper). A close reading of the methods section shows that each experiment started with approximately 1,000,000 random sequences. Thus roughly 99.9% of the random clones present at the start of the experiment were not “evaluated”. The “evaluated” clones are a tiny subset.

But are they a biased subset? To be “evaluated” the clones had to fit the requirement of “occurrence of at least five times or more in any one replicate of an experiment”, as detected by Illumina sequencing. Large numbers of clones never reached this threshold. Thus the clones which were “evaluated” are (as far as I can see) by definition ones that reached higher frequencies than the majority of other clones. This looks to me like a strong ascertainment bias: the clones that were evaluated are the ones more likely to be beneficial to their carrier E. coli cells.
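To illustrate why a read-count threshold can act as a filter on frequency, here is a minimal sketch. It assumes that the number of reads observed for a clone is roughly Poisson-distributed around an expected value; that model and the expected read numbers below are my own illustrative assumptions, not figures from the paper. Only the threshold of five comes from the paper's criterion.

```python
import math

THRESHOLD = 5  # "at least five times" criterion quoted from the paper


def prob_at_least(k, mean_reads):
    """P(X >= k) for a Poisson-distributed read count with the given mean."""
    below = sum(
        math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Hypothetical expected read counts: a typical clone versus clones that
# have risen to two-fold and four-fold higher frequency.
for mean_reads in (1.0, 2.0, 4.0):
    p = prob_at_least(THRESHOLD, mean_reads)
    print(f"expected reads {mean_reads:.0f}: P(evaluated) = {p:.4f}")
```

Under these assumptions, a modest rise in a clone's frequency makes it far more likely to cross the threshold, which is the kind of filtering I am worried about.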

The 77% of these “evaluated” clones that appear to be bioactive are less than 0.1% of the clones that were present at the start of the experiment. That does not strike me as “abundant”.
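The arithmetic behind that figure can be checked with a short sketch. The numbers are the ones quoted above; the library size is only the approximate value from the methods section, and adding the two "up to" percentages together is a simplification, so treat the result as a rough upper bound rather than a measurement.

```python
# Figures quoted in the post. LIBRARY_SIZE is approximate.
LIBRARY_SIZE = 1_000_000
EVALUATED_RANGE = (499, 1061)      # evaluated clones per experiment (Table 1)
BIOACTIVE_SHARE = 0.25 + 0.52      # "up to" values from the abstract


def fraction_of_library(evaluated, share=BIOACTIVE_SHARE, library=LIBRARY_SIZE):
    """Bioactive evaluated clones as a fraction of the starting library."""
    return evaluated * share / library


for evaluated in EVALUATED_RANGE:
    frac = fraction_of_library(evaluated)
    print(f"{evaluated} evaluated clones: {frac:.4%} of the starting library")
```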

One might argue that, as the authors found both apparently beneficial and apparently deleterious sequences in their “evaluated” sample, the sample could not have been biased towards beneficial sequences. But in fact, the division between the deleterious and beneficial bioactivities is not that clear cut, which brings me to the second term I want to discuss.


Now for the term “bioactive”. It is a term that covers both beneficial and deleterious functionality. For the evolution of new genes, it is the beneficial changes that are of most interest, as natural selection can increase their frequency in populations. Deleterious random sequences are bad news, both for their host cells, and for the de novo gene evolution hypothesis. If deleterious functionality is abundant in random sequences, then pervasive low-level transcription of junk DNA is unlikely to be a starting point for new gene evolution.

In the Neme et al. paper, the random sequences are expressed from vectors placed in E. coli cells, and the cells are competing against each other. The beneficial sequences are the ones whose host cells rise in frequency over the generations, and the deleterious ones are those which fall in frequency.

In a recent commentary on this paper in Current Biology, Caroline Weismann and Sean Eddy from Harvard point out that this division between beneficial and deleterious sequences is questionable. They write:

“…sequence enrichment does not mean that a sequence is beneficial relative to wildtype E. coli, only that it was better than other random sequence competitors. It could be that all the random sequences are deleterious to E. coli, but some are less deleterious than others, and these would rise to higher relative frequencies.”

One might add that, equally, all of the sequences could be beneficial, but some are more beneficial than others and out-compete them. We simply can’t tell from the changes in frequency alone.
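A minimal sketch shows why frequency changes alone cannot settle the question. It assumes simple exponential growth of each clone from equal starting abundance, and the growth-rate values are hypothetical; none of this is taken from the paper. Shifting every clone's growth rate by the same amount, so that all are worse or all are better than a wild-type reference at zero, leaves the final frequencies unchanged.

```python
import math


def final_frequencies(rates, generations=20.0):
    """Clone frequencies after exponential growth from equal abundance."""
    sizes = [math.exp(rate * generations) for rate in rates]
    total = sum(sizes)
    return [size / total for size in sizes]


# Hypothetical growth-rate differences among three clones, measured
# relative to a wild-type reference whose rate is defined as zero.
mixed = [-0.02, 0.0, 0.03]
all_deleterious = [rate - 0.10 for rate in mixed]  # every clone worse than wild type
all_beneficial = [rate + 0.10 for rate in mixed]   # every clone better than wild type

for label, rates in (
    ("mixed", mixed),
    ("all deleterious", all_deleterious),
    ("all beneficial", all_beneficial),
):
    freqs = ", ".join(f"{f:.3f}" for f in final_frequencies(rates))
    print(f"{label:>15}: {freqs}")
```

The three scenarios give the same frequencies because the common shift multiplies every clone's abundance by the same factor, which cancels when frequencies are computed.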

Neme et al. sought to overcome this problem by choosing three clones that seemed to be beneficial and competing them only against E. coli cells carrying an empty vector (exactly how they selected these three clones is not clear, so it is hard to assess what ascertainment bias this may have involved). They found that “all [three] are better than the empty vector”.

Weismann and Eddy are not convinced that these random sequences are beneficial compared to wild-type E. coli. They argue that the vector that Neme et al. use to express the random sequences in the E. coli cells is in itself deleterious to the E. coli cells, even when it is empty. They make a case that the apparently beneficial random sequences are beneficial only because they reduce the deleterious activities of the vector that they are cloned into. They write:

“Because high-level expression of any exogenous plasmid-encoded sequence is detrimental to the E. coli host, under these conditions a beneficial random sequence could include anything that decreases RNA or protein expression levels relative to the vector without insert, for instance by base-pairing complementarity to the translation initiation site. Indeed, all three beneficial clones seem to show strongly reduced protein expression relative to the population average of the library.”


Putting it all together, I hypothesise that the Neme et al. study has an intrinsic ascertainment bias that is selecting mainly for random sequences that ameliorate the harmful effects of their carrier vector. The experiment is tracking the dynamics of competition among these ameliorating sequences, which are a tiny subset of the random sequences present at the start of the experiment.

There is “bioactivity” here, but it isn’t “abundant”, and it probably works by reducing the deleterious nature of the expression vector.

Clearly there is a lot here that is interesting and needs to be followed up. I hope that Neme et al. get a large research grant that will allow this. It would be great to investigate more thoroughly the mechanism of the interactions between the random sequences, their vectors and their cells. My hypothesis about what is going on may well be wrong.

But as for orphan genes, the field still lies wide open. This experiment’s results do not show that random sequences can provide new genes de novo to enhance the function of wild-type cells, as I had naively thought they might.