The Abundance of Short Proteins in the Mammalian Proteome

Abstract
Short proteins play key roles in cell signalling and other processes, but their abundance in the mammalian proteome is unknown. Current catalogues of mammalian proteins exhibit an artefactual discontinuity at a length of 100 aa, so that protein abundance peaks just above this length and falls off sharply below it. To clarify the abundance of short proteins, we identify proteins in the FANTOM collection of mouse cDNAs by analysing synonymous and non-synonymous substitutions with the computer program CRITICA. This analysis confirms that there is no real discontinuity at length 100. Roughly 10% of mouse proteins are shorter than 100 aa, although the majority of these are variants of proteins longer than 100 aa. We identify many novel short proteins, including a “dark matter” subset of proteins that lack detectable homology to other known proteins. Translation assays confirm that some of these novel proteins can be translated and localised to the secretory pathway.

Living things work by the actions of proteins and other molecules that are encoded in their genomes. Genome projects aim to construct a “parts list” of life by cataloguing all of these molecules. However, some molecules are harder to identify than others. One difficult category is short proteins. This is because protein-coding nucleotide sequences have statistical features that are very pronounced in large proteins, and so readily distinguishable from non-coding sequence, but are weak in small proteins. Thus, to avoid erroneous predictions, many genome projects have employed an artificial length threshold of 100 amino acids. Hence, short proteins are underrepresented in protein catalogues, although they are known to play important roles in immunity, cell signalling, and metabolism. This study clarifies the abundance of short proteins by exploiting two advantageous resources. The first is the large FANTOM collection of mouse transcript sequences: it is much easier to identify proteins in mature transcripts than in raw genome sequence. The second is a method to identify protein-coding sequences by examining how they differ between organisms. The redundancy of the genetic code implies that these differences will follow distinctive patterns, and this approach has the statistical power to break the 100-amino-acid barrier and reliably identify short proteins.
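
The idea that coding sequences differ between organisms in a distinctive pattern can be illustrated with a minimal Python sketch. It is not the CRITICA algorithm, only a toy illustration under stated assumptions: the two sequences are already aligned without gaps, the standard genetic code applies, and the function name `count_substitutions` and the example sequences are hypothetical. For a chosen reading frame, it counts codon differences that preserve the amino acid (synonymous) and those that change it (non-synonymous).

```python
# Toy illustration only; not CRITICA's actual scoring scheme.
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def count_substitutions(seq_a, seq_b, frame=0):
    """Count (synonymous, non_synonymous) codon differences in one frame.

    Assumes seq_a and seq_b are aligned, gap-free DNA strings.
    Codons containing characters other than A, C, G, T are skipped.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    synonymous = non_synonymous = 0
    for i in range(frame, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue  # identical codons carry no substitution signal
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # ambiguous bases: skip rather than guess
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# Hypothetical aligned sequences; both encode Met-Ala-Leu-Lys in frame 0.
seq_one = "ATGGCTCTGAAA"
seq_two = "ATGGCCCTTAAG"
for frame in range(3):
    print(frame, count_substitutions(seq_one, seq_two, frame))
```

In this made-up example, every difference is synonymous when read in frame 0, whereas the same differences appear as amino acid changes in the other two frames. An excess of synonymous changes in one frame is the kind of signal that suggests the sequence is translated in that frame; a real analysis would need a proper statistical model and many more aligned positions.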