A STATISTICAL ANALYSIS OF THE AMINO ACID COMPOSITIONS OF PROTEINS

Abstract
Univariate and multivariate analyses are presented that give a descriptive statistical view of the amino acid compositions of proteins. Means, expressed in mole percent, and standard deviations are calculated from a sample of 207 proteins. The distributions of six of the 18 amino acids (Asp, Val, Met, Phe, Ile, and Leu) appear approximately normal (Gaussian), whereas the distributions of the other 12 are judged not to be normal by a chi-squared test and by the coefficients of kurtosis and skewness. Principal component analysis reveals that the relationships among the amino acids are quite complex: for example, an eight-dimensional space is needed to account for 70% of the variability in the data. Several pairs of amino acids have significant correlation coefficients, but most of these correlations have no obvious rationale in terms of the structures of the amino acids. The following pairs of protein sets have mean amino acid compositions that differ significantly, although in no case does a single amino acid account for the overall difference: (eukaryotic enzymes, eukaryotic nonenzymes), (eukaryotic enzymes, prokaryotic enzymes), and (nonmicrobial enzymes, microbial enzymes).
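
As an illustration of the kind of calculation summarized above, the following minimal sketch computes the coefficients of skewness and kurtosis for one amino acid and counts how many principal components are needed to reach a given fraction of the total variance. It is not the authors' procedure. The function names are hypothetical, the data are synthetic stand-ins (207 random compositions over 18 amino acids, in mole percent), and running the principal component analysis on the correlation matrix rather than the covariance matrix is an assumption.

```python
import numpy as np


def skewness_and_excess_kurtosis(x):
    """Return (skewness, excess kurtosis) using population moments."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # standardize with ddof=0
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)


def components_needed(X, threshold=0.70):
    """Smallest number of principal components whose cumulative share of
    variance reaches `threshold`.

    Assumption: PCA is done on the correlation matrix of the columns.
    Returns (count, cumulative variance fractions).
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them.
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    # Guard against tiny negative values caused by round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    count = int(np.searchsorted(cumulative, threshold) + 1)
    return count, cumulative


# Synthetic stand-in data, NOT the 207 proteins analyzed in the paper.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(18) * 5.0, size=207) * 100.0  # rows sum to 100

k, cumulative = components_needed(X, threshold=0.70)
skew, kurt = skewness_and_excess_kurtosis(X[:, 0])
print(f"components needed for 70% of variance: {k}")
print(f"first column: skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

With real composition data, a large component count relative to 18 would indicate, as reported above, that no small set of linear combinations captures most of the variability.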