In my post about creating a DNA font, I mentioned the idea of doing a similar thing for proteins. There are 20 standard amino acids, each with a single-letter abbreviation. Add the symbols for ambiguous residues, and any combination of letters can be interpreted as a polypeptide, and vice versa. While I haven’t made a peptide font (although Kristian Rother’s Twenty Characters would make a beautiful font), I have been playing around with what structures various words would make.
What English words appear in the PDB? To answer this question, I wrote a script that searches the sequences of all proteins in the PDB for various words. As an aside, quickly searching for ~144,000 words across ~280,000 protein chains is an interesting computational challenge, and one with an elegant algorithmic solution that cut my script's run time from hours to seconds. Here are some of the longer or otherwise interesting words I found.
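As a rough illustration of how a search like this can be made fast (not necessarily the approach the script takes), one option is to load the word list into a trie and walk it from every position in each sequence, so all words are matched in a single pass instead of one scan per word. The sketch below assumes that approach; the names `build_trie` and `find_words`, the four-word list, and the toy sequence are all made up for the example and are not taken from the PDB or the repository.

```python
END = ""  # end-of-word marker; cannot collide with a one-letter residue code


def build_trie(words):
    """Build a nested-dict trie from an iterable of words (hypothetical helper)."""
    root = {}
    for word in words:
        word = word.upper()
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = word  # store the full word at its final node
    return root


def find_words(sequence, trie):
    """Return (start_index, word) for every dictionary word found in sequence."""
    hits = []
    for start in range(len(sequence)):
        node = trie
        # Follow the trie for as long as the sequence keeps matching.
        for i in range(start, len(sequence)):
            node = node.get(sequence[i])
            if node is None:
                break  # no word continues with this letter
            if END in node:
                hits.append((start, node[END]))
    return hits


# Toy data: an invented sequence, not a real PDB chain.
trie = build_trie(["cat", "eat", "heat", "wheat"])
print(find_words("MKWHEATCATS", trie))
```

Each starting position does at most as many steps as the longest word in the list, so the cost grows with the total sequence length rather than with the number of words. A full Aho-Corasick automaton would remove the restart at each position, at the price of more code.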
The longest words, at eight letters:
If you’re interested in searching for your own words, instructions and code are available from my GitHub repository.