The last lecture gave an introduction to protein sequences
(or primary structures), and we have learnt what information can be
extracted at the sequence level. In short, these include:
B. Multiple sequence alignment of insulin
As you may know, insulin is essential for normal metabolism,
as it stimulates glucose uptake after a meal. Malfunction of insulin
leads to diabetes, which is characterized by decreased glucose
tolerance resulting from a relative deficiency of insulin (or,
alternatively, a lack of sensitivity to insulin on the receptor side).
For a sequence analysis of insulin, we obviously first need its sequence.
For this, we visit the SWISS-PROT database, which can be accessed via
the ExPASy Proteomics
Server. Please open the ExPASy web page in another browser window
(right click -> open in new window in Mozilla), follow the
"UniProtKB" link (first link under "Popular resources"), and
search for "insulin". As you will see, there are hundreds of matches,
but none corresponds directly to human insulin. This is because, at the
sequence level, insulin is stored as a precursor. After synthesis, the
precursor protein is cleaved into the A and B chains, which together
form the active form of insulin. Thus, we select the sequence of the
human insulin precursor: INS_HUMAN (P01308).
The SWISS-PROT entry of human insulin opens with some general
information on the sequence: basic entry information,
the name and origin of the protein, literature references connected
with this sequence, and some comments concerning the function. This
section is followed by a number of cross-references to other databases
that hold information on insulin, such as the Protein Data Bank (PDB),
where protein structures are stored. As we can see, a lot of structural
information on insulin is available. In the section entitled
"Features", we can learn how the precursor sequence is related to the
active form of the hormone. As can be seen, the first 24 residues are
a signal sequence, followed by a stretch of 30 residues (25-54) that
corresponds to the B chain of insulin, and residues 90-110 make up the
A chain of insulin. Also annotated in the SWISS-PROT entry are a
number of natural mutations, leading to different forms of diabetes.
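The residue ranges in the "Features" section can also be applied to the sequence directly. The following is a minimal Python sketch, assuming the 110-residue precursor sequence of P01308 as typed in by hand below (please check it against the database entry); the helper name `feature` is our own and not part of any library.

```python
# Human insulin precursor (INS_HUMAN, P01308), typed in by hand:
# verify it against the database entry before relying on it.
PRECURSOR = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQ"
    "VGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)


def feature(sequence, start, end):
    """Return residues start..end, numbered from 1 and inclusive,
    which is how the feature table numbers them."""
    return sequence[start - 1:end]


signal = feature(PRECURSOR, 1, 24)     # signal sequence
b_chain = feature(PRECURSOR, 25, 54)   # B chain
a_chain = feature(PRECURSOR, 90, 110)  # A chain

print("signal :", signal)
print("B chain:", b_chain)
print("A chain:", a_chain)
```

Note that Python slices count from 0 and exclude the end index, hence the `start - 1` in the helper.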
Find the sequence of insulin.
Retrieve the sequence in
the so-called "FASTA" format. This is the form in which we will feed the sequence
to BLAST. Prepare the sequence for copy-and-paste to the BLAST window by
selecting the sequence (without the top title line) with the mouse and
choosing "Copy" in the Mozilla "Edit" menu.
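For those who prefer to script this step, removing the title line can be sketched in a few lines of Python. This is only an illustration: the function name `sequence_from_fasta` is hypothetical, the title line in the example is shortened, and the sequence is typed in by hand and should be checked against the entry.

```python
def sequence_from_fasta(fasta_text):
    """Return the bare sequence of a single FASTA record.

    The first line (starting with '>') is the title line and is dropped;
    the remaining lines are joined and stripped of whitespace.
    """
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record: title line '>' is missing")
    return "".join("".join(line.split()) for line in lines[1:])


# Example record (title line shortened, sequence typed in by hand).
record = """>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
"""

insulin = sequence_from_fasta(record)
print(len(insulin), insulin)
```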
Now we open a new browser window to run BLAST, which can also be accessed
from the ExPASy Proteomics
Server. On the left half of the screen, under "Resources A..Z", you'll find a "Blast" similarity search. Click on
BLAST, paste your sequence, and run BLAST with the default parameters.
After some time, we obtain the 100 database sequences closest to the human
insulin (precursor), sorted by their level of similarity. Select all sequences
by clicking the button
"Select up to" and then the checkbox next to the last sequence. After that, from the drop-down menu next to the button
"Submit query", choose "Retrieve sequences (FASTA format)" and submit.
We are now presented with the raw sequences in FASTA format. Select all
sequences for copy-and-paste, and open a new browser window at the European Bioinformatics
Institute, from which we will run the multiple alignment server.
Question: Using BLAST, the selected
sequences have already been aligned, to assess the similarity to our
target sequence. Why do we need to do another alignment?
Under the DNA and RNA Services site of the EBI, click on
"Clustal Omega". Paste your sequences in the window and run Clustal with
the default parameters. Depending on the load on the server, Clustal
will take a while to complete. The "Result Summary" window shows all the
pairwise alignments and their scores. Press the "Jalview"
button for an instructive, detailed view of the alignment. To focus on
conserved residues, under "Colour", click "by Conservation". This way,
those residues that are highly conserved get highlighted according
to the selected threshold.
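To make the idea of a conservation threshold concrete, here is a small Python sketch. It is not the score Jalview uses: as a simplification we score each alignment column by the fraction of sequences that carry the most common residue. The alignment below is a made-up toy example, not real insulin data, and the function names are our own.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue, per column.

    `aligned` is a list of equally long, already aligned sequences;
    gaps ('-') never count as the conserved residue.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def conserved_columns(aligned, threshold=1.0):
    """1-based column numbers whose conservation reaches the threshold."""
    scores = column_conservation(aligned)
    return [i for i, score in enumerate(scores, start=1) if score >= threshold]


# Toy alignment (hypothetical sequences, for illustration only).
toy = [
    "GIVEQCCTSIC",
    "GIVDQCCTSIC",
    "GIVEQCCASVC",
    "GIVEQCC-SIC",
]

print(conserved_columns(toy, threshold=1.0))
print(conserved_columns(toy, threshold=0.75))
```

Lowering the threshold admits columns in which a minority of sequences deviate, which mirrors what happens when the threshold is relaxed in the alignment viewer.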
Question: Which are the most conserved
residues? Why might these residues be conserved? All cysteine (C) residues
seem highly conserved. What might be the reason?
On the right, a picture of the insulin structure is shown, with the A
chain in yellow and the B chain in magenta. As can be seen, there are
two "bridges" connecting the A and the B chain, formed by cysteine (C)
residues on both chains. This is an important structural feature of
insulin that strongly stabilises the structure. It is therefore
easy to understand why these C residues are among the strongly
conserved residues in the hormone. As is known from other structural
studies, residues interacting with the insulin receptor include
the N-terminus of the A-chain (G-I-V-E), the C-terminus of the A-chain
(Y-C-N), and the C-terminus of the B-chain (G-F-F-Y), so for
these residues, too, there is a clear reason for their conservation. For the
other conserved residues, the reason for their conservation is less
clear, although mutating them has been shown to alter activity,
indicating a functional role.
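The cysteines discussed above are easy to locate within the two chains. The sketch below assumes the A and B chain sequences as annotated in the P01308 entry (typed in by hand here, so please verify them); the helper `residue_positions` is a hypothetical name of our own.

```python
# Chain sequences of human insulin, typed in by hand from the
# P01308 feature annotation: verify against the database.
A_CHAIN = "GIVEQCCTSICSLYQLENYCN"
B_CHAIN = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"


def residue_positions(chain, residue):
    """1-based positions at which `residue` occurs in `chain`."""
    return [i for i, r in enumerate(chain, start=1) if r == residue]


a_cys = residue_positions(A_CHAIN, "C")
b_cys = residue_positions(B_CHAIN, "C")

print("A chain cysteines:", a_cys)
print("B chain cysteines:", b_cys)
```

Comparing these positions with the conserved columns in the alignment is a quick way to check whether every cysteine is indeed conserved.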
C. Phylogenetic analysis of hemoglobin
Another application of multiple sequence analyses is the derivation of
evolutionary information, in particular the analysis of common ancestors
among different species, and their grouping (also known as taxonomy)
based on sequence similarity. This analysis is known as phylogenetic
analysis, and trees representing the sequence relationships are known
as phylogenetic trees.
In this course we will generate two phylogenetic trees, and compare
the results, to see if the mutational pattern in the one protein
(and the associated phylogenetic tree) is similar to that of the
other. For this we will take the alpha and beta chain of hemoglobin.
Hemoglobin is the main oxygen transporter in vertebrates. It takes up
oxygen in the lungs (or gills for fish) and transports it via the
blood, in red blood cells, to the brain, muscles, or any other destination
in the body where oxygen is required. In fact, blood is
coloured red because of hemoglobin. Hemoglobin contains iron,
which in that particular state is coloured red, not unlike rust.
Although part of the same protein, the sequences of the alpha and
beta chains have evolved
independently, and hence two separate phylogenetic trees can be constructed.
For the sequence retrieval, we follow the same procedure as we have done above for insulin, first for the human hemoglobin alpha chain (search for "human hemoglobin alpha"), and then for the hemoglobin beta chain. The sequence to select for the alpha chain is HBA_HUMAN (P69905), for the beta chain it is HBB_HUMAN (P68871).
In the BLAST search, restrict the search to "Eukaryota".
From the BLAST results, select only the sequences that start with HBA
and HBB, respectively (i.e. no HBD, HBA2, etc.); the easiest way to
achieve this is to select all
sequences and then to deselect the undesired ones. Paste the sequences
in the Clustal window and submit the job.
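If the list of hits is long, deselecting by hand becomes tedious, and the same selection can be sketched in Python on the retrieved FASTA text. The sketch assumes title lines of the form `>sp|ACCESSION|ENTRY_NAME description`; the function names are our own, and the records below are shortened placeholders (only P69905 / HBA_HUMAN is taken from the text above, the other accessions are dummies).

```python
def read_fasta(text):
    """Parse multi-record FASTA text into a list of (title, sequence)."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def entry_name(title):
    """Extract e.g. 'HBA_HUMAN' from 'sp|P69905|HBA_HUMAN Hemoglobin ...'."""
    parts = title.split("|")
    if len(parts) >= 3:
        return parts[2].split()[0]
    return title.split()[0]


def keep_family(records, prefix):
    """Keep records whose entry name is exactly '<prefix>_<species>'.

    Requiring the underscore right after the prefix excludes HBA2, HBD, etc.
    """
    return [(t, s) for t, s in records if entry_name(t).startswith(prefix + "_")]


# Placeholder records: sequences shortened, dummy accessions marked XXXXXX.
example = """>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha
MVLSPADKTN
>sp|XXXXXX|HBA_MOUSE Hemoglobin subunit alpha
MVLSGEDKSN
>sp|XXXXXX|HBD_HUMAN Hemoglobin subunit delta
MVHLTPEEKT
>sp|XXXXXX|HBA2_EXAMPLE Hemoglobin subunit alpha-2
MVLSAADKNN
"""

alpha_only = keep_family(read_fasta(example), "HBA")
print([entry_name(title) for title, _ in alpha_only])
```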
In the "Result summary" start Jalview. Under "Calculate" select
"Calculate tree" and "Average Distance Using % Identity".
A phylogenetic tree will be generated.
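To give an impression of what an average-distance tree built on percentage identity involves, here is a compact Python sketch of average-linkage clustering (commonly known as UPGMA). It is a simplified illustration and not Jalview's implementation: in particular, the way identity is counted over gapped columns is our own assumption, branch lengths are ignored, and the four aligned sequences are invented toy data.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical residues over columns without gaps (assumption)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def average_distance_tree(aligned):
    """Average-linkage (UPGMA-style) clustering on 100 - % identity.

    `aligned` maps names to aligned sequences. Returns the tree topology
    as nested tuples of names.
    """
    names = list(aligned)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): 100.0 - percent_identity(aligned[names[i]], aligned[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        tree_a, n_a = clusters.pop(a)
        tree_b, n_b = clusters.pop(b)
        # Distance to the new cluster: size-weighted average of the old ones.
        for c in clusters:
            d_ac = dist[frozenset((a, c))]
            d_bc = dist[frozenset((b, c))]
            dist[frozenset((next_id, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Toy alignment (hypothetical sequences, for illustration only).
toy = {
    "human": "VLSPADKTNV",
    "chimp": "VLSPADKTNV",
    "mouse": "VLSGEDKSNI",
    "fish":  "SLSDKDKAAV",
}

tree = average_distance_tree(toy)
print(tree)
```

The most similar sequences are joined first, so closely related species end up in the innermost brackets of the printed tree.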
Keep the Clustal
window for the alpha chain open and repeat the procedure for the beta
chain. By default, the results are presented as a "cladogram", which
enables easy comparison of the individual groups.
Click on the branches of the tree to invoke the color coding
of the species belonging to the same branches.
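Besides comparing the two trees by eye, one can list the groups of species (clades) that both trees have in common. The sketch below is one simple way to do so, assuming each tree is written as nested tuples of species names; both example trees are invented for illustration and are not the results of this exercise.

```python
def clades(tree):
    """Return the set of clades (as frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


# Hypothetical trees for the alpha and beta chain (not real results).
alpha_tree = ("fish", ("mouse", ("human", "chimp")))
beta_tree = (("fish", "mouse"), ("human", "chimp"))

shared = clades(alpha_tree) & clades(beta_tree)
only_alpha = clades(alpha_tree) - clades(beta_tree)

print("shared clades    :", sorted(sorted(c) for c in shared))
print("only in alpha    :", sorted(sorted(c) for c in only_alpha))
```

For real data, the species would first have to be named identically in both trees, for instance by using only the species part of the entry names (the part after the underscore).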
D. Optional exercises
Protein function typically results from modular structural features called domains. Owing to evolutionary shuffling, the same conserved domain can occur in different proteins that share functional aspects.
Domains can be used to classify proteins into families that display structural and sequence similarities within conserved regions of the protein. Pfam is an online database of protein families that contains a large set of multiple sequence alignments of protein domains.
Pfam consists of two parts:
Using the link provided to the Pfam website, search for the keyword Q12809 using the "Jump to" option. Try to answer the following questions:
If time allows, build a phylogenetic tree of a very different protein
(like a ribosomal elongation factor or F1-ATPase) and compare the
result to that of hemoglobin.