Practical 3: Introduction to protein structure


In the last lectures we learned how protein structures are determined experimentally. Today, we'll focus on how computational techniques are employed to aid structure determination.

There are two main techniques for solving protein structures: x-ray crystallography and Nuclear Magnetic Resonance (NMR). As can be seen from the current PDB holdings, more than 77,000 protein structures have been solved so far and are available from the Protein Data Bank. About 12% of these are NMR structures; the rest are x-ray crystallographic structures. The PDB holdings list also shows that the number of solved structures grows ever faster, thanks to advances in structure determination techniques. Nevertheless, the number of known protein sequences is almost an order of magnitude larger (currently about 533,000 in the UniProtKB/Swiss-Prot database).

A. Reading PDB files

Experimentally solved protein structures are stored at the Protein Data Bank, from which individual protein structures can be retrieved as so-called PDB files. Before we turn to the structure determination itself, let us have a closer look at a typical PDB file, to see what can be learned about the background of the structure (experimental conditions etc.) and the structural quality (the resolution, coordinate uncertainty). We will focus on PDB entry 1DWR (an x-ray structure of myoglobin complexed with carbon monoxide) as it can be downloaded from the Protein Data Bank.

PDB file format

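The initial lines of a PDB entry contain information on the molecule (COMPND), its source organism (SOURCE), the experimental method (EXPDTA), the authors and the literature reference (AUTHOR, JRNL), the revision history (REVDAT), and the resolution and refinement statistics (REMARK 2 and REMARK 3).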

Especially critical to check are the resolution and the R-factor (particularly the free R-factor). The R-factors indicate how well the deposited structure matches the measured data (x-ray reflection intensities in this case).

HEADER    OXYGEN TRANSPORT                        11-DEC-99   1DWR              
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: MYOGLOBIN;                                                 
COMPND   3 CHAIN: A                                                             
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: EQUUS CABALLUS;                                 
SOURCE   3 ORGANISM_COMMON: HORSE;                                              
SOURCE   4 ORGANISM_TAXID: 9796;                                                
SOURCE   5 ORGAN: HEART                                                         
KEYWDS    OXYGEN TRANSPORT, RESPIRATORY PROTEIN                                 
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR   2 I.SCHLICHTING                                                        
REVDAT   3   24-FEB-09 1DWR    1       VERSN                                    
REVDAT   2   29-APR-05 1DWR    1       REMARK HET    HETNAM FORMUL              
REVDAT   2 2                           HETATM                                   
REVDAT   1   03-MAR-00 1DWR    0                                                
JRNL        AUTH 2 J.BERENDZEN,I.SCHLICHTING                                    
JRNL        REF    NATURE                        V. 403   921 2000              
JRNL        REFN                   ISSN 0028-0836                               
JRNL        PMID   10706294                                                     
JRNL        DOI    10.1038/35002641                              
REMARK   2                                                                      
REMARK   2 RESOLUTION.    1.45 ANGSTROMS.                                       
REMARK   3                                                                      
REMARK   3 REFINEMENT.                                                          
REMARK   3   PROGRAM     : X-PLOR 3.851                                         
REMARK   3   AUTHORS     : BRUNGER                                              
REMARK   3                                                                      
REMARK   3  DATA USED IN REFINEMENT.                                            
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.45                           
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 20                             
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 0.0                            
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL                           
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL                           
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 96.1                           
REMARK   3   NUMBER OF REFLECTIONS             : 23794               
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT                      
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM                          
REMARK   3   R VALUE            (WORKING SET) : 0.211                           
REMARK   3   FREE R VALUE                     : 0.255                           
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.0                             
REMARK   3   FREE R VALUE TEST SET COUNT      : NULL                            
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL                       
REMARK   3  RMS DEVIATIONS FROM IDEAL VALUES.                                   
REMARK   3   BOND LENGTHS                 (A) : 0.013                           
REMARK   3   BOND ANGLES            (DEGREES) : 1.93                            
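The quality indicators in the excerpt above can also be pulled out of a PDB file automatically. The following is a minimal sketch, assuming the REMARK 2 and REMARK 3 layout shown in the 1DWR excerpt; the function name `read_quality` and the dictionary keys are illustrative choices, and other refinement programs may write these remarks slightly differently.

```python
import re


def to_float(text):
    """Convert a REMARK value to float; entries such as NULL become None."""
    try:
        return float(text)
    except ValueError:
        return None


def read_quality(lines):
    """Collect resolution, working R and free R from PDB REMARK records."""
    quality = {}
    for line in lines:
        if line.startswith("REMARK   2 RESOLUTION."):
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                quality["resolution"] = float(match.group(1))
        elif line.startswith("REMARK   3"):
            # In REMARK 3, label and value are separated by a colon.
            label, _, value = line[10:].partition(":")
            label, value = label.strip(), value.strip()
            if label.startswith("R VALUE") and "WORKING SET" in label:
                quality["r_work"] = to_float(value)
            elif label == "FREE R VALUE":
                quality["r_free"] = to_float(value)
    return quality


example = [
    "REMARK   2 RESOLUTION.    1.45 ANGSTROMS.",
    "REMARK   3   R VALUE            (WORKING SET) : 0.211",
    "REMARK   3   FREE R VALUE                     : 0.255",
    "REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.0",
]
quality = read_quality(example)
print(quality)
```

To use it on a downloaded entry, pass the open file instead of the example list, e.g. `read_quality(open("1DWR.pdb"))`.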

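Following the introductory material, some specific information regarding the protein and its crystalline form is provided, including the chemical formula of the hetero groups (FORMUL), the secondary structure elements (HELIX) and the unit cell dimensions and space group of the crystal (CRYST1):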

FORMUL   6  HOH   *132(H2 O1)                                                   
HELIX    1   1 SER A    3  ASP A   20  1                                  18    
HELIX    2   2 ASP A   20  HIS A   36  1                                  17    
HELIX    3   3 HIS A   36  GLU A   41  1                                   6    
CRYST1   63.600   28.800   35.600  90.00 106.50  90.00 P 1 21 1      2                   
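As an illustration of how the crystal information can be used, the unit cell volume follows directly from the six cell parameters on the CRYST1 line. The sketch below assumes the standard fixed-column layout of the CRYST1 record; the function names are hypothetical.

```python
import math


def parse_cryst1(line):
    """Read cell lengths (Angstrom), angles (degrees) and space group from CRYST1."""
    return {
        "a": float(line[6:15]),
        "b": float(line[15:24]),
        "c": float(line[24:33]),
        "alpha": float(line[33:40]),
        "beta": float(line[40:47]),
        "gamma": float(line[47:54]),
        "space_group": line[55:66].strip(),
    }


def cell_volume(cell):
    """General triclinic formula for the unit cell volume in cubic Angstrom."""
    ca, cb, cg = (math.cos(math.radians(cell[k])) for k in ("alpha", "beta", "gamma"))
    root = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    return cell["a"] * cell["b"] * cell["c"] * root


cell = parse_cryst1(
    "CRYST1   63.600   28.800   35.600  90.00 106.50  90.00 P 1 21 1      2"
)
volume = cell_volume(cell)
print(cell["space_group"], round(volume, 1))
```

For a monoclinic cell such as this one (alpha = gamma = 90 degrees) the formula reduces to a * b * c * sin(beta).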

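Then come the actual atom coordinates (or structure), whose listing takes up most of the average PDB file.
Each listing begins with "ATOM" and is followed by the atom serial number, the atom name, the residue name, the chain identifier, the residue number, the x, y and z coordinates (in Angstrom), the occupancy, the temperature factor and the element symbol: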

ATOM      1  N   GLY A   1      -2.316  16.963  14.230  1.00 20.85           N  
ATOM      2  CA  GLY A   1      -2.992  16.384  15.439  1.00 19.00           C  
ATOM      3  C   GLY A   1      -2.137  15.253  16.013  1.00 18.18           C  
ATOM      4  O   GLY A   1      -2.132  14.129  15.477  1.00 18.59           O  
ATOM      5  N   LEU A   2      -1.387  15.560  17.074  1.00 15.98           N  
ATOM      6  CA  LEU A   2      -0.561  14.582  17.790  1.00 13.52           C  
ATOM      7  C   LEU A   2      -1.065  14.383  19.213  1.00 12.59           C  
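Because the PDB format is column-based rather than whitespace-separated, coordinate lines are best read by slicing fixed columns. The sketch below assumes the standard column positions of the ATOM record (HETATM records use the same layout); the `Atom` tuple and the function names are illustrative, not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line):
    """Parse one fixed-width ATOM or HETATM record (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


def read_atoms(lines):
    """Keep only coordinate records and parse each of them."""
    return [parse_atom_line(l) for l in lines if l.startswith(("ATOM  ", "HETATM"))]


atoms = read_atoms([
    "ATOM      1  N   GLY A   1      -2.316  16.963  14.230  1.00 20.85           N",
    "HETATM 1203 FE   HEM A 154      14.347  28.659   5.074  1.00  8.13          FE",
    "TER",
])
for atom in atoms:
    print(atom.res_name, atom.name, atom.x, atom.y, atom.z, atom.b_factor)
```

Splitting on whitespace would work for these two lines but breaks on files where neighbouring columns touch, which is why the slices are used.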

This goes on for a while, until the end of the peptide chain, which is marked by the "TER" line. If there are any other molecules that co-crystallized with the protein (such as solvent molecules or ligands) they are listed as "hetero-atoms" near the end of the file.

HETATM 1203 FE   HEM A 154      14.347  28.659   5.074  1.00  8.13          FE  
HETATM 1204  CHA HEM A 154      15.659  31.898   5.315  1.00  5.90           C  
HETATM 1205  CHB HEM A 154      13.490  28.753   8.433  1.00  6.10           C  
HETATM 1206  CHC HEM A 154      13.145  25.505   4.903  1.00  5.24           C  
HETATM 1207  CHD HEM A 154      15.262  28.616   1.824  1.00  6.55           C  


B. X-ray crystallography

The main technique for determining protein structures is x-ray crystallography. Since the first protein structure (myoglobin) was solved by this technique by John Kendrew and Max Perutz in the late fifties, tens of thousands of others have followed. As can be appreciated from the picture on the right, which shows John Kendrew with the structural model of myoglobin, determining a structure the size of a protein without the aid of a computer was a formidable task at that time.

It is important to note that in both x-ray crystallography and NMR, protein structures are not measured directly in the experiment. Rather, a set of data is collected (a diffraction pattern or an NMR spectrum), from which a model of the protein structure is derived. To appreciate the difference between data and structure, we'll now look at two different structures of the same protein, and the corresponding x-ray crystallographic data. For this, we will concentrate on the bacterial light-driven proton pump bacteriorhodopsin (bR). Click here for more background information on bR.

First download two bR structures from the Protein Data Bank, with PDB entries 1BRR and 1QHJ. Save both PDB files to your local account (see the last lecture if you forgot how to download from the PDB). View the structures with rasmol:

rasmol 1BRR.pdb
To focus on only one of the three protein molecules in the file, type (in rasmol):
restrict *a
To see the main features of the structure:
color structure
To highlight the dye (the light sensor in the protein interior):
select ret and *a
Have a close look at the structure, before repeating the procedure for 1QHJ.pdb in a different window, such that you can compare the two structures.

Question: By looking at the structures, which of the two structures would you prefer, in terms of coordinate accuracy?

Remember, so far we have only looked at the coordinates, which represent a model that was optimized against the actually measured data. So, let us now have a look at the data. In x-ray crystallography, data are collected by measuring the diffraction pattern that is produced when x-rays are scattered by a protein crystal. As mentioned in the lecture, this diffraction pattern by itself does not suffice to determine the complete structure, since only the amplitudes of the diffracted waves are recorded, not their phases. There are, however, a number of tricks available to recover the phases (e.g. isomorphous replacement, molecular replacement), but we will not go into that in detail here. What is important to remember is that eventually, an electron density map is obtained.
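Why the missing phases matter can be made concrete with a one-dimensional toy example. The sketch below is not a crystallographic calculation: it treats a short list of numbers as the "electron density" of a unit cell, computes its discrete Fourier transform as a stand-in for the structure factors, and then rebuilds the density once with the correct phases and once with all phases set to zero. All names and numbers are made up for illustration.

```python
import cmath


def dft(values):
    """Discrete Fourier transform: a toy analogue of the structure factors."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]


def inverse_dft(coeffs):
    """Fourier synthesis: rebuild the (real) density from the coefficients."""
    n = len(coeffs)
    return [
        (sum(c * cmath.exp(2j * cmath.pi * h * x / n) for h, c in enumerate(coeffs)) / n).real
        for x in range(n)
    ]


density = [0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 8.0, 1.0]   # toy 1D "unit cell"
structure_factors = dft(density)
amplitudes = [abs(f) for f in structure_factors]       # what the experiment records
phases = [cmath.phase(f) for f in structure_factors]   # what the experiment loses

with_phases = inverse_dft([cmath.rect(a, p) for a, p in zip(amplitudes, phases)])
without_phases = inverse_dft(amplitudes)               # all phases set to zero

print([round(v, 2) for v in with_phases])
print([round(v, 2) for v in without_phases])
```

With the correct phases the original density is recovered; with the phases discarded, the same amplitudes give a map that no longer resembles it.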

Question: Why do primarily the electrons of a molecular sample contribute to the diffraction of x-rays?

Visit the Uppsala electron density server to view the electron density map of 1BRR. Enter the PDB code (1BRR), and wait for the summary page to load. Several plots with information on this structure are available. Feel free to browse around to check the meaning of the individual plots. At the bottom of the summary, the electron density viewer can be activated. Select the "Astex viewer" and click "Go". After a while, a Java applet should appear with the electron density and the model structure visible. In the lower part of the window you'll see the sequence of the protein. Use the mouse to scroll to around position 80 until you see the sequence fragment "WARYA", and click on the "Y". You should now see the six-membered aromatic ring of the tyrosine. Do you find the electron density for the aromatic ring convincing? Now shift the focus to residue S35, which is numbered 33 in the sequence (the "S" in the sequence "SDPDA"). How is the fit between the model and the data here? To see the retinal, the light sensor in the center of the protein, go to the very end of the sequence.

Now repeat the procedure for entry 1QHJ. How is the fit for residue Y83 (numbered 79 here)? And for S35 (numbered 31 here)? And the retinal?

Question: Based on the data and on the model structures, would you say there is a large impact of the resolution of the data on the accuracy of the structural model?

Question: What ranges of resolution do you think belong to low, medium and high resolution structures? Which typical structural features do you expect to be resolved in each case?

The highest resolution x-ray crystallographic structures have a resolution of approx. 0.8 Angstrom or even somewhat better. To see an example of such a dataset, look at the density for structure 2B97. Note that you can zoom into the map by holding "Shift" on the keyboard and moving the mouse with the left button pressed. Do you recognize the difference in appearance?

Question: Although the resolution of this structure is rather high, at 0.75 Angstrom, the hydrogen atoms (e.g. on the side chains) are still difficult to see. Why is this?

A measure for the coordinate uncertainty of the individual atoms due to the thermal motion in the crystal is given by the temperature factor (or B factor).
Low B-factors (< 30) correspond to well-defined parts of the structure, whereas high B-factors (> 80) might indicate highly disordered parts of the structure or even mis-interpreted parts of the model.
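To get a feeling for what these numbers mean, a B-factor can be converted into a displacement. For an isotropic B-factor the standard relation is B = 8 * pi^2 * <u^2>, where <u^2> is the mean-square displacement of the atom along a given direction. The sketch below applies this relation to the two thresholds mentioned above and to two B-factors taken from the 1DWR excerpt; the function name is an illustrative choice.

```python
import math


def rms_displacement(b_factor):
    """Return sqrt(<u^2>) in Angstrom for an isotropic B-factor in Angstrom^2."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


for b in (8.13, 20.85, 30.0, 80.0):
    print(f"B = {b:5.2f} A^2  ->  rms displacement = {rms_displacement(b):.2f} A")
```

A B-factor of 30 thus corresponds to a root-mean-square displacement of roughly 0.6 Angstrom, and a B-factor of 80 to roughly 1.0 Angstrom.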

Question: How do the temperature factors of a crystallographic structure in principle compare to the flexibilities of a protein in a MD simulation?



C. NMR

The other main technique for determining protein structures is NMR. In contrast to x-ray crystallography, no crystals are required for an NMR experiment. Rather, the structure of the protein is determined in solution. This has the advantage that the protein can be studied in an environment close to its native one. On the other hand, the resolution of an NMR structure is usually lower, and structure determination by NMR is limited to proteins of up to a few hundred amino acids.

It would go beyond the scope of this course to explain the NMR experiment in detail. We will therefore only briefly touch on the experimental setup and then focus on the structure building and refinement step based on the obtained data. The NMR signal is recorded as a nuclear magnetic resonance spectrum, predominantly of the hydrogen atoms, after the sample, placed in a strong magnetic field, has been subjected to one or more radio-frequency pulses. Mainly hydrogen atoms give rise to the signal, because of the magnetic spin properties of the hydrogen nucleus (a proton). The most abundant naturally occurring isotopes of carbon (12C) and oxygen (16O) have a zero nuclear magnetic moment. Nitrogen (14N) does have a non-zero magnetic moment, but can usually not be used in NMR, for reasons that would go beyond the scope of this course to explain. These elements, therefore, can only be utilized in NMR experiments when replaced by a specific isotope, like 13C or 15N (isotope labeling).

The most structurally relevant information is usually obtained from a so-called NOESY experiment (Nuclear Overhauser Enhancement SpectroscopY). The Nuclear Overhauser Effect or Nuclear Overhauser Enhancement is the change (enhancement) of the signal intensity from a given nucleus as a result of exciting or saturating the resonance frequency of another nucleus. Since this effect is distance-dependent, it can be used to derive the distance between an interacting pair of protons. In practice, protons less than about 6 Angstrom apart can be identified this way.
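The distance dependence of the NOE is steep: in the commonly used isolated spin-pair approximation the intensity falls off with the sixth power of the proton-proton distance. An unknown distance can then be estimated by comparing a peak with a reference pair of known distance. The sketch below illustrates this calibration; it ignores spin diffusion and internal motion, and the reference values in the example are hypothetical.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Estimate a proton-proton distance from NOE intensities.

    Assumes intensity is proportional to r**-6 (isolated spin pair), so
    r = r_ref * (I_ref / I) ** (1/6).
    """
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


# Hypothetical reference: a pair at 2.5 Angstrom giving an intensity of 64 units.
for intensity in (64.0, 8.0, 1.0):
    print(intensity, round(noe_distance(intensity, 64.0, 2.5), 2))
```

Because of the sixth power, a 64-fold weaker peak corresponds to only a doubled distance, which is why NOE data are usually translated into loose distance bounds rather than exact distances.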

Now, we will calculate a model of the structure of a small protein, the B1 domain of protein G, from the proton-proton distance information obtained from a NOESY experiment. Download the data file containing the distance information here. You can have a look at the file (with the program "more" or "less" or a browser or editor of your choice) to convince yourself that there are indeed only distance bounds listed in this file. Additionally, we need an initial guess of the structure. Since we don't know the structure yet, we have to start from an unstructured peptide chain, which can be obtained here. Have a look at the structure with:

rasmol proteinG.pdb
Finally, we need something called a molecular topology, a chemical description of the protein: which atoms the molecule contains, which atoms are covalently bonded to each other, etc. This molecular topology file is available here. Now, in principle, we have all the data to attempt to build a structure that is in agreement with all experimentally determined distances. The only thing we still need is an input file for the CNS program.

Note that, in contrast to x-ray crystallography, where a single structure is presented, an NMR structure is usually represented by an ensemble of structures that all fulfill the NMR data. This reflects the fact that the NMR experiment probes an ensemble of protein molecules in solution.
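What "fulfilling the NMR data" means can be checked with a few lines of code: for every restrained proton pair, the distance in the model must lie between the lower and the upper bound. The sketch below is a minimal illustration; it does not read the CNS restraint file, and the atom labels, coordinates and bounds are hypothetical.

```python
import math


def violations(coords, restraints, tolerance=0.0):
    """List restraints whose model distance lies outside [lower, upper].

    coords maps an atom label to (x, y, z); restraints is a list of
    (label_i, label_j, lower, upper) tuples in Angstrom.
    """
    found = []
    for i, j, lower, upper in restraints:
        d = math.dist(coords[i], coords[j])
        if d < lower - tolerance:
            found.append((i, j, d, d - lower))   # too close: negative excess
        elif d > upper + tolerance:
            found.append((i, j, d, d - upper))   # too far: positive excess
    return found


coords = {"H1": (0.0, 0.0, 0.0), "H2": (3.0, 0.0, 0.0), "H3": (0.0, 4.0, 0.0)}
restraints = [
    ("H1", "H2", 1.8, 3.5),
    ("H1", "H3", 1.8, 3.5),
    ("H2", "H3", 1.8, 6.0),
]
found = violations(coords, restraints)
print(found)
```

A structure is usually accepted when it shows no violations above a small tolerance; the `tolerance` argument stands in for such a threshold.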

To start CNS, the library libg2c is needed. After downloading libg2c.tgz, create a folder for the library files

mkdir libg2c
extract the files
tar -C libg2c -xvf libg2c.tgz
cd libg2c
and put the library into your dynamically linked library path
export LD_LIBRARY_PATH=`pwd`

Now return to the directory that contains the protein structure, topology and CNS input files

cd ../
and set up the CNS environment with
source /usr/global/cns/cns_solve_env
If you get an error, try to source it again; this might solve the problem. Then start CNS with
/usr/global/cns/intel-i686-linux_g77/bin/cns < anneal.inp
Ten structures of the B1 domain of protein G will now be calculated by simulated annealing. This is a computationally intensive calculation, in which the structure is gradually transformed from the extended starting conformation into the folded structure by a slow-cooling simulation. While the calculation is running, we can have a look at how exactly such a calculation proceeds, and how the final structure is generated from the starting guess. Download this file, open another shell, and open the just downloaded file in pymol:
pymol sa.pdb
On the white bar in the top pymol window, type:
show cartoon
and then press the "play" button at the bottom right of the main pymol window, to see an animation of the simulated annealing structure calculation procedure. If the movie plays too fast, on the menu, under movie -> speed, choose a different speed.
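The idea behind simulated annealing can be tried out on a toy problem. CNS drives the annealing with molecular dynamics on a full force field; the sketch below swaps in a much simpler Metropolis Monte Carlo scheme on a two-dimensional chain of six beads, with an energy that only penalizes violated distance bounds. The bead chain, the restraints and all parameter values are made up for illustration.

```python
import math
import random


def restraint_energy(coords, restraints):
    """Sum of squared violations of (i, j, lower, upper) distance bounds."""
    energy = 0.0
    for i, j, lower, upper in restraints:
        d = math.dist(coords[i], coords[j])
        if d < lower:
            energy += (lower - d) ** 2
        elif d > upper:
            energy += (d - upper) ** 2
    return energy


def anneal(start, restraints, t_start=2.0, t_end=0.01, steps=5000,
           step_size=0.5, seed=0):
    """Metropolis Monte Carlo with geometric cooling; returns the best state seen."""
    rng = random.Random(seed)
    coords = [list(p) for p in start]
    energy = restraint_energy(coords, restraints)
    best, best_energy = [tuple(p) for p in coords], energy
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        k = rng.randrange(len(coords))
        old = coords[k][:]
        coords[k] = [c + rng.uniform(-step_size, step_size) for c in old]
        delta = restraint_energy(coords, restraints) - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            energy += delta
            if energy < best_energy:
                best, best_energy = [tuple(p) for p in coords], energy
        else:
            coords[k] = old
    return best, best_energy


# Extended starting chain: six beads on a line, one unit apart.
start = [(float(i), 0.0) for i in range(6)]
bonds = [(i, i + 1, 0.9, 1.1) for i in range(5)]      # keep neighbours bonded
contacts = [(0, 5, 0.9, 2.0), (1, 4, 0.9, 2.0)]       # "NOE-like" contacts
restraints = bonds + contacts

initial_energy = restraint_energy(start, restraints)
best, best_energy = anneal(start, restraints)
print(initial_energy, best_energy)
```

At high temperature uphill moves are accepted often, which lets the chain escape from poor arrangements; as the temperature drops, the search settles into a conformation with few or no violations. Changing `t_start` and `t_end` here is a cheap way to explore the effect of the cooling schedule.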

When the CNS structure calculation has finished, switch back to that window and type:

cat anneala_*.pdb > anneal.pdb
to combine all ten generated structures into one file. View the result with:
rasmol anneal.pdb
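Besides visual inspection, the variability within the ensemble can be quantified. Since the calculated structures are not necessarily superimposed, the sketch below uses a measure that does not depend on orientation: for each atom, it averages the standard deviation (across models) of its distances to all other atoms. It assumes that the C-alpha coordinates of each model have already been extracted into lists in the same atom order; the function name and the example coordinates are hypothetical.

```python
import math
from statistics import pstdev


def distance_variability(models):
    """Per-atom spread of intramolecular distances across an ensemble.

    models is a list of structures, each a list of (x, y, z) tuples in the
    same atom order. Low values mark well-defined parts of the structure.
    """
    n_atoms = len(models[0])
    scores = []
    for i in range(n_atoms):
        spreads = [
            pstdev([math.dist(m[i], m[j]) for m in models])
            for j in range(n_atoms) if j != i
        ]
        scores.append(sum(spreads) / len(spreads))
    return scores


# Two toy "models" of three atoms: the third atom moves, the first two do not.
models = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 1.0, 0.0)],
]
scores = distance_variability(models)
print([round(s, 2) for s in scores])
```

A rigidly rotated copy of a structure gives a score of zero for every atom, so differences in orientation between the models do not show up as variability.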

Question: Which parts of the structure are well-defined, and which parts show more ambiguity?

An x-ray structure of the B1 domain of protein G is also available, under the PDB code 1PGB. Download it from the Protein Data Bank and compare it to the NMR structure you just calculated.

Question: What are the main differences between the NMR and x-ray structures of the B1 domain of protein G?

Question: Which limitations do you think NMR and x-ray crystallography have, respectively?

D. Optional

  • change the initial temperature of the annealing simulation in the file anneal.inp
  • change the final temperature of the annealing simulation in the file anneal.inp

    Question: How do you expect these settings to change the results?


For questions or feedback please contact Bert de Groot.