aaMI                  package:aaMI                  R Documentation

_M_u_t_u_a_l _I_n_f_o_r_m_a_t_i_o_n _f_o_r _a _P_r_o_t_e_i_n _S_e_q_u_e_n_c_e _A_l_i_g_n_m_e_n_t

_D_e_s_c_r_i_p_t_i_o_n:

     Calculates a matrix of pairwise mutual information values for a
     protein sequence alignment.

_U_s_a_g_e:

     aaMI(file)

_A_r_g_u_m_e_n_t_s:

    file: a data frame containing the protein sequence alignment, with
          the sequence IDs as the row names and each site as a column
          (such as 'SeqDataGD' in the examples).

_D_e_t_a_i_l_s:

     This script calculates the mutual information between pairs of
     sites for a protein sequence alignment. The alignment must be in
     the form of a data frame with the sequence IDs as the row names
     and each site as a column.  The program begins by calculating the
     amino acid frequencies at each site. These frequencies are then
     used to calculate a vector containing the Shannon entropy H for
     each site. Shannon entropy is calculated using the equation

                  H_i = -sum[X] (P(X_i) log2(P(X_i)))

     where P(X_i) is the frequency of amino acid X at site _i_ of the
     alignment and the sum runs over the amino acids X observed at
     that site. Next the program calculates the joint probabilities
     P(X_i,Z_j) of pairs of amino acids X and Z at sites _i_ and _j_.
     The Shannon entropies and joint probabilities are used to
     calculate the mutual information MI with the formula

                  MI_ij = H_i + H_j - H_ij

     where the joint entropy of sites _i_ and _j_ is

          H_ij = -sum[X,Z] (P(X_i,Z_j) log2(P(X_i,Z_j)))
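
     The calculation described above can be sketched in a few lines of
     base R. This is an illustrative sketch only, not the package's own
     code: the toy alignment 'aln' and the helper names 'entropy_bits',
     'site_entropy' and 'pair_mi' are hypothetical, and how aaMI itself
     treats gaps or missing residues is not stated here.

```r
# Hypothetical toy alignment: rows are sequences, columns are sites.
aln <- data.frame(
  site1 = c("A", "A", "G", "G"),
  site2 = c("L", "L", "V", "V"),
  site3 = c("K", "K", "K", "K"),
  row.names = c("seq1", "seq2", "seq3", "seq4"),
  stringsAsFactors = FALSE
)

# Shannon entropy in bits of a probability vector.
# Zero-probability cells are dropped, since p * log2(p) tends to 0.
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Entropy of one site from its observed amino acid frequencies.
site_entropy <- function(x) {
  entropy_bits(as.vector(table(x)) / length(x))
}

# Mutual information of two sites: H_i + H_j - H_ij,
# with H_ij taken from the joint frequency table.
pair_mi <- function(x, z) {
  joint <- as.vector(table(x, z)) / length(x)
  site_entropy(x) + site_entropy(z) - entropy_bits(joint)
}

H1   <- site_entropy(aln$site1)          # two equally frequent residues
H3   <- site_entropy(aln$site3)          # invariant site
MI12 <- pair_mi(aln$site1, aln$site2)    # perfectly covarying sites
MI13 <- pair_mi(aln$site1, aln$site3)    # invariant partner site
```

     In this toy case sites 1 and 2 covary perfectly, so their mutual
     information equals the entropy of either site, while the invariant
     site 3 carries no information about site 1.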

_V_a_l_u_e:

     The output is an NxN upper-triangular matrix, where N is the
     number of sites (columns) in the alignment data frame passed as
     "file". Values along the diagonal are the entropy values (H) for
     each site; the entry in row _i_, column _j_ (_i_ < _j_) is the
     mutual information MI_ij for that pair of sites.
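
     To make the layout of such a matrix concrete, the following base R
     sketch assembles one for a small made-up alignment. It is an
     assumption-laden illustration rather than the output of aaMI: the
     data frame 'aln' and the helper 'H' are hypothetical, and filling
     the lower triangle with zeros is a choice made here for clarity.

```r
# Hypothetical toy alignment with three sites.
aln <- data.frame(
  s1 = c("A", "A", "G", "G"),
  s2 = c("L", "L", "V", "V"),
  s3 = c("K", "R", "K", "R"),
  stringsAsFactors = FALSE
)

# Entropy in bits from a table of counts (one- or two-way).
H <- function(counts) {
  counts <- as.vector(counts)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

n  <- ncol(aln)
mi <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))

for (i in seq_len(n)) {
  # Diagonal: entropy of site i.
  mi[i, i] <- H(table(aln[[i]]))
  # Upper triangle only (j > i): mutual information of sites i and j.
  for (j in seq_len(n - i) + i) {
    mi[i, j] <- H(table(aln[[i]])) + H(table(aln[[j]])) -
      H(table(aln[[i]], aln[[j]]))
  }
}

print(mi)
```

     Reading the result, 'diag(mi)' gives the per-site entropies and
     'mi[i, j]' with i < j gives the mutual information for a pair of
     sites.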

_A_u_t_h_o_r(_s):

     Kurt Wollenberg

_R_e_f_e_r_e_n_c_e_s:

     Shannon, C. E. and W. Weaver. (1949) _The Mathematical Theory of
     Communication_, University of Illinois Press.

     Wollenberg, K. R. and W. R. Atchley. (2000) Separation of
     phylogenetic from functional associations in biological sequences
     by using the parametric bootstrap. _Proceedings of the National
     Academy of Sciences_ *97* 3288-3291.

_E_x_a_m_p_l_e_s:

     ## Read in a protein sequence alignment file, FastA format
     ## Not run: SeqDataFA <- read.FASTA("ProteinSeqFastA.txt")
     ## Read in a protein sequence alignment file, ClustalX .aln format
     ## Not run: SeqDataCX <- read.CX("ProteinSeq.aln")
     ## Read in a protein sequence alignment file, GeneDoc .msf format
     ## Not run: SeqDataGD <- read.Gdoc("ProteinSeq.msf")

     ## Calculate the mutual information matrix for one of these alignments.
     ## Not run: ProteinSeqmi <- aaMI(SeqDataGD)

