aaMIn                  package:aaMI                  R Documentation

_N_o_r_m_a_l_i_z_e_d _M_u_t_u_a_l _I_n_f_o_r_m_a_t_i_o_n _f_o_r _a _P_r_o_t_e_i_n _S_e_q_u_e_n_c_e _A_l_i_g_n_m_e_n_t

_D_e_s_c_r_i_p_t_i_o_n:

     Calculate a matrix of pairwise normalized mutual information
     values for a protein sequence alignment

_U_s_a_g_e:

     aaMIn(file)

_A_r_g_u_m_e_n_t_s:

    file: a connection or character string giving the name of the file
          to load.

_D_e_t_a_i_l_s:

     This script calculates the normalized mutual information between
     pairs of sites for a protein sequence alignment. The normalization
     constant used is the joint entropy, as described by Gloor, et al.
     (2005). The alignment must be in the form of a data frame with the
     sequence IDs as the row names and each site as a column.  The
     program begins by calculating the amino acid frequencies at each
     site. These frequencies are then used to calculate a vector
     containing the Shannon entropy H for each site. Shannon entropy is
     calculated using the equation

                  H_i = sum[i] (P(X_i)log2(P(X_i)))

     where   P(X_i) = frequency of amino acid X at site _i_ of the
     alignment. Next the program calculates the joint probabilities
     P(X_i,Z_j) of pairs of amino acids X and Z at sites _i_ and _j_.
     The  joint probabilities are used to calculate the joint entropy
     with the formula

                 sum_{i,j}(P(X_i,Z_j)log2(P(X_i,Z_j))


     Shannon entropy and joint entropy are used to calculate the
     normalized mutual information MI with the formula

                 MIn_ij = (H_i + H_j - JH_ij) / JH_ij

_V_a_l_u_e:

     For the analysis of a protein sequence alignment data frame
     "file", the output is an NxN upper-triangular matrix, where N is
     the number of sites in the alignment. Values along the diagonal of
     the matrix are the entropy values (H) for each site.

_A_u_t_h_o_r(_s):

     Kurt Wollenberg

_R_e_f_e_r_e_n_c_e_s:

     Gloor, G. B, L. C. Martin, L. M. Wahl, and S. D. Dunn. (2005)
     Mutual information in protein multiple sequence alignments reveals
     two classes of coevolving positions. _Biochemistry_ *44*
     7156-7165.

     Shannon, C. E. and W. Weaver. (1949) _The Mathematical Theory of
     Communication_, University of Illinois Press.

     Wollenberg, K. R. and W. R. Atchley. (2000) Separation of
     phylogenetic from functional associations in biological sequences
     by using the parametric bootstrap. _Proceedings of the National
     Academy of Science_ *97* 3288-3291.

_E_x_a_m_p_l_e_s:

     ## Read in a protein sequence alignment file, FastA format
     ## Not run: SeqDataFA <- read.FASTA("ProteinSeqFastA.txt")
     ## Read in a protein sequence alignment file, ClustalX .aln format
     ## Not run: SeqDataCX <- read.CX("ProteinSeq.aln")
     ## Read in a protein sequence alignment file, GeneDoc .msf format
     ## Not run: SeqDataGD <- read.Gdoc("ProteinSeq.msf")

     ## Calculate the mutual information matrix for one of these alignments.
     ## Not run: ProteinSeqminorm <- aaMIn(SeqDataGD)

