What is the difference between homology, similarity and identity?

The term "homology" comes from comparative studies. It indicates a shared ancient origin followed by evolution over time, and it refers to structural characteristics. In comparative anatomy, it is used to compare structures in different animal species.

In comparative protein biochemistry, "homology" retains the original meaning of "having a common evolutionary origin". It is used to relate two or more proteins evolutionarily by locating common structural characteristics and a common spatial distribution of, for instance, beta strands, helices, and folds. Accordingly, homologous protein structures are defined by spatial analyses, and measuring structural homology involves computing the geometric–topological features of a space. One approach used to generate and analyze three-dimensional (3D) protein structures is homology modeling (also called comparative modeling or knowledge-based modeling). Homology modeling works by finding similar sequences, on the premise that similar sequences tend to adopt similar 3D structures. Nonetheless, it is important to note that sequence similarity is not a necessary condition for structural homology.

Sequence identity is the number of characters that match exactly between two different sequences. Gaps are not counted, and the measurement is relative to the shorter of the two sequences.

As a consequence, sequence identity is not transitive: if identity(A,B)=100% and identity(B,C)=100%, identity(A,C) is not necessarily 100%:

A: AAGGCTT

B: AAGGC

C: AAGGCAT

Here identity(A,B)=100% (5 identical nucleotides / min(length(A),length(B))).

Identity(B,C)=100%, but identity(A,C)=86% (6 identical nucleotides / 7). So 100% identity does not mean that two sequences are the same.
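
The identity values for A, B and C can be reproduced with a short Python sketch. It is a minimal illustration, not a reference implementation: the function name percent_identity is hypothetical, and it assumes the sequences are compared position by position from their first character, without gaps, which is enough for this example.

```python
def percent_identity(seq1: str, seq2: str) -> float:
    """Percent identity relative to the shorter of the two sequences.

    Assumption: the sequences are compared position by position from
    their first character (an ungapped, left-anchored comparison).
    """
    shorter = min(len(seq1), len(seq2))
    if shorter == 0:
        raise ValueError("both sequences must be non-empty")
    # zip() stops at the end of the shorter sequence, so overhanging
    # characters of the longer sequence are not counted.
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / shorter


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
print(f"identity(A,B) = {percent_identity(A, B):.1f}%")  # 100.0%
print(f"identity(B,C) = {percent_identity(B, C):.1f}%")  # 100.0%
print(f"identity(A,C) = {percent_identity(A, C):.1f}%")  # 85.7%
```

For real data the sequences would first be aligned with a dedicated tool, and identity would be counted over the aligned columns.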

Sequence similarity is, first of all, a general description of a relationship. Nevertheless, it is fairly common practice (for sequence alignments, and unless defined otherwise) to define similarity as an optimal matching problem.

Here, the optimal matching algorithm finds the minimal number of edit operations (insertions, deletions, and substitutions) needed to transform one sequence into an exact copy of the other sequence being aligned (the edit distance). Using this, the percentage sequence similarities of the examples above are sim(A,B)=60%, sim(B,C)=60%, and sim(A,C)=86% (semi-global, sim = 1 - (edit distance / unaligned length of the shorter sequence)). There are, however, other ways to define similarity between two objects (e.g. using the tertiary structure of proteins).
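
The similarity figures for A, B and C can be checked with the following Python sketch. It assumes the plain Levenshtein edit distance, in which insertions, deletions and substitutions each cost 1, and it divides by the length of the shorter sequence; this choice reproduces the numbers quoted here. The function names edit_distance and percent_similarity are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimal number of insertions, deletions
    and substitutions (each costing 1) that turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute (free if a == b)
            ))
        prev = curr
    return prev[-1]


def percent_similarity(s: str, t: str) -> float:
    """sim = 1 - (edit distance / length of the shorter sequence), in percent."""
    shorter = min(len(s), len(t))
    if shorter == 0:
        raise ValueError("both sequences must be non-empty")
    return 100.0 * (1 - edit_distance(s, t) / shorter)


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
print(f"sim(A,B) = {percent_similarity(A, B):.1f}%")  # 60.0%
print(f"sim(B,C) = {percent_similarity(B, C):.1f}%")  # 60.0%
print(f"sim(A,C) = {percent_similarity(A, C):.1f}%")  # 85.7%
```

Because the divisor is the shorter length, this measure can become negative when the two sequences differ greatly in length, so it is best read as an illustration of the formula rather than a general-purpose score.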

Source: NovoPro    2018-03-01