Sequence similarity

To choose good values for coverage and identity, it helps to look at how these values are distributed in your data. One useful command is

PHLAWD seqquery runfile.phlawd

This runs the first steps of the analysis, calculating the best hits from the BLAST search, and reports their distributions so you can inspect them. When it finishes, look in the genename.seqquery file, which has two columns: 1) the best identity score and 2) the coverage for that identity.

You can plot these two columns against each other in any software, but something like R works well.

In the example plot, there is a cluster at the bottom with low identity and low coverage, and a group with variable identity and high coverage at the top. You would then set your cutoffs to something like

coverage = 0.2
identity = 0.2
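
Before settling on cutoffs, it can help to check how many best hits a candidate pair of values would keep. The following is a minimal Python sketch, not part of PHLAWD. It assumes the .seqquery file is whitespace-delimited with identity in the first column and coverage in the second, both on a 0 to 1 scale, and that a hit is kept when both values are at or above the cutoff. The function names read_seqquery and fraction_passing are hypothetical, and the demo values are made up for illustration.

```python
import os
import tempfile


def read_seqquery(path):
    """Read (identity, coverage) pairs from a two-column text file.

    Assumption: columns are separated by whitespace. Lines that do not
    parse as two numbers (blank lines, a header) are skipped.
    """
    pairs = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                pairs.append((float(fields[0]), float(fields[1])))
            except ValueError:
                continue
    return pairs


def fraction_passing(pairs, identity, coverage):
    """Fraction of rows whose identity and coverage both meet the cutoffs.

    Assumption: "meets" means greater than or equal to the cutoff.
    """
    if not pairs:
        return 0.0
    kept = sum(1 for ident, cov in pairs if ident >= identity and cov >= coverage)
    return kept / len(pairs)


if __name__ == "__main__":
    # Made-up values for illustration only, not real PHLAWD output.
    demo = "0.10 0.05\n0.15 0.10\n0.60 0.90\n0.85 0.95\n"
    with tempfile.NamedTemporaryFile("w", suffix=".seqquery", delete=False) as tmp:
        tmp.write(demo)
    pairs = read_seqquery(tmp.name)
    os.remove(tmp.name)
    for cutoff in (0.2, 0.5, 0.8):
        print(cutoff, fraction_passing(pairs, identity=cutoff, coverage=cutoff))
```

To use this on real output, pass the path of your own genename.seqquery file to read_seqquery and try a few cutoff values. If the file turns out to use a different delimiter or column order, adjust the parsing accordingly.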