Choosing a coverage and identity
To estimate homology, PHLAWD takes a set of sequences (from GenBank) and compares them against a set of “known” gene sequences designated in the configuration file. PHLAWD currently uses two measures, identity and coverage. Identity is the score of sequence a against sequence b (Sab) divided by the score of b against itself (Sbb). This normalization is necessary because the scores come from a scoring matrix. Coverage is Sab divided by Saa. This gives the coverage of the potential sequence relative to the known sequence (so if you are looking for a gene and want to exclude good matches that are mostly spacer, coverage may help). The two measures are often, but not always, correlated.
To set them, use
coverage = a number between 0 and 1
identity = a number between 0 and 1
Look at this page for ways to explore this data.
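The two ratios described in this section can be illustrated with a minimal Python sketch. This is not PHLAWD's implementation: the local_score function, its match/mismatch/gap values, and the function names are hypothetical stand-ins for whatever alignment score and scoring matrix PHLAWD actually uses. Only the ratios Sab/Sbb and Sab/Saa come from the text above.

```python
def local_score(x, y, match=5, mismatch=-4, gap=-10):
    """Best local alignment score of x against y (Smith-Waterman style).

    The scoring values are illustrative assumptions, not PHLAWD's matrix.
    """
    prev = [0] * (len(y) + 1)
    best = 0
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            diag = prev[j - 1] + (match if cx == cy else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def identity_and_coverage(a, b):
    """Return (identity, coverage) as defined in the text.

    identity = Sab / Sbb, coverage = Sab / Saa.
    Both sequences are assumed to be non-empty.
    """
    s_ab = local_score(a, b)
    s_aa = local_score(a, a)
    s_bb = local_score(b, b)
    return s_ab / s_bb, s_ab / s_aa
```

Under these assumed scores, identity_and_coverage("ACGTACGT", "ACGT") gives an identity of 1.0 and a coverage of 0.5: the shorter sequence matches perfectly, but the match accounts for only half of the self-score of the longer one.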
Excluding names from dataset
To exclude names from the final dataset and the search, you can provide a file that lists the names (one per line) in the format in which they appear in NCBI. For example:
Lonicera japonica
Lonicera etrusca
Then in the configuration file you would designate
excludelistfile = nameoffile
You may include wildcards such as
*var.
*sp.
*f.
*aff.
* x
For example, *var. will exclude any name that has var. in it (varieties, etc.).
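The exclusion behaviour described above can be sketched in Python as follows. This is only an illustration and makes an assumption about the matching rule: an entry that starts with * is treated as "the rest of the entry appears anywhere in the name", and any other entry must equal the name exactly. PHLAWD's actual matching may differ, and the function name is_excluded is hypothetical.

```python
def is_excluded(name, patterns):
    """Return True if name matches any entry of an exclude list.

    Assumed rule (not confirmed against PHLAWD's source):
    - "*text" excludes any name that contains "text"
    - any other entry excludes only an exactly equal name
    """
    for pattern in patterns:
        if pattern.startswith("*"):
            if pattern[1:] in name:
                return True
        elif name == pattern:
            return True
    return False


# Entries as they would be read from the exclude file, one per line.
patterns = ["Lonicera japonica", "*var.", "*sp."]
names = [
    "Lonicera japonica",
    "Lonicera etrusca",
    "Lonicera japonica var. chinensis",
    "Lonicera sp. 123",
]
kept = [n for n in names if not is_excluded(n, patterns)]
```

With this rule, only Lonicera etrusca remains in kept. Note that under a plain substring rule a short entry can match more than intended, so it is worth checking the resulting name list after a run.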
Including names from dataset
To include only the names listed in a file, put those names in a file (one per line) and designate it in the configuration file:
listfile = nameoffile
Excluding GIs from dataset
To exclude GIs (GenBank GI numbers) from the final dataset and the search, you can provide a file that lists the GIs (one per line).
excludegilistfile = nameoffile
Including GIs from dataset
To include only the GIs listed in a file, put those GIs in a file (one per line).
includegilistfile = nameoffile