Run options

Choosing a coverage and identity

To estimate homology, PHLAWD takes a set of sequences (from GenBank) and compares them against a “known” set of genes designated in the configuration file. PHLAWD currently uses two measures, identity and coverage. Identity is the score of seq a against seq b (Sab) divided by the score of b against itself (Sbb). This normalization is necessary because the scores come from a scoring matrix. Coverage is Sab divided by Saa. This gives the coverage of the potential sequence relative to the known sequence (so if you are looking for a gene and want to exclude good matches that are mostly spacer, coverage may help). The two measures will often, but not always, be correlated.
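The two ratios can be illustrated with a minimal Python sketch. It is not PHLAWD's implementation: the `local_score` function below is a simple Smith-Waterman local alignment score with assumed match, mismatch, and gap values standing in for a real scoring matrix, and all function names are hypothetical.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The match/mismatch/gap values are illustrative assumptions,
    not the scoring matrix PHLAWD uses.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def identity_and_coverage(a, b):
    """Return (identity, coverage) = (Sab / Sbb, Sab / Saa).

    Both sequences must be non-empty, otherwise the self-scores are zero.
    """
    s_ab = local_score(a, b)
    s_aa = local_score(a, a)
    s_bb = local_score(b, b)
    return s_ab / s_bb, s_ab / s_aa


if __name__ == "__main__":
    identity, coverage = identity_and_coverage("ACGTACGT", "ACGT")
    print(identity, coverage)
```

In this toy case seq b matches perfectly inside seq a, so identity is high, while only half of seq a takes part in the match, so coverage is lower. This is the kind of situation in which the two measures diverge.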

To set them in the configuration file, use

coverage = a number between 0 and 1
identity = a number between 0 and 1

Look at this page for ways to explore this data.

Excluding names from dataset

To exclude names from the final dataset and the search, you can provide a file that lists the names (one per line) in the format in which they appear in NCBI. For example

Lonicera japonica
Lonicera etrusca

Then in the configuration file you would designate

excludelistfile = nameoffile

You may include wildcards such as

*var.
*sp.
*f.
*aff.
* x

For example, *var. will exclude any name that has var. in it (varieties, etc.).
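One way to picture the exclusion list is the Python sketch below. It assumes, based only on the description above, that a line starting with * excludes any name containing the text after the asterisk, and that any other line must match the name exactly. The function names are hypothetical and this is not PHLAWD's own code.

```python
def load_patterns(lines):
    """Split exclusion-file lines into exact names and substring patterns."""
    exact, substrings = set(), []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("*"):
            # Assumed semantics: "*var." matches any name containing "var."
            substrings.append(line[1:])
        else:
            exact.add(line.strip())
    return exact, substrings


def is_excluded(name, exact, substrings):
    """True if the name is listed exactly or contains a wildcard fragment."""
    return name in exact or any(fragment in name for fragment in substrings)


if __name__ == "__main__":
    exact, substrings = load_patterns(["Lonicera japonica\n", "*var.\n", "*sp.\n"])
    for name in ["Lonicera japonica", "Lonicera etrusca",
                 "Lonicera japonica var. chinensis", "Lonicera sp. ABC"]:
        print(name, is_excluded(name, exact, substrings))
```

Under this substring reading, short patterns can match more than intended (for instance, a fragment such as " x" also occurs in any epithet that begins with x), so it is worth checking which names a pattern actually removes.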

Including names from dataset

To include only a specific set of names, list those names in a file (one per line) and designate it in the configuration file.

listfile = nameoffile

Excluding GIs from dataset

To exclude GIs (NCBI GI numbers) from the final dataset and the search, you can provide a file that lists the GIs (one per line).

excludegilistfile = nameoffile

Including GIs from dataset

To include only a specific set of GIs, list those GIs in a file (one per line).

includegilistfile = nameoffile
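The effect of the two GI lists can be sketched in Python as follows. This is an illustration under stated assumptions, not PHLAWD's code: the function names and GI values are hypothetical, and the order in which an include list and an exclude list are combined (include first, then exclude) is assumed here rather than taken from the documentation.

```python
def read_gis(lines):
    """Read one GI per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_by_gi(records, include=None, exclude=None):
    """Keep (gi, sequence) records allowed by the include and exclude sets.

    include=None means no include list was given, so every GI is eligible.
    """
    kept = []
    for gi, seq in records:
        if include is not None and gi not in include:
            continue  # not on the include list
        if exclude and gi in exclude:
            continue  # explicitly excluded
        kept.append((gi, seq))
    return kept


if __name__ == "__main__":
    # Made-up GI numbers and sequences, for illustration only.
    records = [("1001", "ACGT"), ("1002", "ACGA"), ("1003", "ACGC")]
    print(filter_by_gi(records, exclude=read_gis(["1002\n"])))
    print(filter_by_gi(records, include=read_gis(["1001\n", "1002\n"])))
```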
