Advanced searches using SQL

The bleeding edge version of PHLAWD (available by compiling the code from the PHLAWD github repo) now has the ability to accept SQL strings through the search parameter in the config file. To activate this functionality, you also need to issue the “searchliteral” keyword on a separate line in the config file. If this keyword is present, PHLAWD will just tack the value of the “search” parameter onto the SQL query it builds internally (so, you don’t want to feed it a complete SQL query, just the part it uses for matching potential input sequences).

This allows a user to perform very precise, very powerful queries to the database, which can help avoid troublesome sequences, reduce the number of input sequences that need to be cross-checked against the guide sequences, and limit the overlap of matched sequences with flanking regions.

For those who know nothing about the relational database language called SQL, I will provide some examples below. If you are already familiar with SQL, here are some important highlights about passing SQL strings to PHLAWD:

  1.  You should (probably) only search the “description” field; this is the default search behavior and incorporating other fields may yield unexpected results.
  2. Don’t forget about order of operations for the AND and OR operators (AND has higher priority).
  3. Only use single quotes.
  4. Don’t break your SQL query into multiple lines; PHLAWD reads config files on a line-by-line basis.

Here is an example line from a config file using the searchliteral option:

search = (description LIKE '%ITS%' OR description LIKE '%internal%' OR description LIKE '%5.8S%') AND NOT (description LIKE '%18S%' OR description LIKE '%26S%')

For those new to SQL, here are some tips:

The example SQL string above will match any sequence whose description contains any of the words “ITS”, “internal”, or “5.8S”, unless it also contains either of the words “18S” or “26S”. The searching is not case-sensitive.

The “LIKE” keyword used in the SQL string above is a matching operator—you put the fieldname before it and the search string after it. Notice how whenever you see a “LIKE” in the example, it is preceded by the word “description”? Writing “description LIKE ‘[search string]'” into your query will find sequences whose description field matches the value of [search string]. You need to wrap the value of [search string] in single quotes so that the SQL parser knows it is a string. If you leave out the quotes things may behave strangely. Do not use double quotes, since this could cause strange behavior in the string parsers within PHLAWD.

The database that contains the sequence data stores information in a number of fields, which represent different properties or metadata associated with the sequences. For the purposes of finding sequences, it is wise to limit your searches to the “description” field, unless you know exactly what you are doing. This is the default behavior for PHLAWD and it is the safest.

When using the LIKE operator, the “%” character can be used as a wildcard in the search string; it will match any number of unspecified characters. If you don’t add the % (or some other wildcard) before and after the word you are searching for, phlawd will look for sequences whose description exactly matches that word. For instance, the query string “description LIKE ‘its'” will return 0 results, because there is no sequence in GenBank whose description is exactly equal to “its”. However, the string “description LIKE ‘%its%’ will return tens of thousands of sequences for plants alone.

There are some subtleties to using wildcards in SQL. For more information, read about SQL wildcards here.

The AND operator has higher precedence than the OR operator, so you need to use parentheses to indicate the correct order of operations or your query will not return what you think it is supposed to. The parentheses above wrap the sets of OR’s.

Here are a couple of example lines for the PHLAWD config file; these are trivial examples but illustrate the general approach that will be relevant for most people:

search = description LIKE '%rbcL%' AND NOT description LIKE '%matK%'
search = description LIKE '%rbcL%' OR description LIKE '%ribulose%'

Here are some more complex ones that begin to show how using SQL searches can really improve the specificity of PHLAWD results:

# for accD-psaI spacer; exclude sequences from flanking gene regions
search = (description LIKE '%accD%' AND description LIKE '%psaI%') AND NOT (description LIKE '%ycf4%' OR description LIKE '%rbcL%')

# for entire ndhF-rpl32-trnL region; exclude flanking regions
search = (description LIKE '%ndhF%' AND description LIKE '%rpl32%' AND description LIKE '%trnL%') AND NOT (description LIKE '%ccsA%' OR description LIKE '%ycf1%')

Example PHLAWD config file below:

clade = Magnoliaceae
db = /home/phylo/data/gbpln.db
mad = 0.01
coverage = 0.2
identity = 0.2
numthreads = 4
gene = its
knownfile = its.keep

search = (description LIKE '%ITS%' OR description LIKE '%internal%' OR description LIKE '%5.8S%') AND NOT (description LIKE '%18S%' OR description LIKE '%26S%')

# don't forget to include this line to activate raw SQL query feature:

Fork me on GitHub