GenBank database management

I just committed to git the ability to create the GenBank database with the PHLAWD program itself instead of the Python scripts. This is much faster thanks to a homemade parser for GenBank flat files. To use it, just pull the most recent source (you can see how to install from source here). Then, after compiling and installing, create a file with some simple information. Let's call it db.setup; it will contain

db = pln.db
division = pln
download

Here, db = pln.db is just the output file, and division = pln is the division, as defined by GenBank, that you want to build (all the divisions are listed below). download just means that you want to download the files because you haven't already done so. Then just run

PHLAWD setupdb db.setup

It will download the necessary files and load them into the database with the name that was specified in db.setup. The only additional functionality planned is support for downloads more general than a single division.

The divisions are

1. pri – primate sequences
2. rod – rodent sequences
3. mam – other mammalian sequences
4. vrt – other vertebrate sequences
5. inv – invertebrate sequences
6. pln – plant, fungal, and algal sequences
7. bct – bacterial sequences
8. vrl – viral sequences
9. phg – bacteriophage sequences
10. syn – synthetic sequences
11. una – unannotated sequences
12. est – EST sequences (expressed sequence tags)
13. pat – patent sequences
14. sts – STS sequences (sequence tagged sites)
15. gss – GSS sequences (genome survey sequences)
16. htg – HTG sequences (high-throughput genomic sequences)
17. htc – unfinished high-throughput cDNA sequencing
18. env – environmental sampling sequences

There are also two special codes you can use: met for all metazoan sequences, and all for all sequences from divisions 1-7.
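
If you build several divisions, it can be handy to generate the setup file from a script rather than by hand. The following is only a rough sketch, not part of PHLAWD: render_setup and run_setupdb are hypothetical helper names, the default output name (division plus .db) is my own convention, and I am assuming that download is written as a bare keyword on its own line. The division codes and the PHLAWD setupdb command are the ones described above.

```python
import shutil
import subprocess

# Division codes from the list above, plus the two special codes (met, all).
DIVISIONS = {
    "pri", "rod", "mam", "vrt", "inv", "pln", "bct", "vrl", "phg",
    "syn", "una", "est", "pat", "sts", "gss", "htg", "htc", "env",
    "met", "all",
}


def render_setup(division, db=None, download=True):
    """Return the text of a setup file for one division (hypothetical helper)."""
    if division not in DIVISIONS:
        raise ValueError(f"unknown GenBank division: {division!r}")
    # Assumption: name the output file after the division unless told otherwise.
    db = db or f"{division}.db"
    lines = [f"db = {db}", f"division = {division}"]
    if download:
        # Assumption: 'download' is a bare keyword on its own line.
        lines.append("download")
    return "\n".join(lines) + "\n"


def run_setupdb(setup_path):
    """Run `PHLAWD setupdb <file>`, failing early if PHLAWD is not installed."""
    if shutil.which("PHLAWD") is None:
        raise FileNotFoundError("PHLAWD executable not found on PATH")
    subprocess.run(["PHLAWD", "setupdb", str(setup_path)], check=True)


# Show the setup text for the plant division without touching the disk.
print(render_setup("pln"), end="")
```

To actually build the database you would write the rendered text to db.setup and then call run_setupdb("db.setup"). Checking the division code first just catches typos before a long download starts.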


With the new database creator, if you are moving to a completely new GenBank release, it is actually as fast or faster to create a new database from scratch than to update an existing one. For daily updates, I am still adding the functionality.
