INSTALLING IN-SILICO PCR

Installing In-Silico PCR involves four major steps:
 1) Creating sequence databases.
 2) Running the gfServer program to create in-memory indexes of the databases.
    (If you are just using the command-line gfPcr program rather than the
    web-driver webPcr, this is as far as you need to go.)
 3) Editing the webPcr.cfg file to tell it what machine and port the gfServer(s)
    are running on, and optionally customizing the webPcr appearance to users.
 4) Copying the webPcr executable and webPcr.cfg to a directory where the web server can
    execute webPcr as a CGI.

CREATING SEQUENCE DATABASES

You create databases with the program faToTwoBit. Typically you'll create
a separate database for each genome you are indexing.  Each database can
contain up to four billion bases of sequence in an unlimited number of
records.  The databases for webPcr and webBlat are identical.

The input to faToTwoBit is one or more fasta format files each of which
can contain multiple records.  If the sequence contains repeat sequences,
as is the case with vertebrates and many plants, the repeat sequences can
be represented in lower case and the other sequence in upper case.  The gfServer
program can be configured to ignore the repeat sequences.  The output of
faToTwoBit is a file which is designed for fast random access and efficient
storage.  The output files store four bases per byte.  They use a small amount
of additional space to store the case of the DNA and to keep track of runs of
N's in the input.  Non-N ambiguity codes such as Y and U in the input sequence
will be converted to N.

Here's how a typical installation might create a mouse and a human genome database:
    cd /data/genomes
    mkdir twoBit
    faToTwoBit human/hg16/*.fa twoBit/hg16.2bit
    faToTwoBit mouse/mm4/*.fa twoBit/mm4.2bit
There's no need to put all of the databases in the same directory, but it can
simplify bookkeeping.  

The databases can also be in the .nib format which was used with blat and
gfClient/gfServer until recently.  The .nib format only packed 2 bases per
byte, and could only handle one record per nib file.  Recent versions of blat
and related programs can use .2bit files as well.

CREATING IN-MEMORY INDICES WITH GFSERVER

The gfServer for webPcr uses approximately 1/2 a byte of RAM per base of the sequence
that is indexed.  The gfServer is memory intensive but typically does not require
a lot of CPU power.  The gfServer can be on the same machine as the web server for
installations just two mammalian sized genomes or less.  Many installations will
work better with the gfServers located on other machines.

A typical installation might go:
    ssh bigRamMachine
    cd /data/genomes/twoBit
    gfServer -stepSize=5 start bigRamMachine 17779 hg16.2bit &
It will take approximately half an hour for the indexing to complete on the
human genome.  The gfServer needs to be run in the same directory as the
2bit file (or nib files).  When the indexing is complete gfServer writes:
"Server ready for queries!"  Note that the -stepSize=5 option is not required
for blat, but it will make blat more sensitive.  

Two gfServer options are useful for trading off between speed and sensitivity: tileSize
and stepSize.  TileSize is the size of the DNA word indexed.  By default it is 11.
StepSize is the spacing between indexed words, which by default is equal to tileSize.
The PCR programs can find primers with a minimum perfect 3' match of 
tileSize + stepSize - 1.  With tileSize=11 and stepSize=5, the minimum perfect
match is 15.   In general if the gfServer index is to be shared with blat,
we don't recommend going smaller than tileSize=9 stepSize=4.

The option -mask will for repeat-rich genome such as human make the index use
about less space and run faster.  Generally if the gfServer is shared with blat
you don't want to use the -mask, since many 3' UTRs do include repeat sequences.
However since it is rarely fruitful to have a PCR primer in a repeat, if the
gfServer is dedicated to PCR it makes sense to use -mask.

EDITING THE WEBPCR.CFG FILE

The webPcr.cfg file tells the webPcr program where to look for gfServers and
for sequence.  The basic format of the .cfg file is line oriented with the
first word of the line being a command.  Blank lines and lines starting with #
are ignored.  The webBlat.cfg and webPcr.cfg files are similar. The webBlat.cfg
commands are:
   gfServer  -  defines host and port a gfServer is running on, the associated
                sequence directory, and the name of the database to display in
                the webPcr web page.
   background - defines the background image if any to display on web page
   company    - defines company name to display on web page
The background and company commands are optional.  The webPcr.cfg file must
have at least one valid gfServer line.  Here is a webPcr.cfg file that you
might find at a typical installation:

company Awesome Research Amalgamated
background /images/dnaPaper.jpg
gfServer bigRamMachine 17779 /data/genomes/2bit/hg16.2bit Human Genome
gfServer bigRamMachine 17781 /data/genomes/2bit/mm4.2bit Mouse Genome
# The bacto group in Jersey is ok with us using their server too
gfServer bactoGfServer.awesomeresearch.com 17779 /bacto/genomes/2bit/bacteria.2bit Bacteria


PUTTING WEBPCR WHERE THE WEB SERVER CAN EXECUTE IT

The details of this step vary highly from web server to web server.  On
a typical Apache installation it might be:
     ssh webServer
     cd kent/webPcr
     cp webPcr /usr/local/apache/cgi-bin
     cp webPcr.cfg /usr/local/apache/cgi-bin
assuming that you've put the executable and config file in kent/webPcr.
On OS-X, instead of /usr/local/apache/cgi-bin typically you'll copy stuff
to /LibraryWebServer/CGI-Executables.  Unless you are administering your
own computer you will likely need to ask your local system administrators
for help with this step.
