This file describes how we made the zoo browser database for each organism,
specified by $ORG

DOWNLOAD THE ZOO $ORG DATA AND SET UP A LOCAL DIRECTORY STRUCTURE
o - See the doc in /cluster/store2/zoo/zoo.readme for download instructions,
then download the data.

The following should be put in a script, but it is not yet
For each organism $ORG in the organism list
do:
    cd /cluster/store2/zoo
    mkdir zoo$ORG1
    cd zoo$ORG1
    mkdir 1 (This fakes the chromosome directory structure so we can fake a
browser track with 1 chromosome)
    cd 1
    mkdir ctgCFTR
    cp /cluster/store2/zoo/rawData/CFTRseqs/$ORG_CFTR.fasta.masked ctgCFTR
    cd ..
    ln -s ctgCFTR/$ORG_CFTR.fasta.masked ./chr1.fa
    cd /cluster/store2/zoo/zoo$ORG1
    mkdir bed

IMPORTANT STEP - FAKE THE CHR1 CHROMOSOME IN THE FA FILEs
o      NOTE: The string CFTR_$ORG needs to be changed to chr1 in all the .fa files
      files in the zoo/$ORG/1 directory. For example zoo/zooMouse1/1/chr1.fa has the string 
      mouse_CFTR in it. This needs to be changed to chr1 globally in the file. I did this by hand in each
      .fa file (all 12) to get them to work. In vi, you can do a global search and replace by typing:
      ESC :%s/mouse_CFTR/chr1/g
      The above example is for the mouse.

PROCESSING $ORG MRNA FROM GENBANK INTO DATABASE. (done)

o - Create the $ORGRNA and $ORGEST fasta files. This is usually already done.
    ssh kkstore
    cd /cluster/store1/genbank.127
    gunzip -c gbrod*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil mrna.ra mrna.fa mrna.ta -byOrganism=org stdin gunzip -c gbrod*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil mrna.ra mrna.127 mrna.ta -byOrganism=org stdin
    gunzip -c gbest*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil est.ra est.fa est.ta -byOrganism=org stdin gunzip -c gbrod*.gz | gbToFaRa /projects/compbio/experiments/hg/h/anyRna.fil mrna.ra mrna.127 mrna.ta -byOrganism=org stdin


CREATING DATABASE (done)

o - ln -s /cluster/store2/zoo ~/zoo
  - make a semi-permanent read-only alias:
        alias zoo$ORG1 mysql -u hguser -phguserstuff -A zoo$ORG1
o - Make sure there is at least 20 gig free on hgwdev:/usr/local/mysql 
o The makeZoo.pl script will do the following things
o - Run the zoo/scripts/makeZoo.pl script to create the databases and store the mRNA (non-alignment) info in database.
The script will do the following for each organism:
     hgLoadRna new zoo$ORG1
     hgLoadRna add -type=mRNA zoo$ORG1 /cluster/store1/mrna.128/org/$ORG-GENUS_SP/mrna.fa /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/mrna.ra
    If EST data is available it will also do:
     hgLoadRna add -type=EST zoo$ORG1 /cluster/store1/mrna.128/org/$ORG-GENUS_SP/est.fa /cluster/store1/mrna.128/org/$ORG-GENUS_SPECIES/est.ra
    The last line will take quite some time to complete.  

o - Add the trackDB table to the database
The script will also do this

STORING ZOO SEQUENCE AND ASSEMBLY INFORMATION  (DONE) 

Create packed chromosome sequence files 
o The makeZoo.pl script will also do the following things
- Load chromosome sequence info into database 
- Store zoo info in database.
- Make and load GC percent table

REPEAT MASKING (DONE)

We don't use the cluster for this because we have such a small input data set.
Repeat mask on sensitive settings by doing.

The makeZoo.pl script will generate the files needed to repeat mask. It will put them in zoo/work/rmskS
Then just ssh to kk, then cd to the zoo/work/rmskS directory and do a 'para push'.

When repeat masking is done, load the .out files by running the makeZoo.pl script again

MAKING AND STORING mRNA AND EST ALIGNMENTS (DONE)

o - Use BLAT to generate refSeq, mRNA and EST alignments as so:
cd to zoo/work/blat and on kk and do a para push. Check on progress using para check.

o - Process refSeq mRNA and EST alignments into near best in genome.
The below is done by the makeZoo.pl script

      Note: No lifting up is needed because the fragments are already in chromosome
coordinates.

      cd ~/zoo/bed
      cd refSeq
      pslSort dirs raw.psl /tmp psl
      pslReps -minCover=0.2 -sizeMatters -minAli=0.98 -nearTop=0.002 raw.psl contig.psl /dev/null
      mv contig.psl all_refSeq.psl
      pslSortAcc nohead chrom /tmp all_refSeq.psl
      cd ..

      cd mrna
      pslSort dirs raw.psl /tmp psl
      pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
      mv contig.psl all_mrna.psl
      pslSortAcc nohead chrom /tmp all_mrna.psl
      cd ..

      cd est
      pslSort dirs raw.psl /tmp psl
      pslReps -minAli=0.98 -sizeMatters -nearTop=0.005 raw.psl contig.psl /dev/null
      mv contig.psl all_est.psl
      pslSortAcc nohead chrom /tmp all_est.psl
      cd ..

o - Load mRNA alignments into database.
      ssh hgwdev
      cd /cluster/store2/zoo/zoo$ORG1/bed/mrna/chrom
      ../../scripts/mrnaLoadPsl.csh

o - Load EST alignments into database.
      ssh hgwdev
      cd /cluster/store2/zoo/zoo$ORG1/bed/est/chrom
      ../../scripts/estLoadPsl.csh

o - Create subset of ESTs with introns and load into database.
      - ssh kkstore
      cd ~/zoo/zoo$ORG1
      cp /cluster/store1/gs.11/build28/jkStuff/makeIntronEst.sh bed/scripts
      tcsh bed/scripts/jkStuff/makeIntronEst.sh
      - ssh hgwdev
      cd ~/zoo/zoo$ORG1/bed/est/intronEst
      hgLoadPsl zoo$ORG1 *.psl

The above is all done for you in the makeZoo.pl script.

PRODUCING KNOWN GENES (todo)

o - Download everything from ftp://ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/
    into /cluster/store1/mrna.129/refSeq.
o - Unpack this into fa files and get extra info with:
       cd /cluster/store1/mrna.129/refSeq
       gunzip $ORG.gbff.gz
       gunzip $ORG.faa.gz
The above 2 files need to be symlinked from /cluster/store1/mrna.129/refSeq/$ORG/$ORG.fa /cluster/store1/mrna.128/org/$ORG/refSeq.fa
and refSeq.ra respectively.
o - Get extra info from NCBI and produce refGene table as so:
       ssh hgwdev
       cd ~/mm/bed/refSeq
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2ref 
       wget ftp://ncbi.nlm.nih.gov/refseq/LocusLink/mim2loc
o - Similarly download refSeq proteins in fasta format to refSeq.pep
o - Align these by processes described under mRNA/EST alignments above.
o - Produce refGene, refPep, refMrna, and refLink tables as so:
       ssh hgwdev
       cd ~/mm/bed/refSeq
       ln -s /cluster/store1/mrna.129 mrna
hgRefSeqMrna mm1 mrna/$ORGRefSeq.fa mrna/$ORGRefSeq.ra all_refSeq.psl loc2ref mrna/refSeq/$ORG.faa mim2loc
o - Add Jackson labs info
     cd ~/mm/bed
     mkdir jaxOrtholog
     cd jaxOrtholog
     ftp ftp://ftp.informatics.jax.org/pub/informatics/reports/HMD_Human3.rpt
     awk -f filter.awk *.rpt > jaxOrtholog.tab
    Load this into mysql with something like:
     mysql -u hgcat -pBIGSECRET mm1 < ~/src/hg/lib/jaxOrtholog.sql
     mysql -u hgcat -pBIGSECRET -A mm1
    and at the mysql> prompt
     load data local infile 'jaxOrtholog.tab' into table jaxOrtholog;


CREATING PAIRWISE ALIGNMENTS (TODO)
- This has been tested but needs to be customized to your specific needs.

        The makeZoo.pl script has comments on how to make pairwise alignments. The basic algorithm is as follows:

Echo all the organism names to one input file.
Echo all the mrna, est, and refSeq fasta file names to another input file.
Then run genSub2 to generate the cross-product of all sequences aligned against all mrnas and ests. Look at scripts/blat-gsub for
the gsub I used to test this. For each organism it creates a file called chr1_$X_mrna.fa and chr1_$X_est.fa, where X is the organism whose
mrna and est this organism's sequence was aligned against it using BLAT.
For example, for the zooMouse/1/chr1.fa sequence aligned against the Cat mrna we get zooMouse/bed/mrna/psl/chr1_Felis_catus_mrna.fa
The above logic is already commented out in the makeZoo.pl script. It just has to be selectively uncommented.


SIMPLE REPEAT TRACK (todo)

o - Create cluster parasol job like so:
        ssh kk
	cd ~/mm/bed
	mkdir simpleRepeat
	cd simpleRepeat
	cp ~/lastOo/bed/simpleRepeat/gsub
	mkdir trf
	echo single > single.lst
        echo /scratch/hg/gs.11/build28/contigs | wordLine stdin > genome.lst
	gensub2 genome.lst single.lst gsub spec
	para make spec
        para push 
     This job had some problems this time combined with the
     cluster going bizarre.  We ended up running it serially
     which only takes about 8 hours anyway.
        liftUp simpleRepeat.bed ~/mm/jkStuff/liftAll.lft warn trf/*.bed

o - Load this into the database as so
        ssh hgwdev
	cd ~/mm/bed/simpleRepeat
	hgLoadBed mm1 simpleRepeat simpleRepeat.bed -sqlTable=$HOME/src/hg/lib/simpleRepeat.sql


PRODUCING GENSCAN PREDICTIONS (todo)
    
o - Produce contig genscan.gtf genscan.pep and genscanExtra.bed files like so:

	First make sure you have appropriate set up, permissions, etc. and you have 
     	tried using Parasol to submit and finished a set of jobs successfully.
     	
	Log into kk
	     	cd ~/mm
     		cd bed/genscan
	Make 3 subdirectories for genscan to put their output files in
		mkdir gtf pep subopt
	Generate a list file, genome.list, of all the contigs
		ls /scratch/hg/mm1/contigs/*.fa >genome.list	
	Create a dummy file, single, containing just 1 single line of any text.	
	Create template file, gsub, for gensub2.  For example (3 lines file):
		#LOOP
		/cluster/home/fanhsu/bin/i386/gsBig {check in line+ $(path1)} {check out line gtf/$(root1).gtf} -trans={check out line pep/$(root1).pep} -subopt={check out line subopt/$(root1).bed} -exe=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/genscan -par=/cluster/home/fanhsu/projects/compbio/bin/genscan-linux/HumanIso.smat -tmp=/tmp -window=2400000
		#ENDLOOP
	Generate job list file, jobList, for Parasol
		gensub2 genome.list single gsub jobList
	First issue the following Parasol command:
		para create jobList
	Run the following command, which will try first 10 jobs from jobList
		para try
	Check if these 10 jobs run OK by
		para check
	If they have problems, debug and fix your program, template file, 
	commands, etc. and try again.  If they are OK, then issue the following 
	command, which will ask Parasol to start all the jobs (around 3000 jobs).
		para push
	Issue either one of the following two commands to check the 
	status of the cluster and your jobs, until they are done.
		parasol status
		para check
	If any job fails to complete, study the problem and ask Jim to help
	if necessary.

o - Convert these to chromosome level files as so:
     cd ~/mm
     cd bed/genscan
     liftUp genscan.gtf ../../jkStuff/liftAll.lft warn gtf/*.gtf
     liftUp genscanSubopt.bed ../../jkStuff/liftAll.lft warn subopt/*.bed
     cat pep/*.pep > genscan.pep

o - Load into the database as so:
     ssh hgwdev
     cd ~/mm/bed/genscan
     ldHgGene mm1 genscan genscan.gtf
     hgPepPred mm1 generic genscanPep genscan.pep
     hgLoadBed mm1 genscanSubopt genscanSubopt.bed


PREPARING SEQUENCE FOR CROSS SPECIES ALIGNMENTS 

Make sure that the NT*.fa files are lower-case repeat masked, and that
the simpleRepeat track is made.  Then put doubly (simple & interspersed)
repeat mask files onto /cluster local drive as so.
    ssh kkstore
    mkdir /scratch/hg/mm1/trfFa
    cd ~/mm
    foreach i (? ?? NA_*)
	cd $i
        foreach j (chr${i}_*)
	    if (-d $j) then
		maskOutFa $j/$j.fa ../bed/simpleRepeat/trf/$j.bed -softAdd /scratch/hg/mm1/trfFa/$j.fa.trf
		echo done $i/$j
	    endif
	end
	cd ..
    end
Then ask admins to do a binrsync.


DOING HUMAN/$ORG ALIGMENTS (todo)

o - Download the lower-case-masked assembly and put it in
    kkstore:/cluster/store1/a2ms.
   
o - Mask simple repeats in addition to normal repeats with:
        mkdir ~/mm/jkStuff/trfCon
	cd ~/mm/allctgs
	/bin/ls -1 | grep -v '\.' > ~/mm/jkStuff/trfCon/ctg.lst
        cd ~/mm/jkStuff/trfCon
	mkdir err log out
    edit ~/mm/jkStuff/trf.gsub to update gs and mm version
	gensub ctg.lst ~/mm/jkStuff/trf.gsub
	mv gensub.out trf.con
	condor_submit trf.con
    wait for this to finish.  Check as so
        cd ~/mm
	source jkStuff/checkTrf.sh
    there should be no output.
o - Copy the RepeatMasked and trf masked genome to
    kkstore:/scratch/hg/gs.11/build28/contigTrf, and ask
    Jorge and all to binrsync to propagate the data
    across the new cluster.
o - Download the assembled $ORG genome in lower-case
    masked form to /cluster/store1/arachne.3/whole.  
    Execute the script splitAndCopy.csh to chop it
    into roughly 50M pieces in arachne.3/parts
o - Set up the jabba job to do the alignment as so:
       ssh kkstore
       cd /cluster/store2/mm.2001.11/mm1
       mkdir blat$ORG.phusion
       cd blat$ORG.phusion
       ls -1S /scratch/hg/gs.3/build28/contigTrf/* > human.lst
       ls -1 /cluster/store1/arachne.3/parts/* > $ORG.lst
    Make a file 'gsub' with the following three lines in it
#LOOP
/cluster/home/kent/bin/i386/blat -q=dnax -t=dnax {check in line+ $(path2)} {check in line+ $(path1)}  {check out line+ psl/$(root2)_$(root1).psl} -minScore=20 -minIdentity=20 -tileSize=4 -minMatch=2 -oneOff=0 -ooc={check in exists /scratch/hg/h/4.pooc} -qMask=lower -mask=lower
#ENDLOOP
    Process this into a jabba file and launch the first set
    of jobs (10,000 out of 70,000) as so:
        gensub2 $ORG.lst human.lst gsub spec
	jabba make hut spec
	jabba push hut
    Do a 'jabba check hut' after about 20 minutes and make sure
    everything is right.  After that make a little script that
    does a "jabba push hut" followed by a "sleep 30" about 50
    times.  Interrupt script when you see jabba push say it's
    not pushing anything.

o - Sort alignments as so 
       ssh kkstore
       cd /cluster/store2/mm.2001.11/mm1/blat$ORG
       pslCat -dir -check psl | liftUp -type=.psl stdout ../liftAll.lft warn stdin | pslSortAcc nohead chrom /cluster/store2/temp stdin
o - Get rid of big pile-ups due to contamination as so:
       cd chrom
       foreach i (*.psl)
           echo $i
           mv $i xxx
           pslUnpile -maxPile=600 xxx $i
       rm xxx
       end
o - Remove long redundant bits from read names by making a file
    called subs.in with the following line:
        gnl|ti^ti
        contig_^tig_
    and running the commands
        cd ~/$ORG/vsOo33/blat$ORG.phusion/chrom
	subs -e -c ^ *.psl > /dev/null
o - Copy over to network where database is:
        ssh kks00
	cd ~/mm/bed
	mkdir blat$ORG
	mkdir blat$ORG/ph.chrom600
	cd !$
        cp /cluster/store2/mm.2001.11/mm1/blat$ORG.phusion/chrom/*.psl .
o - Rename to correspond with tables as so and load into database:
       ssh hgwdev
       cd ~/mm/bed/blat$ORG/ph.chrom600
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blat$ORG.psl
       end
       hgLoadPsl mm1 *.psl
o - load sequence into database as so:
	ssh kks00
	faSplit about /projects/hg3/$ORG/arachne.3/whole/Unplaced.mfa 1200000000 /projects/hg3/$ORG/arachne.3/whole/unplaced
	ssh hgwdev
	hgLoadRna addSeq '-abbr=gnl|' mm1 /projects/hg3/$ORG/arachne.3/whole/unpla*.fa
	hgLoadRna addSeq '-abbr=con' mm1 /projects/hg3/$ORG/arachne.3/whole/SET*.mfa
    This will take quite some time.  Perhaps an hour .

o - Produce 'best in genome' filtered version:
        ssh kks00
	cd ~/$ORG/vsOo33
	pslSort dirs blat$ORGAll.psl temp blat$ORG
	pslReps blat$ORGAll.psl best$ORGAll.psl /dev/null -singleHit -minCover=0.3 -minIdentity=0.1
	pslSortAcc nohead best$ORG temp best$ORGAll.psl
	cd best$ORG
        foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_best$ORG.psl
        end
o - Load best in genome into database as so:
	ssh hgwdev
	cd ~/$ORG/vsOo33/best$ORG
        hgLoadPsl mm1 *.psl

PRODUCING CROSS_SPECIES mRNA ALIGMENTS (done)

This is done by the makeZoo.pl script. It sets up the BLAT job, which has to be run by hand on kk.
After that the script will load the xeno mrna into the db.
The cross-species alignments are done a little differently here, in that, instead of aligning each
species against all other species, I have aligned each species in the group of 12 against the other 11. 
For example, mouse is aligned against the other 11 species in the list.
See the spec file in zoo/work/blat.

PRODUCING FISH ALIGNMENTS (todo)

o - Download sequence from ... and put it in 
	ssh kks00
       /projects/hg3/fish/seq15jun2001/bqcnstn_0106151510.fa
    then
	ln -s /projects/hg3/fish ~/fish
    split this into multiple files and compress original with
	cd ~/fish/seq15jun2001
        faSplit sequence bq* 100 fish
	compress bq*
	copy over to kkstore:/cluster/store2/fish/seq15jun2001
o - Do fish/human alignments.
       ssh kkstore
       cd /cluster/store2/fish
       mkdir vsOo33
       cd vsOo33
       mkdir psl
       ls -1S /var/tmp/hg/gs.11/build28/tanMaskNib/*.fa.trf > human.lst
       ls -1S /projects/hg3/fish/seq15jun2001/*.fa > fish.lst
     Copy over gsub from previous version and edit paths to
     point to current assembly.
       gensub2 fish.lst human.lst gsub spec
       jabba make hut spec
       jabba try hut
     Make sure jobs are going ok.  Then
       jabba push hut
     wait about 2 hours and do another
       jabba push hut
     do a jabba check hut, and if necessary another push hut.
o - Sort alignments as so 
       pslCat -dir psl | liftUp -type=.psl stdout ~/mm/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin
o - Copy to hgwdev:/scratch.  Rename to correspond with tables as so and 
    load into database:
       ssh hgwdev
       cd ~/fish/vsOo33/chrom
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatFish.psl
       end
       hgLoadPsl mm1 *.psl



TIGR GENE INDEX (todo)

  o - Download ftp://ftp.tigr.org/private/HGI_ren/*aug.tgz into
      ~/oo.29/bed/tgi and then execute the following commands:
          cd ~/oo.29/bed/tgi
	  mv cattleTCs_aug.tgz cowTCs_aug.tgz
	  foreach i ($ORG cow human pig rat)
	      mkdir $i
	      cd $i
	      gtar -zxf ../${i}*.tgz
	      gawk -v animal=$i -f ../filter.awk * > ../$i.gff
	      cd ..
	  end
	  mv human.gff human.bak
	  sed s/THCs/TCs/ human.bak > human.gff
	  ldHgGene -exon=TCs hg7 tigrGeneIndex *.gff


      
LOAD STS MAP 
     - login to hgwdev
      cd ~/mm/bed
      mm1 < ~/src/hg/lib/stsMap.sql
      mkdir stsMap
      cd stsMap
      bedSort /projects/cc/hg/mapplots/data/tracks/build28/stsMap.bed stsMap.bed
      - Enter database with "mm1" command.
      - At mysql> prompt type in:
          load data local infile 'stsMap.bed' into table stsMap;
      - At mysql> prompt type
          quit


LOAD CHROMOSOME BANDS
      - login to hgwdev
      cd /cluster/store2/mm.2001.11/mm1/bed
      mkdir cytoBands
      cp /projects/cc/hg/mapplots/data/tracks/build28/cytobands.bed cytoBands
      mm1 < ~/src/hg/lib/cytoBand.sql
      Enter database with "mm1" command.
      - At mysql> prompt type in:
          load data local infile 'cytobands.bed' into table cytoBand;
      - At mysql> prompt type
          quit

LOAD $ORGREF TRACK (todo)
    First copy in data from kkstore to ~/mm/bed/$ORGRef.  
    Then substitute 'genome' for the appropriate chromosome 
    in each of the alignment files.  Finally do:
       hgRefAlign webb mm1 $ORGRef *.alignments

LOAD AVID $ORG TRACK (todo)
      ssh cc98
      cd ~/mm/bed
      mkdir avid$ORG
      cd avid$ORG
      wget http://pipeline.lbl.gov/tableCS-LBNL.txt
      hgAvidShortBed *.txt avidRepeat.bed avidUnique.bed
      hgLoadBed avidRepeat avidRepeat.bed
      hgLoadBed avidUnique avidUnique.bed


LOAD SNPS (todo)
      - ssh hgwdev
      - cd ~/mm/bed
      - mkdir snp
      - cd snp
      - Download SNPs from ftp://ftp.ncbi.nlm.nih.gov/pub/sherry/gp.oo33.gz
      - Unpack.
        grep RANDOM gp.oo33 > snpTsc.txt
        grep MIXED  gp.oo33 >> snpTsc.txt
        grep BAC_OVERLAP  gp.oo33 > snpNih.txt
        grep OTHER  gp.oo33 >> snpNih.txt
        awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
        awk -f filter.awk snpNih.txt > snpNih.contig.bed
        liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
        liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
	hgLoadBed mm1 snpTsc snpTsc.bed
	hgLoadBed mm1 snpNih snpNih.bed

LOAD CPGISSLANDS (todo)
     - login to hgwdev
     - cd /cluster/store2/mm.2001.11/mm1/bed
     - mkdir cpgIsland
     - cd cpgIsland
     - mm1 < ~kent/src/hg/lib/cpgIsland.sql
     - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_aug.oo33.tar
     - tar -xf cpg*.tar
     - awk -f filter.awk */ctg*/*.cpg > contig.bed
     - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
     - Enter database with "mm1" command.
     - At mysql> prompt type in:
          load data local infile 'cpgIsland.bed' into table cpgIsland

LOAD ENSEMBLE GENES (todo)
     cd ~/mm/bed
     mkdir ensembl
     cd ensembl
     wget http://www.sanger.ac.uk/~birney/all_april_ctg.gtf.gz
     gunzip all*.gz
     liftUp ensembl110.gtf ~/mm/jkStuff/liftAll.lft warn all*.gtf
     ldHgGene mm1 ensGene en*.gtf
o - Load Ensembl peptides:
     (poke around ensembl to get their peptide files as ensembl.pep)
     (substitute ENST for ENSP in ensemble.pep with subs)
     wget ftp://ftp.ensembl.org/pub/current/data/fasta/pep/ensembl.pep.gz
     gunzip ensembl.pep.gz
   edit subs.in to read: ENSP|ENST
     subs -e ensembl.pep
     hgPepPred mm1 generic ensPep ensembl.pep

LOAD SANGER22 GENES (todo)
      - cd ~/mm/bed
      - mkdir sanger22
      - cd sanger22
      - not sure where these files were downloaded from
      - grep -v Pseudogene Chr22*.genes.gff | hgSanger22 mm1 stdin Chr22*.cds.gff *.genes.dna *.cds.pep 0
          | ldHgGene mm1 sanger22pseudo stdin
         - Note: this creates sanger22extras, but doesn't currently create
           a correct sanger22 table, which are replaced in the next steps
      - sanger22-gff-doctor Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene mm1 sanger22 stdin
      - sanger22-gff-doctor -pseudogenes Chr22.3.1x.cds.gff Chr22.3.1x.genes.gff \
          | ldHgGene mm1 sanger22pseudo stdin

      - hgPepPred mm1 generic sanger22pep *.pep

LOAD SANGER 20 GENES (todo)
     First download files from James Gilbert's email to ~/mm/bed/sanger20 and
     go to that directory while logged onto hgwdev.  Then:
        grep -v Pseudogene chr_20*.gtf | ldHgGene mm1 sanger20 stdin
	hgSanger20 mm1 *.gtf *.info


LOAD RNAGENES (todo)
      - login to hgwdev
      - cd ~kent/src/hg/lib
      - mm1 < rnaGene.sql
      - cd /cluster/store2/mm.2001.11/mm1/bed
      - mkdir rnaGene
      - cd rnaGene
      - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
      - gunzip *.gz
      - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
      - hgRnaGenes mm1 chrom.gff

LOAD EXOFISH (todo)
     - login to hgwdev
     - cd /cluster/store2/mm.2001.11/mm1/bed
     - mkdir exoFish
     - cd exoFish
     - mm1 < ~kent/src/hg/lib/exoFish.sql
     - Put email attatchment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
       into /cluster/store2/mm.2001.11/mm1/bed/exoFish/all_maping_ecore
     - awk -f filter.awk all_maping_ecore > exoFish.bed
     - hgLoadBed mm1 exoFish exoFish.bed

LOAD $ORG SYNTENY (todo)
     - login to hgwdev.
     - cd ~/kent/src/hg/lib
     - mm1 < $ORGSyn.sql
     - mkdir ~/mm/bed/$ORGSyn
     - cd ~/oo/bed/$ORGSyn
     - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attatchment as
       $ORGSyn.txt
     - awk -f format.awk *.txt > $ORGSyn.bed
     - delete first line of $ORGSyn.bed
     - Enter database with "mm1" command.
     - At mysql> prompt type in:
          load data local infile '$ORGSyn.bed' into table $ORGSyn


LOAD GENIE (todo)
     - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
     - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
     - ldHgGene mm1 genieAlt predChrom.gtf

     - cat */ctg*/ctg*.affymetrix.aa > pred.aa
     - hgPepPred mm1 genie pred.aa 

     - mm1
         mysql> delete * from genieAlt where name like 'RS.%';
         mysql> delete * from genieAlt where name like 'C.%';

LOAD SOFTBERRY GENES (todo)
     - ln -s /cluster/store2/mm.2001.11/mm1 ~/mm
     - cd ~/mm/bed
     - mkdir softberry
     - cd softberry
     - get ftp://www.softberry.com/pub/SC_MOU_NOV01/softb_mou_genes_nov01.tar.gz
     - mkdir output
     - Run the fixFormat.pl routine in
        ~/mm/bed/softberry like so:
        ./fixFormat.pl output
     - This will stick all the properly converted .gff and .pro files in
     the directory named "output".
     - cd output
     ldHgGene mm1 softberryGene chr*.gff
     hgPepPred mm1 softberry *.pro
     hgSoftberryHom mm1 *.pro

LOAD ACEMBLY
    - Get acembly*gene.gff from Jean and Michelle Thierry-Miegs and
      place in ~/mm/bed/acembly
    - Replace c_chr with chr in acembly*.gff
    - Replace G_t1_chr with chr and likewise
      G_t2_chr with chr, etc.
    - cd ~/mm/bed/acembly
    - # The step directly below is not necessary since the files were already lifted
      #  liftUp ./aceChrom.gff /cluster/store2/mm.2001.11/mm1/jkStuff/liftHs.lft warn acemblygenes*.gff
    - Use /cluster/store2/mm.2001.11/mm1/mattStuff/filterFiles.pl to prepend "chr" to the
    start of every line in the gene.gff files and to concat them into the aceChrom.gff
    gile. Read the script to see what it does. It's tiny and simple.
    - Concatenate all the protein.fasta files into a single acembly.pep file
    - Load into database as so:
        ldHgGene mm1 acembly aceChrom.gff
        hgPepPred mm1 generic acemblyPep acembly.pep

LOAD GENOMIC DUPES (todo)
o - Load genomic dupes
    ssh hgwdev
    cd ~/mm/bed
    mkdir genomicDups
    cd genomicDups
    wget http://codon/jab/web/takeoff/oo33_dups_for_kent.zip
    unzip *.zip
    awk -f filter.awk oo33_dups_for_kent > genomicDups.bed
    mysql -u hgcat -pbigSECRET mm1 < ~/src/hg/lib/genomicDups.sql
    hgLoadBed mm1 -oldTable genomicDups genomicDupes.bed

FAKING DATA FROM PREVIOUS VERSION
(This is just for until proper track arrives.  Rescues about
97% of data  Just an experiment, not really followed through on).

o - Rescuing STS track:
     - log onto hgwdev
     - mkdir ~/mm/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/mm/sts.fa > fa.lst
     - pslOoJobs ~/mm ~/mm/rescue/sts/fa.lst ~/mm/rescue/sts g2g
     - log onto cc01
     - cc ~/mm/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl


