- handle HTS for terry

- change download to generate single md5 for pair of .gbff and .fsa
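
One way the single-checksum idea could look, as a minimal Python sketch. The function name `pair_md5` and its arguments are hypothetical and not part of the existing download scripts; the only assumption is that the two files are hashed in a fixed order (.gbff first, then .fsa).

```python
import hashlib


def pair_md5(paths, chunk_size=1 << 20):
    """Return one hex MD5 digest covering all given files, read in order."""
    digest = hashlib.md5()
    for path in paths:
        with open(path, "rb") as fh:
            # Read in chunks so large flat files are never held in memory.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

The digest changes if either file changes, so one stored value is enough to decide whether the pair must be fetched again.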

- lock some table during update (gbStatus).

- update naming between refseq and genbank should be made consistent,
  maybe include date in full?

- entries duplicated in updates get aligned multiple times, end up
  in index multiple times.  Need to make sure doesn't get loaded
  multiple times (see "AF530914")

- testing:
     - download, process
     - test/make-test-download

- selectVer field use is more limited, can it be dropped?

- clean up hgRelate junk

- all error messages with update should include the release too.
  - use common method.

- skip alignment if sequence is already outdated in an update?

- gb reference, authors, journal may occur multiple times, only first
  kept.  Re-enable warning in readOneField to get list.

- need to remove align work directory (with option to keep by date).

- **** deal with not downloading outdated refseq dailies.  Make sure that
       when the same version/moddate is in full and a daily, that
       full wins (this appears to not be the case).

- delete hgRefSeqStatus.c, hgRefSeqMrna

- gbSanity:
      - add check of sequence in fasta

- keep versions in accession tables

- need to verify that all of the features that can contain db_xref attributes
  are handled by gbToFaRa (or we just select the ones we are interested in).

- check that we still have matt's change to not abort on a bad extFile table.

- delete hgRefSeqStatus

- add checks of refFlat, refGene, refAlign

- gbGetTest creates empty .seq files, and not gzipped ones

- AJ431705 - an EST in gbest28.seq.gz,
   error in  download/genbank.132.0/daily-nc/nc1113.flat.gz, made it an
   mrna.  (later corrected);

- why does reflink store product names as both id and string,
  also annoying that id doesn't follow convention of other tables
  that the name is the same.

- gbBlat location in alignment job file may not be valid if run on
  a different machine.

- empty processed fasta files are created.


- build geneName table for Fan from /gene= features.
        http://www.gene.ucl.ac.uk/nomenclature/
        http://www.gene.ucl.ac.uk/public-files/nomen/nomeids.txt
   Issues:
       Names with spaces: AB005521S1 /gene="ppar gamma"
       Weird names: AB006000 /gene="hchm-I(3-6)"
                    AAU57044 /gene="gamma1/gamma2"
       Multiple names: AB001517 /gene="TMEM1", /gene="PWP2", /gene="KNP-I"
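
The name issues above (spaces, odd punctuation, several /gene= qualifiers per entry) can be illustrated with a short Python sketch. `gene_names` is a hypothetical helper, not the gbToFaRa code, and it assumes the qualifier text has already been pulled out of the flat file as one string.

```python
import re

# A /gene= qualifier value is everything between the double quotes.
GENE_QUALIFIER = re.compile(r'/gene="([^"]*)"')


def gene_names(feature_text):
    """Return every /gene= value in order of appearance, without repeats."""
    names = []
    for name in GENE_QUALIFIER.findall(feature_text):
        name = " ".join(name.split())  # collapse whitespace runs and line wraps
        if name and name not in names:
            names.append(name)
    return names
```

Keeping the result as a list, rather than a single string, leaves the choice of which name to display to the caller.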


- looks like refseq locus can have multiple accessions!!
        LOCUS       AOC2
        ACCESSION   NM_009590
                             /gene="AOC2"
        ACCESSION   NC_001776
        ACCESSION   NC_002184
        
- *** raToFaRa needs to parse multiple gene names, which might contain
      spaces  refseq gene name perhaps should be locus rather than 
      /gene=.  Old used CDS /gene= name.

  -   For NM_130786 (reviewed), we have
        LOCUS       A1BG
        gene            1..3386
                     /gene="A1BG"
                     /note="synonyms: A1B, ABG, GAB"
      For NM_020469 (provisional)
        LOCUS       ABO
             gene            1..1065
                     /gene="ABO"

      For hugo CCRN4L ->AF183961, NM_012118
        LOCUS       AF183961
        ACCESSION   AF183961

      For ROBO2 -> AF040991
        LOCUS       AF040991
             gene            <1..>854
                     /gene="robo2"
             CDS             <1..>854
                     /gene="robo2"

- specify --buffer-size for the gbAlignFinish sort commands.

- somehow report gbLoadRna warnings.

- what would be required to use a database for data:
   - browser db would have to have copy of databases to do select.
   
- make blat take coordinates to extract from nib.

- put hgLoadTabFile back to the way it was if we don't use any more.

- index access patterns:
   * could we use entries that have just the latest processed and latest
     aligned?  Think we need processed matching aligned version.
   - processed - just creates index files, doesn't read.
   - gbAlignGet
        - does one update at a time, 
        - loads processed for current release.
        - if full, loads previous release aligned for genome (not processed)
        - loads current processed (but not aligned, since working on an
          unaligned update).
        - foreach processed entry in update:
          gbUpdateFindNeedAligned:
            - check if proc-entry is new or version changed in this update
              - if have prevAligned, check prev aligned entry.
        - foreach fastaRec, check entry to see if selected.
      **  this could load a single processed update at a time, but needs
          entire previous aligned.

   - gbAlignInstall:
        - does one update at a time, 
        - loads processed for current release.
        - if full, loads previous release aligned for genome (not processed)
        - gbUpdateFindNeedAligned:
           - to traverse processed + prevAlign and find
             all entries to migrate in current alignment.
        - foreach prevUpdate, traverse psls and check if flagged for migration
        - traverse psl, and save flagged
        - gbUpdateFindNeedAligned - build index

   - gbLoadRna
        - load processed and aligned.
        - load status table and add aux info:
           - load gbStatus  *big*
           - foreach gbStatus selected 
                - not in aligned -> mark deleted
                - version increased -> mark seqChg
                - modDate increased -> mark metaChg
                - else -> mark noChg
           - foreach aligned
                - not in gbStatus -> mark new
           - foreach seq not in gbStatus 
                - flag orphaned
        - read ra files

   - gbSanity
        - 
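
The gbLoadRna status marking listed above can be sketched in Python as follows. This is an illustration under assumptions, not the gbLoadRna implementation: both inputs are hypothetical dicts mapping accession to a (version, modDate) pair, and modDate is any comparable value such as an ISO date string.

```python
def classify(status, aligned, seq_accs=()):
    """Compare gbStatus rows against the aligned index.

    status, aligned: dict of acc -> (version, modDate)
    seq_accs: accessions present in the seq table
    Returns (states, orphans).
    """
    states = {}
    for acc, (version, mod_date) in status.items():
        if acc not in aligned:
            states[acc] = "deleted"
            continue
        new_version, new_mod_date = aligned[acc]
        if new_version > version:
            states[acc] = "seqChg"      # sequence changed
        elif new_mod_date > mod_date:
            states[acc] = "metaChg"     # only metadata changed
        else:
            states[acc] = "noChg"
    for acc in aligned:
        if acc not in status:
            states[acc] = "new"
    # seq entries with no gbStatus row are flagged as orphaned
    orphans = sorted(acc for acc in seq_accs if acc not in status)
    return states, orphans
```

The version test comes before the modDate test, so an entry whose sequence and metadata both changed is reported as seqChg, matching the order of the list above.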

- Combine type and source DB in gbStatusTbl in the same manner as seq
  or make seq compatible???

- should be able to load aligned/processed for one update?? maybe not
  always useful because of metaDataChg being in different update.

- `select' struct and `select' flag names confusing

- check size of hash tables

- rerunning gbAlignSetup on partially completed jobs doesn't appear to
  generate a subset.  Also, there is no way to check if it completed
  anyway, since we don't make jobs without output.  Probably need
  a sanity check.

- need to make sure we have new version of programs:
   - pslReps
   - hgLoadPsl
   - pslIntronsOnly

- mrna names are varchar(255), could probably be shorter.

- frequent failures doing download; need a retry mechanism.
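
A retry mechanism could be as small as the following Python sketch. `with_retries` and all of its parameters are hypothetical; the real download code is perl, and which errors are worth retrying is an assumption here (any OSError).

```python
import time


def with_retries(action, attempts=5, first_delay=1.0,
                 retry_on=(OSError,), sleep=time.sleep):
    """Call action(); on a retryable error wait, double the delay, try again."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the failure
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a download step would wrap a single file transfer in `action`.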

- perl error messages are a bit ugly due to the trap handling.

- have alignFinish write files directly, then install index from them.
  Saves one I/O pass.

- add sanity check for having some intron psls

- mrnaClone table is huge, maybe shouldn't be a unique string table.


- PENDING TASKS
   - deployment doc

- should refseq pep sequences be in gbStatus?

- double .tmp: ls processed/genbank.132.0/full/
   mrna.fa.tmp.tmp  mrna.fil.tmp  mrna.gbidx.tmp  mrna.ra.tmp

- `delete outdated alignments' alignment message should be entery/level

- gbLoadRna should generate errors in some cases where nothing is done.

- add doc about the way to correct a full download with a corrupt file
  is to remove the corrupt file and the *.md5 file.

- gbLoadRna - don't generate `delete outdated', etc. messages if there are
  no outdated entries.

- it would be nice to have each aligned update have all of the
  sequences in it aligned, even if they are not seqChg or new.  This
  would make removing an update less damaging.  Maybe migrate from
  other update???

- build list of refseqs without dbref for mim and send to ncbi

- **** Have some limit to prevent deleting large number of entries

- reflink diffs:
    - ~92 refseqs (132.0) don't have OMIM IDS because the
        gbff don't have FEATURES/gene/db_xref="MIM:191195"
      need to report this to ncbi

- **** Corrupt (truncated) alidx created and not detected.  Not sure
       how this happened or how to detect.  sqlnum didn't report error
       was part of the problems.  Maybe reread after writing, but before
       closing???

- Determine if all refseqs should have mim/locus links ids, and if so, report
  as an error.

- add better bookkeeping during the alignment process, keeping lists
  of acc that aligned or did not align, where they aligned and 
  ones that need migrated.  Also add bookkeeping for intronPsl

- change est-intron check to include other splice sites.

- need method to migrate links

- blat tmp file for  chrX.nib:chrX_5:39292303-56445259 is /var/tmp/gbBlat.19070.tmp/chrX_5.psl

- psl and oi can end up in different order if multiple align to the same loc.

- selectOrientInfo doesn't take into account multiple alignments of
  the same acc at the same loc, so OI ends up larger
    - fix #ifdef-ed check in gbAlignInstall

- aligned files should be write protected and checksummed

- finish align version is a bit weird; gbAlignInstall makes noise
  all of the time, but the higher-level script doesn't, which is 
   a bit confusing to debug.

- report mRNA that is really intergenic

- add data/ route for directory

### Task list ###
  - cleanup of load directory
  - make sure index files are carefully checked for write errors, we
    have a couple of truncated ones; also check when parsing.
  - add processing of HTS sets

- add checks for *.alidx files being there for *.gbidx.

- X63368.2, in genbank.133.0 gbpri23 is marked as a DNA.!

- fix paths in gbBlat

- move sqlTrace global to sqlConnection

- add some kind of complete flag for alignments

- Add sanity checks of release and update info in gbStatus.

- check index of gbStatus, seq (include extFile id).

- use blat, etc from our own directory.

- does seq table index typeSrcDbAcc2(type, srcDb, acc(2))),
  split so that it works without acc(2)


- looks like some mm2 are `Mus musculus musculus'

- Update of file names:
  - if seq release is older than current release, but seq not changed,
    add to fa update list.
  - 

- make sure refseq prot fa gets updated with seq fa.

- *** Need to have a way to track alignment version, to handle realignment
  with different parameters.  

- refseq peptides are not deleted when no longer referenced

- keep our own copy of all progs

- refseq peptide sequences that are no longer referenced are not deleted
  and don't have their paths changed

- use para output checks for gbBlat

- don't do unneeded steps in dbload.

- always log state counts in gbLoadRna

- add indices to seq to speed up gbSanity.


-** these accessions had multiple portions of the mRNA align to the
    exact same range of the genome.  Check out why??
    AF064839.1
    AK025862.1
    AK025986.1
      X98410.1

- Remove OI count check in migrate, due to above case.  We don't count
  intronPsls anyway.  Add sanity check that all intronPsls have psls
  and all psls have OIs..

- alignrun fails if nothing to align (no jobs file generated).  Should
  check.

- add better parameter validation to gbblat (require .psl.gz)

- gbAlignInstall should print a message about what is being processed.

- add checks of database for initialLoad flag.

- both gbLoadRna and gbSanity have messages about `loading' tables, although
  having the opposite meaning.  Confusing.

- don't need to build xeno.alidx for refseq..

- either remove hgRefSeqMrna or change to use function now in genePred.h

- have scripts send reminder if semaphore files detected which are more than a 
  day old
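
The stale-semaphore check can be sketched in Python like this. `stale_files` is a hypothetical helper; how the semaphore files are found and how the reminder is sent are left out because the notes do not specify them.

```python
import os
import time

ONE_DAY = 24 * 60 * 60


def stale_files(paths, max_age_seconds=ONE_DAY, now=None):
    """Return the paths whose modification time is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            mtime = os.path.getmtime(path)
        except FileNotFoundError:
            continue  # removed between listing and checking; nothing to report
        if now - mtime > max_age_seconds:
            stale.append(path)
    return stale
```

A cron-driven script could call this on the known semaphore paths and send mail only when the returned list is not empty.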

- fully implement gbSeq.version files, including adding to gbSanity

- add orgCat to gbStatus

- Make process/align files read-only
        also log files.

- *** Would be really good to figure out how to add a date to
      the genbank update directories.

- normalize verbose messages

- fix weird clone ids:
        AF072864	IMAGE:25838: 138-g24
        AB063297	IMAGE: 1837972  (space)
        IMAGE Consortium ID: 504484
        AF132495	IMAGE: 512859, 731044, 509500
        AF159441	ac34a06.sl; IMAGE: 1375897
        NM_020359	ac34a06.sl; IMAGE: 1375897
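
The clone-id variants listed above can all be handled by one tolerant pattern, shown here as a Python sketch. `image_ids` is a hypothetical helper, and the pattern only covers the forms seen in these examples; other forms in the data would need their own cases.

```python
import re

# "IMAGE", optionally "Consortium ID", a colon, then one or more
# comma-separated integers.  Anything after the number list is ignored.
IMAGE_IDS = re.compile(
    r"IMAGE(?:\s+Consortium\s+ID)?\s*:\s*(\d+(?:\s*,\s*\d+)*)"
)


def image_ids(clone_field):
    """Return all IMAGE clone ids found in a clone field, as integers."""
    ids = []
    for number_list in IMAGE_IDS.findall(clone_field):
        ids.extend(int(n) for n in number_list.split(","))
    return ids
```

For `IMAGE:25838: 138-g24` the match stops at the second colon, so the trailing `138-g24` is not mistaken for an id.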

                
- should filter stuff be kept in gbToFaRa, or made command line??

- Make sure genePred conversion is sane with what Fan did

- add $root/var/ directory for logs and run files.

- is seq.gb_date really needed???

- allow filter to be specified on gbToFaRa cmd line.

- use explicit paths to executables to make sure we don't pick up wrong ones.

- (AF201929) in the "SARS mRNA" track is actually a full murine hepatitis strain 2 viral genome (bad label in

- need a way to check if mgc tables are out-of-date.

- dealing with maxShrinkage errors is a pain, need to rerun by hand to get list to
   check, then run again to override.

- lock gbStatus or gbLoaded during update, can be used as a check by
  other programs while not blocking browser

- alignment parameters

 slightly better to add -q=rna on the native mRNA as well. 

    refseq mrna: (drop -trimHardA)
         blat -q=rna -fine -ooc=/scratch/hg/h/11.ooc
         pslReps -minCover=0.15 -sizeMatters -minAli=0.98 -nearTop=0.001
    native mrna:
         blat -q=rna -fine -ooc=/scratch/hg/h/11.ooc
         pslReps -minAli=0.98 -sizeMatters -nearTop=0.005
    native est:
         blat -mask=lower -ooc=/scratch/hg/h/11.ooc
         pslReps -minAli=0.98 -sizeMatters -nearTop=0.005
    xeno mrna:
         blat -q=rnax -t=dnax -mask=lower
         pslReps -minAli=0.25
    xeno est:
         blat -q=dnax -t=dnax -mask=lower
         pslReps -minAli=0.10
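
If the parameters above move into configuration, they could be held as a table like the one in this Python sketch. The option values are copied from the list above; the dictionary layout, the function names, and the file arguments are hypothetical.

```python
OOC = "-ooc=/scratch/hg/h/11.ooc"

# Per-category options, copied from the notes above.
ALIGN_PARAMS = {
    "refseq mrna": {
        "blat": ["-q=rna", "-fine", OOC],
        "pslReps": ["-minCover=0.15", "-sizeMatters",
                    "-minAli=0.98", "-nearTop=0.001"],
    },
    "native mrna": {
        "blat": ["-q=rna", "-fine", OOC],
        "pslReps": ["-minAli=0.98", "-sizeMatters", "-nearTop=0.005"],
    },
    "native est": {
        "blat": ["-mask=lower", OOC],
        "pslReps": ["-minAli=0.98", "-sizeMatters", "-nearTop=0.005"],
    },
    "xeno mrna": {
        "blat": ["-q=rnax", "-t=dnax", "-mask=lower"],
        "pslReps": ["-minAli=0.25"],
    },
    "xeno est": {
        "blat": ["-q=dnax", "-t=dnax", "-mask=lower"],
        "pslReps": ["-minAli=0.10"],
    },
}


def blat_command(category, target, query, out_psl):
    """Build the blat argument list for one alignment category."""
    return ["blat", *ALIGN_PARAMS[category]["blat"], target, query, out_psl]


def psl_reps_command(category, in_psl, out_psl, out_psr):
    """Build the pslReps argument list for one alignment category."""
    return ["pslReps", *ALIGN_PARAMS[category]["pslReps"],
            in_psl, out_psl, out_psr]
```

Keeping each category's options in one place makes it easier to record which parameters produced a given set of alignments.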

- mgc tar file has abs paths

- hmm, looks like:refseq status are: Predicted,Provisional,Reviewed,Unknown
        -validated lost, also INFERRED,

- why doesn't gbLoadRna use the ignore tables from in gbRelease.

- dir structures
    doc/
    etc/
    test/
    src/
        align/
        dbload/
        download/
        gbGetSeqs/
        gbLoadRna/
        gbSanity/
        gbToFaRa/
        hgLoadSeq/
        inc/
        lib/
        makefile
        mgc/
        process/
        selectWithPsl/

- *** Some refseqs don't have status: NC_001224  Is this ok??
        - why does this get saved to begin with??

- add mgc gene name to mgcStatus.
   - need to handle MGC being select multiple times

- change nib glob to nib directory.

- errors can be hard to see in the log files:
  mkdir work/initial.ci1 at /cluster/store5/genbank/bin/../lib/gbCommon.pm line 176. at /cluster/store5/genbank/bin/../lib/gbCommon.pm line 176.
  command failed: gbAlignSetup -workdir=work/initial.ci1/align -verbose=1 -clusterdir=/iscratch/genbank ci1 '/scratch/hg/ci1/nib/*.nib' "" at /cluster/store5/genbank/bin/../lib/gbCommon.pm line 176. at /cluster/store5/genbank/bin/../lib/gbCommon.pm line 176.

- make sure directories have the group sticky bit maintained.

- make sure all I/O inside of perl does error checking.

- need to follow Kent argument conventions completely:
   bin/gbDbLoadStep -verbose=1 -initialLoad ci1 -noPerChrom=ci1 &

- need to split mgcDownloadStep into download and process scripts, to
  make it easier to rerun when processing fails.

- alignment tmp files should not be automatically purged on initial load

- iserver directories not group write?  need to doc cleaning this
  up.

- some data things to looks at:

  - bad CDS annotation:
    - MGC BC012901
    - MGC BC001477
   - NM_153032/BC003669 - very different CDS annotations.

     -mgcs with no cds annotation

- check for loading with old table (string tbl crc column works)

- POSSIBLE BUG: If an update dbload started while alignments are being
  finished, what would happen.

- change align and load functions to do one database at a time;
  keeps special options simpler at the cost of more log files.

- NCBI seems to be writing refseq files in an unsafe way:
       size at end of download (769484668) does not match size at start (764975036) for ftp://ftp.ncbi.nih.gov/refseq/cumulative/rscu.gbff.Z at /cluster/store5/genbank/bin/../lib/gbFtp.pm line 190. at /cluster/store5/genbank/bin/../lib/gbCommon.pm line 179.
       2003.06.17-21:55:32.download.log
       mod time: Jun 18 08:15

- aligned/genbank.136 didn't end up with right perms *** (work dirs too).

- copy log message is confusing, only shows the cpio output of 0 blocks.

- You should probably doc the fact that hgwdev's (dbload) lock is global.

- add lock of gbStatus during load, just to be paranoid.

- add in exists checks to blat para.

- make sure that full is done first on a new release; check the
  logic with the release update.

 
- refseq CDS problems:
  Warning: NM_012577: malformed RefSeq CDS: join(1,2..633)
  Warning: NR_001363: malformed RefSeq CDS: 
  Warning: NR_001368: malformed RefSeq CDS: 
  - also, still problem with some MGC's cds.

- add type and source db to image clone; makes it easy to clean out.

- can the refseq pep sequences be extracted from the genbank records
  rather than downloaded separately?

- index string tables as full text

- options for using /bluearc
- gbExtFile table includes a . in path:
  /gbdb/genbank/./data/processed/genbank.136.0/full/mrna.fa

- move species names to conf.

- move blat parms to conf

- have scripts print full usage message.

- alignment seems slower; build times for old hg15:
    mrna: Completed: 546 of 546 jobs
    CPU time in finished jobs:     149115s    2485.24m    41.42h    1.73d  0.005 y
    IO & Wait Time:                116248s    1937.47m    32.29h    1.35d  0.004 y
    Average job time:                 486s       8.10m     0.14h    0.01d
    Longest job:                     6977s     116.28m     1.94h    0.08d
    Submission to last job:          6977s     116.28m     1.94h    0.08d

    est: Completed: 6006 of 6006 jobs
    CPU time in finished jobs:     670646s   11177.43m   186.29h    7.76d  0.021 y
    IO & Wait Time:               1756084s   29268.07m   487.80h   20.33d  0.056 y
    Average job time:                 404s       6.73m     0.11h    0.00d
    Longest job:                     1100s      18.33m     0.31h    0.01d
    Submission to last job:          2839s      47.32m     0.79h    0.03d

    xenoMrna: Completed: 25662 of 25662 jobs
    CPU time in finished jobs:    2721254s   45354.23m   755.90h   31.50d  0.086 y
    IO & Wait Time:                 76077s    1267.95m    21.13h    0.88d  0.002 y
    Average job time:                 109s       1.82m     0.03h    0.00d
    Longest job:                     6739s     112.32m     1.87h    0.08d
    Submission to last job:          7650s     127.50m     2.12h    0.09d

    xenoEst: Completed: 40950 of 40950 jobs
    CPU time in finished jobs:   60566146s 1009435.76m 16823.93h  701.00d  1.921 y
    IO & Wait Time:               3218137s   53635.62m   893.93h   37.25d  0.102 y
    Average job time:                1558s      25.96m     0.43h    0.02d
    Longest job:                    35550s     592.50m     9.88h    0.41d
    Submission to last job:        130925s    2182.08m    36.37h    1.52d
    
    refSeq: Completed: 44 of 44 jobs
    CPU time in finished jobs:      11930s     198.84m     3.31h    0.14d  0.000 y
    IO & Wait Time:                 16931s     282.18m     4.70h    0.20d  0.001 y
    Average job time:                 656s      10.93m     0.18h    0.01d
    Longest job:                     2355s      39.25m     0.65h    0.03d
    Submission to last job:          2402s      40.03m     0.67h    0.03d

    total CPU: 41.42+186.29+755.90+16823.93+3.31 = 17810.85h
    total wait: 32.29+487.80+21.13+893.93+4.70   =  1439.85h

without refseq:
    total CPU: 41.42+186.29+755.90+16823.93 = 17807.54h
    total wait: 32.29+487.80+21.13+893.93   =  1435.15h
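
The totals above can be rechecked with a few lines of Python. The per-track hours are copied from the para summaries; `Decimal` is used only so the two-decimal sums come out exact.

```python
from decimal import Decimal

# Hours per track, copied from the para summaries above.
cpu_hours = {
    "mrna": Decimal("41.42"),
    "est": Decimal("186.29"),
    "xenoMrna": Decimal("755.90"),
    "xenoEst": Decimal("16823.93"),
    "refSeq": Decimal("3.31"),
}
wait_hours = {
    "mrna": Decimal("32.29"),
    "est": Decimal("487.80"),
    "xenoMrna": Decimal("21.13"),
    "xenoEst": Decimal("893.93"),
    "refSeq": Decimal("4.70"),
}

total_cpu = sum(cpu_hours.values())
total_wait = sum(wait_hours.values())
cpu_without_refseq = total_cpu - cpu_hours["refSeq"]
wait_without_refseq = total_wait - wait_hours["refSeq"]
xeno_est_share = cpu_hours["xenoEst"] / total_cpu

print(total_cpu, total_wait)
print(cpu_without_refseq, wait_without_refseq)
print(f"xenoEst share of CPU: {xeno_est_share:.1%}")
```

The sums agree with the figures in the notes, and xenoEst alone accounts for about 94% of the CPU hours.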

 
- version remains on refseq pep sequences in seq table: NP_787018.1

- if type changes on initial load, it's not detected.  A verification
  program run after gbProcess would be good.

- NCBI index has species wrong for AB107958

- NC_005036 is incorrectly classified as mRNA

- some kind of check for not running gbAlignStep on the right machine.


- http://pubcrawler.gen.tcd.ie/program.html - program that monitors genbank.

- gbGetSeqs should have an option to take an accession.

- storing fa offset in gbidx would speed up gbGetSeqs

- para make queue size maybe too small

- in sqlLoadTablFile make this version 10 robust:
  boolean isMySql4 = (conn->conn->server_version[0] > '3');
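
The single-character comparison above breaks once the major version has two digits, since a leading '1' compares lower than '3'. A Python sketch of the more robust check follows; the function names are hypothetical and the version strings in the usage note are made-up examples, and the actual fix would be made in the C code.

```python
import re


def server_major_version(version_string):
    """Return the leading integer of a server version string."""
    match = re.match(r"\s*(\d+)", version_string)
    if match is None:
        raise ValueError(f"cannot parse server version: {version_string!r}")
    return int(match.group(1))


def is_mysql4_or_later(version_string):
    """True when the major version is 4 or higher, including 10 and up."""
    return server_major_version(version_string) >= 4
```

With this, "3.23.58" is reported as older than 4, while "4.0.27" and "10.1.2" are both reported as 4 or later.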

- Use livelists to construct deleted accession tables.  Since this doesn't
  give an explicit status, only the latest one can be used, since
  suppressed entries can come back to life. This also doesn't work for
  refseq, but could be constructed by downloading the cumulative
   ftp://ftp.ncbi.nih.gov/genbank/livelists

- TPA records are not in downloads, where are they?  e.g. BK000688

- make download file of xeno sequences, peptides, run on download server

- *** bin and start indices are not built for all_mrna, all_est when
  chr* tables are not built.  Also, verify refSeq tables.

Warning: NM_012577: malformed RefSeq CDS: join(1,2..633)
Warning: NM_181796: malformed RefSeq CDS: join(76,77..708)
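
Both warnings involve a join whose first piece is a single base that abuts the next range. A tolerant parser could accept this form and merge the pieces, as in the Python sketch below. This is an assumption about one possible handling, not what gbToFaRa does; `parse_cds` is a hypothetical name, and complement() and other location operators are not covered.

```python
import re

# One piece of a location: "N" or "N..M", with optional < and > partial marks.
_PIECE = re.compile(r"^<?(\d+)(?:\.\.>?(\d+))?$")


def parse_cds(location):
    """Parse 'N..M' or 'join(...)' into a list of (start, end), merging
    pieces that abut exactly."""
    text = location.strip()
    joined = re.fullmatch(r"join\((.*)\)", text)
    pieces = joined.group(1).split(",") if joined else [text]
    ranges = []
    for piece in pieces:
        match = _PIECE.match(piece.strip())
        if match is None:
            raise ValueError(f"unsupported CDS location: {location!r}")
        start = int(match.group(1))
        end = int(match.group(2) or match.group(1))  # single base: end == start
        if ranges and start == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], end)  # abuts previous piece: merge
        else:
            ranges.append((start, end))
    return ranges
```

An empty location, as in the NR_ warnings elsewhere in these notes, still raises an error, so those entries would continue to be reported.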

- Make this a warning:
   modDate for Y13116 is release (881568000) is before one in database (1047369600)
  so that suppressed entries will not be ignored when unsuppressed

- all_mrna, all_est don't need bin index if per chrom tables are being built.

- add verification of refSeqSummary to gbSanity

-  DM1 has 17 NR_* RNAs listed because molecule is set to mRNA; report to
   NCBI.

- make table of db_xref so we can link to flybase, etc (low-pri).

- rawPsl files don't get migrated.

- take over creation of mrnaRefseq table.

- align.none file stayed around on failed align; should purge up front.

- Need to make organism category more general, e.g.  D. melanogaster mRNA/EST
  tracks for D. pseudoobscura, as a substitute for D.pseudo's native mRNA/EST,
 
- mm4 NM_009980 was suppressed on Oct 5th.  It should have been removed from
  the ~Oct 25th release of genbank 138

- browser doesn't display mrnas/ests if nochrom loads are specified.

- move all genome specific config to genbank.conf, including gbBlat stuff.

- jk on more orgCat: Maybe:   native, close, medium, far
  where we use untranslated blat for native and close,  translated
  blat for medium, and some hypothetical more sensitive something for
  far.  For backwards compatibility if it's an issue we can make
  xeno a synonym for medium.

  - allow multiple categories to end up in a single track 
  - also generalize splitting of data from a single track,
     like intronEst

- riken CDS is getting lost.

- looks like xenoRef* doesn't get deleted with -deleteList.  

- sort tables on initial load.

- add refseqMrna table (or something like that) to map refseqs to the
  sequences they were derived from.

- should alignments be done with overlapping regions instead of 
  on contig boundaries??

- delete gbDelete_tmp in gbLoadRna even if not used, as it sometimes
  hangs around.

- the cantSequence+fullLength ones are loaded on mgc because it doesn't check the
  cantSequence field.  weird, this may be a problem on their end??? 

- investigate complaints on galGal2 that are not really errors.

- add refseq summary to refseq details.
- refseq details only allows access to seq for first alignment.

- refseq proteins out-of-sync in gbSeq (joiner check).

    Error: 4882 of 21588295 elements of key gbSeq.acc are not in 'full' 
    rn3.gbStatus.acc line 527 of all.joiner
    Example miss: NP_036620
    - gbStatus doesn't have proteins.

- load XM_* and `dna based genes' from refseq.

- save protein product acc for mrnas so it can be searched
- save GI numbers.
- make est downloads
- add mechanism to specify > 4gb tables for panTro xenoEst, etc
    alter table panTro1.xenoEst MAX_ROWS = 65527380 AVG_ROW_LENGTH=130;

- for scaffold-based genomes (ci1), the tName index is not big enough.
  - maybe put some sanity checks of indices when loading

- would be nice to check for stop file in script rather than after starting load

- use 2bit instead of fasta for sequences; looks like ~75% size reduction.

- mod date in mrna table is off by one month (early); same with gbidx files.
  appears to be problem in gbff parser, as this problem exists in
  non-incremental databases.  However some weird behavior on 
  last day of month:
  rn3          AF076856         1998-10-01        31-OCT-1998
  rn2              "            1998-09-31        31-OCT-1998
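
One possible cause, stated here as an unconfirmed guess, is a zero-based month index being stored as if it were one-based: OCT would become 09, and the impossible 1998-09-31 could then be normalized to 1998-10-01 by whatever stores it. The Python sketch below shows a parse of the GenBank date form with an explicit one-based month table; `parse_gb_date` is a hypothetical name, not the gbff parser.

```python
import datetime

# One-based month numbers: JAN is 1, DEC is 12.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def parse_gb_date(text):
    """Parse a GenBank LOCUS date such as '31-OCT-1998' into a date."""
    day, month, year = text.strip().split("-")
    return datetime.date(int(year), MONTHS[month.upper()], int(day))
```

Using `datetime.date` also rejects impossible dates such as September 31 instead of silently rolling them over.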

  
-  gbSanity should be smarter about updates that have not been loaded, or
   at least a better error message, not:
      Error: danRer1: CO400007: not in mrna table, referenced in gbIndex
   also include update in gbIndex...

- refFlat geneName not updated
Error: danRer1: gene danRer1.refFlat:1402 NM_198363 chr3: NM_198363 geneName "tob1" does not match refLink name "tob1a"
Error: danRer1: gene danRer1.refFlat:1431 NM_199678 chr3: NM_199678 geneName "zgc:65984" does not match refLink name "hmox1"
Error: danRer1: gene danRer1.refFlat:1731 NM_131823 chr6: NM_131823 geneName "iro1" does not match refLink name "irx1b"
Error: danRer1: gene danRer1.refFlat:1853 NM_131824 chr7: NM_131824 geneName "iro5" does not match refLink name "irx5a"
Error: danRer1: gene danRer1.refFlat:1854 NM_131267 chr7: NM_131267 geneName "iro3" does not match refLink name "irx3"

- need to be able to incrementally add/remove native/xeno orgCats.

-  multi-line /product= are truncated to the first line:
   NM_033227 NM_021201 NM_033227 NM_021201 NM_003808 NM_052901 NM_175739

- need to be able to have reload.acc local to a specific database

- is all_mrna index (tName(8),tEnd) really useful?
  NOTE: mysql will use first column of multicolumn index,
  so one index of tName,tStart,tEnd maybe all that is needed

- add smarter poly-A trimming

- refseq protein version can be different than mrna version; not sure
  we track this.

- lower number of retries on para make, or add a smarter option that
  doesn't retry if a certain threshold of jobs has crashed.

- NM_005541 split into two pieces because it has a gap in an intron
       -human chr1 uses 362M for blat
      - allow for small in windows, since these can be in introns
      - windows don't really have to be a multiple of overlap, can
        have relatively large windows with small overlap.

- include version number in download fasta.

- add a weekly analyze table.

- could the command line parsing be more integrated with genbank.conf?
  so it becomes an override mechanism.

- DDC  NM_000790.2 has definition line truncated (or maybe product line).

- convert over to using pipeline module.  Hiram needed pipeline write 
  from stdout, so add function to do this.  Also add unit test.

- should TPA mrnas be included in genbank track.

- why is this not aligned
      U01833  ds-mRNA    linear   PRI 03-FEB-1995
  
- from gbAlignSetup: tmp job file exists: suggest course of action here.

- current mondo job distribution runs longest jobs first.
  - need to consider genome sequence size when partitioning jobs.

- if ignore.idx is modified between gbAlignGet and gbAlignFinish, gbAlignFinish
  will get an error.

- Daniela asks (2005/01/03):
    P.S.  Are you planning to map the laevis clones against the tropicalis
    genome as well?  The issues of paralogs as well as the fairly high
    difference in sequence would make this a much more complicated, but I think
    useful.

- LocusLink to Entrez Gene link conversion.  Change refLink table, maybe
  include both gene id and locus id:

      /db_xref="GeneID:3952"
      /db_xref="LocusID:3952"
  drop old refseq build programs from tree.

- dna based ones: 
      - select * from kgXref where kgID like "NM_%" limit 10;
        NM_000050 -> P00966 swiss Cross-references ->  L00084 is dna

      - others
       SPRY3:
          AJ271735 is a DNA, thus it was not included in our current build process.
          AF041038 seems to cover only part of SPY3_HUMAN.


        use kgHg17BTemp
            locus2Acc0;
            locus2Ref0
        use proteins;
        describe hugo;

   KG gene lookup:
      start with swissprot accession (spAcc)
      spAcc lookup gbAccs in swissProt
      if any of gbAccs in mrna table, then done (on to blat, etc)
      
      (see dnaGene.c)
      foreach refAcc in locus2Ref:
            locusId = locus2Ref[refAcc].locusId
            lookup locusId in locus2acc0 with seqType='g', 
            gbAcc = first of these, now have a (refAcc,gbAcc) pair
            lookup refAcc in proteins.hugo, 
            spIdHugo = proteins.hugo.swissprot

            # need to get display id
            lookup spIdHugo in proteins.spXref 
            spDisplayId = proteins.spXref.displayID

            if spDisplayId == NULL
               # look at all matching spxref2
               foreach gbAcc in locus2Acc0 with locusId:
                     gbAcc in proteins.spXref.extAC
                     spDisplayId = spXref.display.ID, break
            if spDisplayId != NULL:
               get refGene, output refGene + spID + spDisplayId
            

       spAcc/spDisplayId, alignment

- some entries may have multiple gene ids (e.g. NM_066958)

- tmp load files should be removed first or they get write permission
  errors if loaded by a different user.

- mrna cell column comes from /cell_line, but some have /cell_type,
  eg AY911673.

- keywords seem to be somewhat useless; not searched by hgFind.
   - "htc" has  378885 hits
   - field is not split by ;, . is not removed
   - see NCBI genbank doc on why this is mostly useless

- gbDbload should contain orgCat as well, otherwise, adding a new orgCat
  may not be detected as out-of-date.

- should be able to have a lift with only some of the chroms.

- issues when trying to detect coding special cases, see tools/checkSpecial.awk
  for a program to scan the ra file.  Need to report these:

    NM_002085: selcys transl_excep listed as OTHER
    NM_020860 no transl_except:        non-AUG initiation codon
    NM_033013 on NMspecialcases (non-AUG), but not flag; maybe other isoform
    NM_002537	ribosomal slippage (translational frameshift)	reviewed
                    no /exception
    NM_016178	ribosomal slippage (translational frameshift)	reviewed
                    no /exception
- need to deal with 
  NM_133378 TTN, which just has a /codon_start at CDS, also

            http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html
  indicates joins should be used for ribosomal slippage.

- why is AY073210  DNA in genbank (source of NM_001001809),
  found with mgc orthoGeneMap refSeqGetDerived
  
- CDS parser blows it:
 AK002774	Predictor,Longest-ORF)"
 AK003936	Predictor,Longest-ORF)
 AK005344	Predictor,Longest-ORF)"
 AK005444	Predictor)"

- gbLoadMRna produces message that is not in ignore.idx format:
  entry S79794 2005-07-02 organism previously specified as "Rattus sp." (xeno), is specfied as "Mus musculus" (native) in ./data/processed/genbank.148.0/daily.0803/mrna.gbidx, add one to ignored.idx

- refseqs end up in mrnaOrientInfo, which causes grief for the joiner

- this query is slow..
    explain SELECT acc,version,modDate,type,srcDb,orgCat,gbSeq,numAligns,seqRelease,seqUpdate,metaRelease,metaUpdate,extRelease,extUpdate,time FROM gbStatus WHERE (type='EST')  AND  (srcDb='GenBank')  AND  (acc LIKE 'AB');
    +----------+-------+---------------------------+------+---------+------+------+-------------+
    | table    | type  | possible_keys             | key  | key_len | ref  | rows | Extra       |
    +----------+-------+---------------------------+------+---------+------+------+-------------+
    | gbStatus | range | PRIMARY,acc,typeSrcDbAcc2 | acc  |      12 | NULL |    1 | Using where |
    +----------+-------+---------------------------+------+---------+------+------+-------------+
    maybe add a first two character of the acc column???

- 2bit for cDNA needs:
  - fast open and access (disk-based hash)
  - > 4gb size
  - expandable
  - live rsync
  - support incremental additions with rsync
  - store est direction (??)
  - store length of polyA tail/polyT head (??)

- Repeat filtering:
  - Jen suggest using amount of clone remaining after removing repeats rather
    than just fraction of repeats matching.  
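
    Sketch of the two measures side by side, assuming the cDNA is
    soft-masked (lowercase = repeat).  The function names and the cutoff of
    16 are illustrative only; this is not how the current filter computes it.

```python
def repeat_stats(seq):
    """Return (fraction_repeat, non_repeat_bases) for a soft-masked sequence.

    Assumes lowercase bases are repeat-masked and uppercase are not.
    """
    total = len(seq)
    repeat = sum(1 for base in seq if base.islower())
    non_repeat = total - repeat
    fraction = repeat / total if total else 0.0
    return fraction, non_repeat

def keep_by_remaining(seq, min_non_rep_size=16):
    """Keep a sequence if enough non-repeat bases remain (assumed cutoff)."""
    return repeat_stats(seq)[1] >= min_non_rep_size
```

    A long clone that is 80% repeat can still have plenty of unique
    sequence left, which a fraction-only test would miss.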

- genome categories:

  - finished (human)
  - well ordered whole genome shotgun (dog, mm2-mm6)
  - low coverage (< 4x), lots of contigs, N50 scaffold size < 1mb
  - not well ordered, cow! (baylor stuff)

- make sure blat still works with file of specs
- apply to nibs
- update usage message in polyInfo pslIntronsOnly
- min cover too high ***, max repeats too low.

- should be able to detect initial load, rather than require argument.

- got this error during load
  mySQL error 1050: Table 'xenoEst_new_tmp' already exists
  but for now completely disabled the copy-deletion optimization
  due to slowing down the servers.

- perhaps should include NR_ entries in refgene:
   this was NM_031926.1, replaced by NR_001541.1

-  xeno refseq NM_167334 on droAna2 had 41 poor alignments, some with huge
   introns.  Might need to add maxIntron.

- mm7 NM_207243 has a bunch of repetitive alignments to the same location with  3968, plus
  one longer one.  Should probably have the ability to toss weird overlaps where there is one
  better alignment.

- library table has truncated descriptions if entries are multi-line.

- some clones, (e.g. AA778675 ) have id in the form: /clone="1049033",
  can these be assumed to be IMAGE??  They are not currently loaded into
  the imageClone table.

- gbCdnaInfo.direction column defaults to 5' if it can't be figured out.
  It should be set to unknown; could also check the sequence for polyA/T.
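
  Sketch of the polyA/T check, falling back to unknown rather than 5'.
  The min_run threshold of 10 and the function names are assumptions; how
  a polyT head or polyA tail maps onto the direction column is left to the
  caller.

```python
def poly_sizes(seq):
    """Return (polyA tail length, polyT head length), case-insensitive."""
    s = seq.upper()
    tail_a = len(s) - len(s.rstrip("A"))
    head_t = len(s) - len(s.lstrip("T"))
    return tail_a, head_t

def poly_evidence(seq, min_run=10):
    """Classify as 'polyA-tail', 'polyT-head' or 'unknown'.

    'unknown' is returned when neither run reaches min_run (an assumed
    threshold) or when both do, since that case is ambiguous.
    """
    tail_a, head_t = poly_sizes(seq)
    has_a = tail_a >= min_run
    has_t = head_t >= min_run
    if has_a and not has_t:
        return "polyA-tail"
    if has_t and not has_a:
        return "polyT-head"
    return "unknown"
```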

- If sequence version changes in the same day, it's not updated:

    DQ230327
    DQ230328
    DQ230329
    DQ272519
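
    One way to express the intended rule, as a sketch: compare the version
    first and only fall back to modDate when the version is unchanged.  The
    function and argument names are hypothetical, and this is a guess at the
    desired behavior, not a description of what the loader does now.

```python
def needs_update(loaded_version, loaded_mod_date, new_version, new_mod_date):
    """Decide whether an incoming entry should replace the loaded one.

    Dates are 'YYYY-MM-DD' strings, so plain string comparison orders them.
    A version bump wins even when the modDate is the same day.
    """
    if new_version != loaded_version:
        return new_version > loaded_version
    return new_mod_date > loaded_mod_date
```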

-  NM_005577 multiple overlapping alignments due to repetitive sequence;
   can't be better filtered.

- gbExtFile YP_ proteins are included from daily updates

- danRer1 NP_ files got dropped from gbExtFile, but not updated in
  gbSeq *******

- YP_ sequences get added to gbSeq, but are not used.

- add bin column to genepreds

- do experiments to determine effect of incremental additions 
  leaving est tables unsorted (see ~/compbio/genbank/gbTblOrder/).


- repeat filter axes ESTs, etc for endogenous retroviruses; see
         ERVWE1  NM_014590
          hg16:chr7:91710112-91711730
          hg17:chr7:91742734-91744352

- increase mrnaToGene merge size

- change gbAlignStep, etc to allow options after the databases, since
  this is so familiar to people.

- gbLoadRna -drop seems to be broken in some cases (hgw8 tests)

- DQ075263 - two sequence versions on same mod date.

- hg17 NM_182901, example of UTR aligning in a different place

- perhaps should have separate HGC division mRNAs track?
  - are these all being included now??

- way to exclude annotated linker/vector from alignment

- use taxonomy tree to exclude sequences too distant to align

- double check mgcFull against /clone="MGC:4744 IMAGE:3536686"

- MGC tracks are built with strict, which means that a new genome requires
  trackDbLocal to be reloaded (rename this to trackDbMgc).

- MGC Cow generates No CDS defined: no protein translation for BC119821
  when clicking on "Reference genome protein".  This is due to trying to
  generate proteins for both alignments, one is partial.  Should only generate
  alignment of current protein.

- alignments that join tandem gene dups:
    mm8:NM_178882.2
    hg18:chr22:37,706,098-37,778,547
    hg18 BC029540.1 chrX:119,914,829-119,947,319
    mm9:chr7:91,891,203-92,412,482 (really bad)


- dropped by repeat masking.  Keep alignments to repeats only if there are
  just a small number of them
    hg18:AA309293,CX783937

   - Perhaps you could add that the alignment extends over
     the vast majority of the cDNA as well?   (JK)

- create problem library table, and color cDNAs and provide details
  based on it
   - Athersys RAGE Library
   - Other RAGE libraries?? 
           see http://www.nature.com/nbt/journal/v19/n5/full/nbt0501_440.html
   - Invitrogen CR mRNAs (http://fulllength.invitrogen.com/ in COMMENT)
   - JK: The ORESTES libraries (which I think, like the RAGE are
     just ESTs) should be taken with a large grain of salt.
       these are probably ok,see http://www.nature.com/tpj/journal/v2/n3/full/6500103a.html

   - how to select for above
        

- MGC details page link to CCDS shouldn't have error if ccdsMgcMap
  not built (or maybe a more informative message, nice to have link).
    - if one goes from CCDS details to mgc, there is no link back to    
      ccds.

- Include TPA data

- from Kim on RefSeq suppressed:
    Also, glad you mentioned the suppressed/replaced tracking --we are
    reporting suppressed and secondary accessions in two locations:

    1. reported per refseq release cycle in refseq/release/release-catalog/
    release19.removed-records.gz

    2. reported for a smaller subset of taxids (those in my pipeline) on a
    weekly basis in the refseq/special_requests/ directory. I think this is
    pretty close to a comprehensive report (for NMs) for these taxids.

    a) taxid2speciesname --the taxids in scope for this weekly reporting
    b) secondary_public --secondary accessions; secondary to primary
    accession correspondence (secondary data listed first)
    c) secondary_suppressed_final --secondary to primary correspondence,
    where the primary accession has been permanently suppressed and is not
    likely to become public again
    d) secondary_suppressed_temporary --secondary to primary correspondence,
    where the primary record has been temporarily suppressed and may be
    reinstated at a future date
    e) suppressed_final  --permanently suppressed
    f) suppressed_temporary --temporarily suppressed, may become public
    again

- look at PASA alignment stuff to see if there are any good ideas.
  http://pasa.sourceforge.net/

- keywords should be split at `;'
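
  Sketch of the split, assuming the raw KEYWORDS value may span several
  lines and ends with the usual '.' (function name is made up):

```python
def split_keywords(field):
    """Split a GenBank KEYWORDS value at ';' into a list of keywords.

    Continuation lines are joined with single spaces, the trailing '.'
    is dropped, and empty pieces are discarded.
    """
    text = " ".join(field.split())
    if text.endswith("."):
        text = text[:-1]
    return [kw.strip() for kw in text.split(";") if kw.strip()]
```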

- write doc on how to reproduce the alignment
    blat -noHead -repeats=lower -ooc=hg18.ooc -q=rna -fine hg18.2bit mrna.fa mrna.rawPsl
    faPolyASizes mrna.fa mrna.polya
    sort -k 10,10 mrna.rawPsl | pslCDnaFilter -minId=0.96 -minCover=0.25 -localNearBest=0.005 -minQSize=20 -minNonRepSize=16 -ignoreNs -bestOverlap -polyASizes=mrna.polya stdin mrna.psl

- add -polyT trimming, -q=mrna to EST alignments !!!!!!!!!!!!

- instead of creating a *.jobs file per job, create one big file and
  have the offset of the record a parameter to the gbBlat
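
  Sketch of the single-file scheme, assuming one job record per line; the
  record contents and function names are hypothetical.  The writer returns
  the byte offset of each record, and a job reads just its own record by
  seeking to that offset.

```python
def write_jobs(path, records):
    """Write one record per line; return the byte offset of each record."""
    offsets = []
    with open(path, "wb") as fh:
        for rec in records:
            offsets.append(fh.tell())
            fh.write(rec.encode("utf-8") + b"\n")
    return offsets

def read_job(path, offset):
    """Read the single record that starts at the given byte offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.readline().rstrip(b"\n").decode("utf-8")
```

  Binary mode keeps the offsets exact byte positions, so they can be passed
  on a command line and used directly with seek.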

- generate error if attempt to load per chromosome tables with
  a large number of `chromosomes'.

- add optimizer script that does analyze tables and sorts tables.

- why does ncbi misc_diff on MGC report polyA being one base less
  than it actually is??

- is too much being lost from xeno alignments by throwing out PSLs with
  overlapping blocks?  Maybe fix psls rather than discard?

- BC050659 misc_diff was not in ra file, thus not loaded.

- can we use less conservative *SizeLoose() function for polyA determination
  in pslCDnaFilter??? (need both sequence and alignment).

- display MGC genes as some kind of hybrid of genePred and psl.

- change refseq to not display validated as dark

- mgcGenes search doesn't appear to work.

- JK: In the global near best are we using the pslCalcMilliBad?  If
  so we may have already reached the limit of its one part per 1000
  precision.

- gbSanity does nothing if there are no alignment files, it should complain.

- remove default alignment categories from genbank.conf
- force occ to be in same directory as cluster genome.
- change mgc/orfeome tracks to only update if there is new data.

- tRNAs seem to be missing from the pipeline, e.g. K00167; it is aligned by
  blat on the web server, but is whacked by the repeat content filter.

- synthetic human mRNAs end up labeled `synthetic' in Xeno tracks.

- perhaps use lower identity and higher coverage criteria for
  ESTs, as they are noisy.  Maybe apply only to spliced ESTs.

- genePredToGtf breaks on dm2 refGene.txt, gets start stop codons wrong
  need to create 5UTR/3UTR features instead of just UTR
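
  Sketch of splitting UTR into 5UTR/3UTR by strand, assuming genePred-style
  zero-based half-open coordinates with exons sorted on the genome.  The
  function name is made up and non-coding entries (cdsStart == cdsEnd) are
  not handled.

```python
def split_utrs(exons, cds_start, cds_end, strand):
    """Return {'5UTR': [...], '3UTR': [...]} as lists of (start, end).

    exons: list of (start, end), zero-based half-open, genomic order.
    The genomic-left UTR is 5' on '+' and 3' on '-'.
    """
    left, right = [], []
    for start, end in exons:
        if start < cds_start:
            left.append((start, min(end, cds_start)))
        if end > cds_end:
            right.append((max(start, cds_end), end))
    if strand == "+":
        return {"5UTR": left, "3UTR": right}
    return {"5UTR": right, "3UTR": left}
```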

- have gbSanity generate an error if nothing checked.

- hg17 NM_001502 - CDS updated, but refgene was not updated.

- detect missing files (ooc, etc) as early as possible.
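
  Sketch of an up-front check that could run before any cluster jobs are
  built.  The function names are hypothetical and the list of required
  files would come from the configuration.

```python
import os

def find_missing(paths):
    """Return the subset of paths that are not existing regular files."""
    return [p for p in paths if not os.path.isfile(p)]

def check_inputs(paths):
    """Raise one error naming every missing file, instead of failing later."""
    missing = find_missing(paths)
    if missing:
        raise FileNotFoundError(
            "missing required files: " + ", ".join(missing))
```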

- add details of stop codon to ORFeome clone details:
    "full-length ORF with stop codon" 
    or "full-length ORF without stop codon"

- need source build/update process that builds both i386 and x86_64

- location information not provided in CCDS description (e.g. strand);

- genePredCheck -db=danRer5 refGene
  Error: refGene:5212: NM_001045301 no exonFrame on CDS exon 0
  checked: 12649 failed: 1
  genePredCheck -db=danRer5 mgcGenes
  Error: mgcGenes:4961: BC117652 no exonFrame on CDS exon 0
  checked: 12662 failed: 1

- RefSeq details pages incorrectly has `CDS:' completeness from 
  comment section, when it is, in fact, mRNA completeness.

- make selection of per chrom or not per chrom tables automatic,
  or just get rid of them.

- regenerate upstream/downstream sequences when refGene is updated.

- mm9 alignment of EST CJ046163 chooses a less optimal, unspliced alignment
  over a spliced alignment.  Make splice sites part of the scoring

- deal with CDS complement() annotations for cDNA: AF116618, AF116676

- pslCDnaFilter: look at NM_012196.1/ CCDS35257.1 -  multiple overlapping
  NM alignments,  none match CCDS


- deal with orfeomes like mm  BC148437 that have two sources.


- filter TSA mRNAs (e.g. EU212998) per Kim, these are assembled ESTs without genomic
  context and not reliable.  in keyword.

- hg18:  AF361221 fusion protein, but only 1/2 kept by the alignment filter.

- dm3:AF181625, dm3:AF067153 trans-spliced genes where part of the transcript is
   discarded by filtering.

- hg18 BC166696 - looks like linker should have been removed.
 misc_feature    4034..4046
                     /note="linker at the 3' end of the ORF that includes stop
                     codon TAG and Sfi I linker"

- hg18:chr15:63,701,323-63,735,652d /  SLC24A1 NM_004727.2
  - genome has 1-base insertion (error); frame rendered is wrong
    due to micro-indel merging.

- should ESTs with Author: HCGP http://www.ludwig.org.br/ORESTES be in gbWarn?

- mRNA BX161433 flagged invitroNorm in gbWarn; BX356367 seems to be from
  same clone and should be flagged

- MGC Clone BC065513.1 hg18 mapped to chr6, plus 2 hap chroms, but RefSeq CDS similarity
  shows only haps

- MGCs mapped only to haplotype chroms, although refseqs are mapped to both hap and ref chroms
   BC021708
   BC003069
   BC125044
   BC013184
   BC012106

- very messy loci: chr22:20,715,572-21,595,082

- pslCDnaFilter can implement pslCDnaGenomeMatch type logic using pslx output of blat,
  then discard sequences.

- NM_016429.2 has 3 non-contiguous insertions in the mRNA near start of CDS.
  This gets the frame confused in the browser display.
  

- APOBEC - 7 gene tandem cluster, some broken cases.

- hg18: EST BG217227 not in gbWarn, but is an Athersys RAGE

- hg19: chr10:135,483,678-135,488,584 DUUX4 and others in weird, tangled tandem dup, maybe
  retro copies

- rn4:chr1:78,970,961-79,025,593 - nasty xeno refseq alignments in tandem dup.

- hg19:chr1:145,103,933-145,103,966 CR624978 VitroGen/Genoscope modified to match
  stop codon in the genome.

- problematic NR alignment filtering due to repeat masking.  Poor alignments
  chosen NR_031679 NR_031644 NR_030317.  

- hg19 NM_001001432 block 3 11 base exon is misaligned, generating
  an invalid intron due to duplication of the exon.  NM_000364 gets it
  right due to being constrained by other exons.
