


How To Upload Gff File To Ucsc Genome Browser

  Frequently Asked Questions: Data File Formats

General formats:
  • Axt format
  • BAM format
  • BED format
  • BED detail format
  • bedGraph format
  • bigBed format
  • bigWig format
  • Chain format
  • GenePred table format
  • GFF format
  • GTF format
  • MAF format
  • Microarray format
  • Net format
  • Personal Genome SNP format
  • PSL format
  • VCF format
  • WIG format
ENCODE-specific formats:
  • ENCODE broadPeak format
  • ENCODE gappedPeak format
  • ENCODE narrowPeak format
  • ENCODE pairedTagAlign format
  • ENCODE peptideMapping format
  • ENCODE RNA elements format
  • ENCODE tagAlign format
Download only formats:
  • .2bit format
  • .fasta format
  • .fastQ format
  • .nib format




  BED format

BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

If your data set is BED-like, but it is very large and you would like to keep it on your own server, you should use the bigBed data format.

The first three required BED fields are:

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
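
The zero-based, end-exclusive convention described above can be sketched in a few lines of Python. This is only an illustration; the function name `bed_interval_bases` is hypothetical and not part of any UCSC tool.

```python
def bed_interval_bases(chrom_start, chrom_end):
    """Return (length, first_base, last_base) for a BED interval.

    BED coordinates are zero-based and half-open: the chromStart base is
    part of the feature, the chromEnd base is not.
    """
    if chrom_end < chrom_start:
        raise ValueError("chromEnd must not be smaller than chromStart")
    length = chrom_end - chrom_start
    return length, chrom_start, chrom_end - 1


# The first 100 bases of a chromosome: chromStart=0, chromEnd=100.
print(bed_interval_bases(0, 100))  # (100, 0, 99)
```

Because the end is exclusive, the feature length is simply chromEnd minus chromStart, with no "+1" correction.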

The 9 additional optional BED fields are:

  1. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  2. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). This table shows the Genome Browser's translation of BED score values into shades of gray:
    shade (lightest to darkest), by score range:
    ≤ 166 | 167-277 | 278-388 | 389-499 | 500-611 | 612-722 | 723-833 | 834-944 | ≥ 945
  3. strand - Defines the strand - either '+' or '-'.
  4. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  5. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  6. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  7. blockCount - The number of blocks (exons) in the BED line.
  8. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  9. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
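
As a rough illustration of how blockCount, blockSizes and blockStarts fit together, the following Python sketch turns the block fields into absolute chromosome intervals. The function name `bed_blocks` is hypothetical, and accepting a trailing comma in the lists is an assumption based on the example data in this section.

```python
def bed_blocks(chrom_start, block_count, block_sizes, block_starts):
    """Convert BED block fields into absolute (start, end) intervals.

    block_sizes and block_starts are the comma-separated strings from the
    BED line; a trailing comma (as in "567,488,") is tolerated.
    """
    sizes = [int(x) for x in block_sizes.split(",") if x]
    starts = [int(x) for x in block_starts.split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockSizes/blockStarts must have blockCount items")
    # blockStarts are relative to chromStart; intervals are half-open.
    return [(chrom_start + s, chrom_start + s + n)
            for s, n in zip(starts, sizes)]


# Fields taken from a line such as:
# chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
print(bed_blocks(1000, 2, "567,488,", "0,3512"))
# [(1000, 1567), (4512, 5000)]
```

In this example the last block ends exactly at chromEnd (5000), which is a useful sanity check when writing BED12 lines by hand.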

Example:
Here's an example of an annotation track that uses a complete BED definition:

    track name=pairedReads description="Clone Paired Reads" useScore=1
    chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
    chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

Example:
This example shows an annotation track that uses the itemRgb attribute to individually color each data line. In this track, the color scheme distinguishes between items named "Pos*" and those named "Neg*". See the usage note in the itemRgb description above for color palette restrictions. NOTE: The track and data lines in this example have been reformatted for documentation purposes. This example can be pasted into the browser without editing.

    browser position chr7:127471196-127495720
    browser hide all
    track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
    chr7    127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
    chr7    127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
    chr7    127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
    chr7    127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
    chr7    127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
    chr7    127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
    chr7    127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
    chr7    127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
    chr7    127480532  127481699  Neg4  0  -  127480532  127481699  0,0,255

Click here to display this track in the Genome Browser.

Example:
It is also possible to color items by strand in a BED track using the colorByStrand attribute in the track line as shown below. For BED tracks, this attribute functions only for custom tracks with 6 to 8 fields (i.e. BED6 through BED8). NOTE: The track and data lines in this example have been reformatted for documentation purposes. This example can be pasted into the browser without editing.

    browser position chr7:127471196-127495720
    browser hide all
    track name="ColorByStrandDemo" description="Color by strand demonstration" visibility=2 colorByStrand="255,0,0 0,0,255"
    chr7    127471196  127472363  Pos1  0  +
    chr7    127472363  127473530  Pos2  0  +
    chr7    127473530  127474697  Pos3  0  +
    chr7    127474697  127475864  Pos4  0  +
    chr7    127475864  127477031  Neg1  0  -
    chr7    127477031  127478198  Neg2  0  -
    chr7    127478198  127479365  Neg3  0  -
    chr7    127479365  127480532  Pos5  0  +
    chr7    127480532  127481699  Neg4  0  -

Click here to display this track in the Genome Browser.


  bigBed format

The bigBed format stores annotation items that can either be simple, or a linked collection of exons, much as BED files do. BigBed files are created initially from BED type files, using the program bedToBigBed. The resulting bigBed files are in an indexed binary format. The main advantage of the bigBed files is that only the portions of the files needed to display a particular region are transferred to UCSC, so for large data sets bigBed is considerably faster than regular BED files. The bigBed file remains on your web accessible server (http, https, or ftp), not on the UCSC server.

Click here for more information on the bigBed format.



  BED detail format

This is an extension of BED format. BED detail uses the first 4 to 12 columns of BED format, plus 2 additional fields that are used to enhance the track details pages. The first additional field is an ID, which can be used in place of the name field for creating links from the details pages. The second additional field is a description of the item, which can be a long description and can consist of HTML, including tables and lists.

Requirements for BED detail custom tracks are: fields must be tab-separated, "type=bedDetail" must be included in the track line, and the name and position fields should uniquely describe items so that the correct ID and description will be displayed on the details pages.

Example:
This example uses the first 4 columns of BED format, but up to 12 may be used. Click here to view this track in the Genome Browser.

track name=HbVar type=bedDetail description="HbVar custom track" db=hg19 visibility=3 url="http://globin.bx.psu.edu/cgi-bin/hbvar/query_vars3?display_format=page&mode=output&id=$$"
chr11	5246919	5246920	Hb_North_York	2619	Hemoglobin variant
chr11	5255660	5255661	HBD c.1 G>A	2659	delta0 thalassemia
chr11	5247945	5247946	Hb Sheffield	2672	Hemoglobin variant
chr11	5255415	5255416	Hb A2-Lyon	2676	Hemoglobin variant
chr11	5248234	5248235	Hb Aix-les-Bains	2677	Hemoglobin variant


  bedGraph format

The bedGraph format allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the WIG format, but unlike the WIG format, data exported in the bedGraph format are preserved in their original state. This can be seen on export using the table browser. For more information about the bedGraph format, please see the bedGraph details page.

If you have a very large data set and you would like to keep it on your own server, you should use the bigWig format.



  PSL format

PSL lines represent alignments, and are typically taken from files generated by BLAT or psLayout. See the BLAT documentation for more details. All of the following fields are required on each data line within a PSL file:

  1. matches - Number of bases that match that aren't repeats
  2. misMatches - Number of bases that don't match
  3. repMatches - Number of bases that match but are part of repeats
  4. nCount - Number of 'N' bases
  5. qNumInsert - Number of inserts in query
  6. qBaseInsert - Number of bases inserted in query
  7. tNumInsert - Number of inserts in target
  8. tBaseInsert - Number of bases inserted in target
  9. strand - '+' or '-' for query strand. For translated alignments, the second '+' or '-' is for the genomic strand
  10. qName - Query sequence name
  11. qSize - Query sequence size
  12. qStart - Alignment start position in query
  13. qEnd - Alignment end position in query
  14. tName - Target sequence name
  15. tSize - Target sequence size
  16. tStart - Alignment start position in target
  17. tEnd - Alignment end position in target
  18. blockCount - Number of blocks in the alignment (a block contains no gaps)
  19. blockSizes - Comma-separated list of sizes of each block
  20. qStarts - Comma-separated list of starting positions of each block in query
  21. tStarts - Comma-separated list of starting positions of each block in target

Example:
Here is an example of an annotation track in PSL format. Each PSL record is shown on a single line, so this example can be pasted into the browser without editing.

    browser position chr22:13073000-13074000
    browser hide all
    track name=fishBlats description="Fish BLAT" visibility=2 useScore=1
    59 9 0 0 1 823 1 96 +- FS_CONTIG_48080_1 1955 171 1062 chr22 47748585 13073589 13073753 2 48,20,  171,1042,  34674832,34674976,
    59 7 0 0 1 55 1 55 +- FS_CONTIG_26780_1 2825 2456 2577 chr22 47748585 13073626 13073747 2 21,45,  2456,2532,  34674838,34674914,
    59 7 0 0 1 55 1 55 -+ FS_CONTIG_26780_1 2825 2455 2676 chr22 47748585 13073727 13073848 2 45,21,  249,349,  13073727,13073827,

Click here to display this track in the Genome Browser.

Be aware that the coordinates for a negative strand in a PSL line are handled in a special way. In the qStart and qEnd fields, the coordinates indicate the position where the query matches from the point of view of the forward strand, even when the match is on the reverse strand. However, in the qStarts list, the coordinates are reversed.

Example:
Here is a 30-mer containing two blocks that align on the minus strand and two blocks that align on the plus strand (this sometimes can happen in response to assembly errors):

    0         1         2         3 tens position in query
    0123456789012345678901234567890 ones position in query
                ++++          +++++ plus strand alignment on query
        --------    ----------      minus strand alignment on query
    0987654321098765432109876543210 ones position in query negative strand coordinates
    3         2         1         0 tens position in query negative strand coordinates

    Plus strand:
         qStart=12
         qEnd=31
         blockSizes=4,5
         qStarts=12,26

    Minus strand:
         qStart=4
         qEnd=26
         blockSizes=10,8
         qStarts=5,19

Essentially, the minus strand blockSizes and qStarts are what you would get if you reverse-complemented the query. However, the qStart and qEnd are not reversed. To convert one to the other:

    Negative-strand-coordinate-qStart = qSize - qEnd   = 31 - 26 =  5
    Negative-strand-coordinate-qEnd   = qSize - qStart = 31 -  4 = 27
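
The conversion above can be written as a small Python helper. This is a minimal sketch; the function name `to_negative_strand` is hypothetical, and the numbers are the ones from the minus-strand example in this section (qSize=31, qStart=4, qEnd=26).

```python
def to_negative_strand(q_size, q_start, q_end):
    """Convert forward-strand qStart/qEnd into negative-strand coordinates.

    The start and end swap roles because the query is read from the
    opposite end after reverse-complementing.
    """
    return q_size - q_end, q_size - q_start


neg_start, neg_end = to_negative_strand(31, 4, 26)
print(neg_start, neg_end)  # 5 27

# Consistency check against the minus-strand block lists of the example:
block_sizes = [10, 8]
q_starts = [5, 19]
assert q_starts[0] == neg_start
assert q_starts[-1] + block_sizes[-1] == neg_end
```

The same function converts in the other direction, since applying it twice returns the original pair.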


  GFF format

GFF (General Feature Format) lines are based on the GFF standard file format. GFF lines have nine required fields that must be tab-separated. If the fields are separated by spaces instead of tabs, the track will not display correctly. For more information on GFF format, refer to http://www.sanger.ac.uk/resource/software/gff/.

If you would like to obtain browser data in GFF (GTF) format, please refer to Genes in gtf or gff format on the Wiki.

Here is a brief description of the GFF fields:

  1. seqname - The name of the sequence. Must be a chromosome or scaffold.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.
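
To make the field layout concrete, here is a minimal Python sketch that splits a GFF line on tabs and converts its one-based, inclusive coordinates into the zero-based, end-exclusive convention used by BED. The names `parse_gff_line` and `gff_to_bed_coords` are hypothetical helpers, not part of any UCSC tool.

```python
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Split one GFF data line into a dict keyed by field name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Typically means tabs were replaced by spaces during copy/paste.
        raise ValueError(
            f"expected 9 tab-separated fields, found {len(fields)}")
    record = dict(zip(GFF_FIELDS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def gff_to_bed_coords(record):
    """GFF is 1-based with an inclusive end; BED is 0-based, end-exclusive."""
    return record["start"] - 1, record["end"]


line = "chr22\tTeleGene\tenhancer\t10000000\t10001000\t500\t+\t.\ttouch1"
print(gff_to_bed_coords(parse_gff_line(line)))  # (9999999, 10001000)
```

Only the start shifts by one; the inclusive GFF end and the exclusive BED end are the same number.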

Example:
Here's an example of a GFF-based track. This example can be pasted into the browser without editing. NOTE: Paste operations on some operating systems will replace tabs with spaces, which will result in an error when the GFF track is uploaded. You can circumvent this problem by pasting the URL of this example (http://genome.ucsc.edu/goldenPath/help/regulatory.txt) instead of the text itself into the custom annotation track text box. If you encounter an error when loading a GFF track, check that the data lines contain tabs rather than spaces.

browser position chr22:10000000-10025000
browser hide all
track name=regulatory description="TeleGene(tm) Regulatory Regions" visibility=2
chr22	TeleGene	enhancer	10000000	10001000	500	+	.	touch1
chr22	TeleGene	promoter	10010000	10010100	900	+	.	touch1
chr22	TeleGene	promoter	10020000	10025000	800	-	.	touch2

Click here to display this track in the Genome Browser.


  GTF format

GTF (Gene Transfer Format) is a refinement to GFF that tightens the specification. The first eight GTF fields are the same as GFF. The group field has been expanded into a list of attributes. Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space.

The attribute list must begin with the two mandatory attributes:

  • gene_id value - A globally unique identifier for the genomic source of the sequence.
  • transcript_id value - A globally unique identifier for the predicted transcript.

Example:
Here is an example of the ninth field in a GTF data line:

gene_id "Em:U62317.C22.6.mRNA"; transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1
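
A ninth field of this shape can be split into type/value pairs with a few lines of Python. This is a simplified sketch under the assumption that values contain no semi-colons; the function name `parse_gtf_attributes` is hypothetical.

```python
def parse_gtf_attributes(field):
    """Parse the GTF attribute field into a dict of type -> value.

    Simplification: splits on ';', so a semi-colon inside a quoted value
    would not be handled correctly.
    """
    attributes = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the final ';'
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


ninth = ('gene_id "Em:U62317.C22.6.mRNA"; '
         'transcript_id "Em:U62317.C22.6.mRNA"; exon_number 1')
attrs = parse_gtf_attributes(ninth)
print(attrs["transcript_id"])  # Em:U62317.C22.6.mRNA
print(attrs["exon_number"])    # 1
```

All values are returned as strings; converting exon_number to an integer is left to the caller.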

The Genome Browser groups together GTF lines that have the same transcript_id value. It only looks at features of type exon and CDS.

For more information on this format, see http://mblab.wustl.edu/GTF2.html.

If you would like to obtain browser data in GTF format, please refer to Genes in gtf or gff format on the Wiki.



  MAF format

The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read. This format stores multiple alignments at the DNA level between entire genomes. Previously used formats are suitable for multiple alignments of single proteins or regions of DNA without rearrangements, but would require considerable extension to cope with genomic issues such as forward and reverse strand directions, multiple pieces to the alignment, and so forth.

General Structure

The .maf format is line-oriented. Each multiple alignment ends with a blank line. Each sequence in an alignment is on a single line, which can get quite long, but there is no length limit. Words in a line are delimited by any white space. Lines starting with # are considered to be comments. Lines starting with ## can be ignored by most programs, but contain meta-information of one form or another.

The file is divided into paragraphs that terminate in a blank line. Within a paragraph, the first word of a line indicates its type. Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment. Some MAF files may contain other optional line types:

  • an "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line
  • an "e" line containing information about the size of the gap between the alignments that span the current block
  • a "q" line indicating the quality of each aligned base for the species

Parsers may ignore any other types of paragraphs and other types of lines within an alignment paragraph.

Custom Tracks

The first line of a custom MAF track must be a "track" line that contains a name=value pair specifying the track name. Here is an example of a minimal track line:

    track name=sample

The following variables can be specified in the track line of a custom MAF:
  • name=sample - Required. Names the track
  • description="Sample Track" - Optional. Gives a long name for the track
  • frames=multiz28wayFrames - Optional. Tells the browser which table to grab the gene frames from. This is commonly associated with an N-way alignment where the name ends in the string "Frames"
  • mafDot=on - Optional. Use dots instead of bases when bases are identical
  • visibility=dense|pack|full - Optional. Sets the default visibility mode for this track.
  • speciesOrder="hg18 panTro2" - Optional. White-space separated list specifying the order in which the sequences in the maf should be displayed.
The second line of a custom MAF track must be a header line as described below.

Header Line

The first line of a .maf file begins with ##maf. This word is followed by white-space-separated variable=value pairs. There should be no white space surrounding the "=".

    ##maf version=1 scoring=tba.v8

The currently defined variables are:
  • version - Required. Currently set to 1.
  • scoring - Optional. A name for the scoring scheme used for the alignments. The current scoring schemes are:
    • bit - roughly corresponds to blast bit values (roughly 2 points per aligning base minus penalties for mismatches and inserts).
    • blastz - blastz scoring scheme -- roughly 100 points per aligning base.
    • probability - some score normalized between 0 and 1.
  • program - Optional. Name of the program generating the alignment.
Undefined variables are ignored by the parser.
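
A header line of this form can be read with a short Python sketch such as the one below. The function name `parse_maf_header` is hypothetical; the sketch keeps every variable=value pair and leaves it to the caller to ignore the ones it does not know.

```python
def parse_maf_header(line):
    """Parse a '##maf' header line into a dict of variable -> value."""
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("header line must begin with ##maf")
    variables = {}
    for word in words[1:]:
        name, sep, value = word.partition("=")
        if not sep:
            # No '=' found; white space around '=' also ends up here.
            raise ValueError(f"malformed variable=value pair: {word!r}")
        variables[name] = value
    if "version" not in variables:
        raise ValueError("the version variable is required")
    return variables


print(parse_maf_header("##maf version=1 scoring=tba.v8"))
# {'version': '1', 'scoring': 'tba.v8'}
```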

Alignments Parameter Line

The second line displays the parameters that were used to run the alignment program.

    # tba.v8 (((human chimp) baboon) (mouse rat))

Alignment Block Lines (lines starting with 'a' -- parameters for a new alignment block)

    a score=23262.0

Each alignment begins with an 'a' line that sets variables for the entire alignment block. The 'a' is followed by name=value pairs. There are no required name=value pairs. The currently defined variables are:
  • score -- Optional. Floating point score. If this is present, it is good practice to also define scoring in the first line.
  • pass -- Optional. Positive integer value. For programs that do multiple pass alignments such as blastz, this shows which pass this alignment came from. Typically, pass 1 will find the strongest alignments genome-wide, and pass 2 will find weaker alignments between two first-pass alignments.

Lines starting with 's' -- a sequence within an alignment block

    s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
    s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
    s baboon         249182 13 +   4622798 gcagctgaaaaca
    s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA

The 's' lines together with the 'a' lines define a multiple alignment. The 's' lines have the following fields which are defined by position rather than name=value pairs.
  • src -- The name of one of the source sequences for the alignment. For sequences that are resident in a browser assembly, the form 'database.chromosome' allows automatic creation of links to other assemblies. Non-browser sequences are typically referenced by the species name alone.
  • start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is '-' then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).
  • size -- The size of the aligning region in the source sequence. This number is equal to the number of non-dash characters in the alignment text field below.
  • strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.
  • srcSize -- The size of the entire source sequence, not just the parts involved in the alignment.
  • text -- The nucleotides (or amino acids) in the alignment and any insertions (dashes) as well.
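
The positional fields of an 's' line, and the meaning of start on the '-' strand, can be sketched in Python as follows. The names `MafSeq`, `parse_s_line` and `forward_interval` are hypothetical; the minus-strand formula follows directly from start being counted on the reverse-complemented sequence.

```python
from typing import NamedTuple


class MafSeq(NamedTuple):
    src: str
    start: int
    size: int
    strand: str
    src_size: int
    text: str


def parse_s_line(line):
    """Parse one MAF 's' line (fields are white-space delimited)."""
    words = line.split()
    if len(words) != 7 or words[0] != "s":
        raise ValueError("not a valid 's' line")
    _, src, start, size, strand, src_size, text = words
    rec = MafSeq(src, int(start), int(size), strand, int(src_size), text)
    # size must equal the number of non-dash characters in the text.
    if rec.size != len(text) - text.count("-"):
        raise ValueError("size does not match the alignment text")
    return rec


def forward_interval(rec):
    """Return the zero-based, half-open interval on the forward strand."""
    if rec.strand == "+":
        return rec.start, rec.start + rec.size
    fwd_start = rec.src_size - rec.start - rec.size
    return fwd_start, fwd_start + rec.size


rec = parse_s_line("s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca")
print(forward_interval(rec))  # (27707221, 27707234)
```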

Lines starting with 'i' -- information about what's happening before and after this block in the aligning species

    s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
    s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
    i panTro1.chr6 N 0 C 0
    s baboon         249182 13 +   4622798 gcagctgaaaaca
    i baboon       I 234 n 19

The 'i' lines contain information about the context of the sequence lines immediately preceding them. The following fields are defined by position rather than name=value pairs:

  • src -- The name of the source sequence for the alignment. Should be the same as the 's' line immediately above this line.
  • leftStatus -- A character that specifies the relationship between the sequence in this block and the sequence that appears in the previous block.
  • leftCount -- Usually the number of bases in the aligning species between the start of this alignment and the end of the previous one.
  • rightStatus -- A character that specifies the relationship between the sequence in this block and the sequence that appears in the subsequent block.
  • rightCount -- Usually the number of bases in the aligning species between the end of this alignment and the start of the next one.

The status characters can be one of the following values:

  • C -- the sequence before or after is contiguous with this block.
  • I -- there are bases between the bases in this block and the one before or after it.
  • N -- this is the first sequence from this src chrom or scaffold.
  • n -- this is the first sequence from this src chrom or scaffold but it is bridged by another alignment from a different chrom or scaffold.
  • M -- there is missing data before or after this block (Ns in the sequence).
  • T -- the sequence in this block has been used before in a previous block (likely a tandem duplication)

Lines starting with 'e' -- information about empty parts of the alignment block

    s hg16.chr7    27707221 13 + 158545518 gcagctgaaaaca
    e mm4.chr6     53310102 13 + 151104725 I

The 'e' lines indicate that there isn't aligning DNA for a species but that the current block is bridged by a chain that connects blocks before and after this block. The following fields are defined by position rather than name=value pairs.

  • src -- The name of one of the source sequences for the alignment.
  • start -- The start of the non-aligning region in the source sequence. This is a zero-based number. If the strand field is '-' then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).
  • size -- The size in base pairs of the non-aligning region in the source sequence.
  • strand -- Either '+' or '-'. If '-', then the alignment is to the reverse-complemented source.
  • srcSize -- The size of the entire source sequence, not just the parts involved in the alignment.
  • status -- A character that specifies the relationship between the non-aligning sequence in this block and the sequence that appears in the previous and subsequent blocks.

The status character can be one of the following values:

  • C -- the sequence before and after is contiguous, implying that this region was either deleted in the source or inserted in the reference sequence. The browser draws a single line or a '-' in base mode in these blocks.
  • I -- there are non-aligning bases in the source species between chained alignment blocks before and after this block. The browser shows a double line or '=' in base mode.
  • M -- there are non-aligning bases in the source and more than 90% of them are Ns in the source. The browser shows a pale yellow bar.
  • n -- there are non-aligning bases in the source and the next aligning block starts in a new chromosome or scaffold that is bridged by a chain between still other blocks. The browser shows either a single line or a double line based on how many bases are in the gap between the bridging alignments.

Lines starting with 'q' -- information about the quality of each aligned base for the species

    s hg18.chr1                  32741 26 + 247249719 TTTTTGAAAAACAAACAACAAGTTGG
    s panTro2.chrUn            9697231 26 +  58616431 TTTTTGAAAAACAAACAACAAGTTGG
    q panTro2.chrUn                                   99999999999999999999999999
    s dasNov1.scaffold_179265     1474  7 +      4584 TT----------AAGCA---------
    q dasNov1.scaffold_179265                         99----------32239---------

The 'q' lines contain a compressed version of the actual raw quality data, representing the quality of each aligned base for the species with a single character of 0-9 or F. The following fields are defined by position rather than name=value pairs:

  • src -- The name of the source sequence for the alignment. Should be the same as the 's' line immediately preceding this line.
  • value -- A MAF quality value corresponding to the aligning nucleic acid in the preceding 's' line. Insertions (dashes) in the preceding 's' line are represented by dashes in the 'q' line as well. The quality value can be 'F' (finished sequence) or a number derived from the actual quality scores (which range from 0-97) or the manually assigned score of 98. These numeric values are calculated as:
    MAF quality value = min( floor(actual quality value/5), 9 )
    This results in the following mapping:

    MAF quality value   Raw quality score range   Quality level
    0-8                 0-44                      Low
    9                   45-97                     High
    0                   98                        Manually assigned
    F                   99                        Finished
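
The mapping above can be expressed as a small Python function. This is a sketch of the table as written here, with the two special raw scores (98 and 99) handled before the general formula; the function name `maf_quality_char` is hypothetical.

```python
def maf_quality_char(raw):
    """Map a raw quality score (0-99) to its single MAF quality character."""
    if raw == 99:
        return "F"   # finished sequence
    if raw == 98:
        return "0"   # manually assigned score
    if not 0 <= raw <= 97:
        raise ValueError("raw quality score must be between 0 and 99")
    # min(floor(raw / 5), 9); integer division is the floor for raw >= 0.
    return str(min(raw // 5, 9))


print([maf_quality_char(q) for q in (0, 44, 45, 97, 98, 99)])
# ['0', '8', '9', '9', '0', 'F']
```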

A Simple Example

Here is a simple example of three alignment blocks derived from five starting sequences. The first track line is necessary for custom tracks, but should be removed otherwise. Repeats are shown as lowercase, and each block may have a subset of the input sequences. All sequence columns and rows must contain at least one nucleotide (no columns or rows that contain only insertions).

    track name=euArc visibility=pack
    ##maf version=1 scoring=tba.v8
    # tba.v8 (((human chimp) baboon) (mouse rat))

    a score=23262.0
    s hg18.chr7    27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
    s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
    s baboon         116834 38 +   4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
    s mm4.chr6     53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
    s rn3.chr4     81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG

    a score=5062.0
    s hg18.chr7    27699739 6 + 158545518 TAAAGA
    s panTro1.chr6 28862317 6 + 161576975 TAAAGA
    s baboon         241163 6 +   4622798 TAAAGA
    s mm4.chr6     53303881 6 + 151104725 TAAAGA
    s rn3.chr4     81444246 6 + 187371129 taagga

    a score=6636.0
    s hg18.chr7    27707221 13 + 158545518 gcagctgaaaaca
    s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
    s baboon         249182 13 +   4622798 gcagctgaaaaca
    s mm4.chr6     53310102 13 + 151104725 ACAGCTGAAAATA


  BAM format

BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments. Many next-generation sequencing and analysis tools work with SAM/BAM. For custom track display, the main advantage of indexed BAM over PSL and other human-readable alignment formats is that only the portions of the files needed to display a particular region are transferred to UCSC. This makes it possible to display alignments from files that are so large that the connection to UCSC would time out when attempting to upload the whole file to UCSC. Both the BAM file and its associated index file remain on your web-accessible server (http or ftp), not on the UCSC server. UCSC temporarily caches the accessed portions of the files to speed up interactive display.

Click here for more information about BAM custom tracks.
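Because the BAM file stays on your own server, the custom track itself is just a one-line pointer to it. The sketch below builds such a track line in Python; the helper name `bam_track_line` and the URL are hypothetical, and the sketch assumes the index file sits next to the BAM file under the same URL with ".bai" appended.

```python
def bam_track_line(name, big_data_url):
    """Build a custom track line that points the browser at a remote BAM file."""
    return f'track type=bam name="{name}" bigDataUrl={big_data_url}'


# Hypothetical URL; replace with the location of your own sorted, indexed BAM.
line = bam_track_line("My BAM", "http://myorg.example/mylab/my.sorted.bam")
print(line)
```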



  WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format. Click here for more information.



  bigWig format

The bigWig format is for display of dense, continuous data that will be displayed in the Genome Browser as a graph. BigWig files are created initially from wiggle (wig) type files, using the program wigToBigWig. Alternatively, bigWig files can be created from bedGraph files, using the program bedGraphToBigWig. In either case, the resulting bigWig files are in an indexed binary format. The main advantage of the bigWig files is that only the portions of the files needed to display a particular region are transferred to UCSC, so for large data sets bigWig is considerably faster than regular wiggle files. The bigWig file remains on your web accessible server (http, https, or ftp), not on the UCSC server. Only the portion that is needed for the chromosomal position you are currently viewing is locally cached as a "sparse file".

Click here for more information on the bigWig format.
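The two conversion routes can be sketched as a small Python helper that assembles the command line for either program. Both tools take an input file, a chromosome sizes file and an output file; choosing the tool from the file extension is purely an assumption of this sketch, and the file names are placeholders.

```python
def bigwig_command(src, chrom_sizes, out_bw):
    """Return the argv list for converting a wig or bedGraph file to bigWig."""
    # Assumption: bedGraph inputs are recognised by their file extension.
    is_bedgraph = src.lower().endswith((".bedgraph", ".bg"))
    tool = "bedGraphToBigWig" if is_bedgraph else "wigToBigWig"
    return [tool, src, chrom_sizes, out_bw]


cmd = bigwig_command("signal.wig", "hg19.chrom.sizes", "signal.bw")
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where the UCSC utilities are installed.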



  Microarray format

The datasets for the built-in microarray tracks in the Genome Browser are stored in BED15 format, an extension of BED format that includes three additional fields: expCount, expIds, and expScores. To display correctly in the Genome Browser, microarray tracks require the setting of several attributes in the trackDb file associated with the track's genome assembly. Each microarray track set must also have an associated microarrayGroups.ra configuration file that contains additional information about the data in each of the arrays.

User-created microarray custom tracks are similar in format to BED custom tracks with the addition of three required track line parameters in the header--expNames, expScale, and expStep--that mimic the trackDb and microarrayGroups.ra settings of built-in microarray tracks.

For a complete description of the microarray track format and an explanation of how to construct a microarray custom track, see the Genome Browser Wiki.



  .2bit format

A .2bit file stores multiple DNA sequences (up to 4 Gb total) in a compact randomly-accessible format. The file contains masking information as well as the DNA itself.

The file begins with a 16-byte header containing the following fields:

  • signature - the number 0x1A412743 in the architecture of the machine that created the file
  • version - zero for now. Readers should abort if they see a version number higher than 0.
  • sequenceCount - the number of sequences in the file.
  • reserved - always zero for now

All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte-swap the signature and check if the swapped version matches. If so, all multiple-byte entities in the file will have to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
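The header check described above can be sketched in Python with the standard `struct` module. Instead of byte-swapping the signature explicitly, this sketch unpacks the header in both byte orders and keeps the one whose signature matches, which has the same effect. The function name `read_twobit_header` is an illustrative choice, not a UCSC API.

```python
import struct

TWOBIT_SIGNATURE = 0x1A412743


def read_twobit_header(data):
    """Return (byte_order, version, sequence_count) from the 16-byte header.

    byte_order is "<" (little-endian) or ">" (big-endian) and should be
    reused for every other multi-byte field in the file.
    """
    if len(data) < 16:
        raise ValueError("header is shorter than 16 bytes")
    for byte_order in ("<", ">"):
        signature, version, seq_count, _reserved = struct.unpack(
            byte_order + "4I", data[:16]
        )
        if signature == TWOBIT_SIGNATURE:
            if version != 0:
                raise ValueError(f"unsupported .2bit version {version}")
            return byte_order, version, seq_count
    raise ValueError("signature does not match: not a .2bit file")


# A synthetic big-endian header describing a file with 3 sequences.
header = struct.pack(">4I", TWOBIT_SIGNATURE, 0, 3, 0)
print(read_twobit_header(header))
```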

The header is followed by a file index, which contains one entry for each sequence. Each index entry contains three fields:

  • nameSize - a byte containing the length of the name field
  • name - the sequence name itself, of variable length depending on nameSize
  • offset - the 32-bit offset of the sequence data relative to the start of the file
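Reading the index is a simple loop over these three fields. The Python sketch below assumes the index starts right after the 16-byte header, that sequence names are ASCII, and that the caller already knows the byte order and sequence count from the header; `read_twobit_index` is a name made up for this example.

```python
import struct


def read_twobit_index(data, seq_count, byte_order="<", pos=16):
    """Return {sequence name: file offset} for seq_count index entries."""
    index = {}
    for _ in range(seq_count):
        name_size = data[pos]                      # nameSize: one byte
        pos += 1
        name = data[pos:pos + name_size].decode("ascii")
        pos += name_size
        (offset,) = struct.unpack_from(byte_order + "I", data, pos)
        pos += 4
        index[name] = offset
    return index


# Synthetic example: a blank header followed by two index entries.
blob = (
    b"\x00" * 16
    + bytes([4]) + b"chr1" + struct.pack("<I", 100)
    + bytes([4]) + b"chrM" + struct.pack("<I", 200)
)
print(read_twobit_index(blob, 2))
```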

The index is followed by the sequence records, which contain nine fields:

  • dnaSize - number of bases of DNA in the sequence
  • nBlockCount - the number of blocks of Ns in the file (representing unknown sequence)
  • nBlockStarts - an array of length nBlockCount of 32 bit integers indicating the starting position of a block of Ns
  • nBlockSizes - an array of length nBlockCount of 32 bit integers indicating the length of a block of Ns
  • maskBlockCount - the number of masked (lower-case) blocks
  • maskBlockStarts - an array of length maskBlockCount of 32 bit integers indicating the starting position of a masked block
  • maskBlockSizes - an array of length maskBlockCount of 32 bit integers indicating the length of a masked block
  • reserved - always zero for now
  • packedDna - the DNA packed to two bits per base, represented as so: T - 00, C - 01, A - 10, G - 11. The first base is in the most significant 2 bits of each byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011.
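The packedDna encoding can be illustrated with a short Python sketch that packs and unpacks four bases per byte. It deliberately ignores the N blocks and mask blocks, and it pads a partial final byte with zero bits, which is an assumption of the sketch. The function names are illustrative only.

```python
BASES = "TCAG"  # index in this string is the 2-bit code: T=0, C=1, A=2, G=3


def pack_dna(seq):
    """Pack a string of T/C/A/G into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASES.index(base)
        # Left-align a short final chunk so the first base stays in the
        # most significant 2 bits (padding bits are zero).
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_dna(packed, dna_size):
    """Unpack dna_size bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:dna_size])


packed = pack_dna("TCAG")
print(format(packed[0], "08b"))
```

In a real reader, dnaSize tells the decoder how many bases of the last byte are meaningful, which is why `unpack_dna` takes the size as an argument.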

For a complete definition of all fields in the twoBit format, see this description in the source code.



  .nib format

The .nib format pre-dates the .2bit format and is less compact. It describes a DNA sequence by packing two bases into each byte. Each .nib file contains only a single sequence. The file begins with a 32-bit signature that is 0x6BE93D3A in the architecture of the machine that created the file (or possibly a byte-swapped version of the same number on another machine). This is followed by a 32-bit number in the same format that describes the number of bases in the file. Next, the bases themselves are listed, packed two bases to the byte. The first base is packed in the high-order 4 bits (nibble); the second base is packed in the low-order 4 bits:

byte = (base1<<4) + base2

The numerical representations for the bases are:

  • 0 - T
  • 1 - C
  • 2 - A
  • 3 - G
  • 4 - N (unknown)

The most significant bit in a nibble is set if the base is masked.
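The nibble packing can be sketched in Python as follows. The sketch covers only the base bytes, not the signature or base count, and it represents masked bases as lower-case letters, which is a convention assumed here for illustration. All function names are made up for the example.

```python
NIB_CODES = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
NIB_BASES = "TCAGN"
MASK_BIT = 0x8  # most significant bit of a 4-bit nibble


def nib_nibble(base):
    """Return the 4-bit code for one base; lower case means masked."""
    code = NIB_CODES[base.upper()]
    return code | MASK_BIT if base.islower() else code


def pack_nib_bases(seq):
    """Pack bases two per byte: byte = (base1 << 4) + base2."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        hi = nib_nibble(seq[i])
        # An odd-length sequence leaves the last low nibble as zero.
        lo = nib_nibble(seq[i + 1]) if i + 1 < len(seq) else 0
        out.append((hi << 4) + lo)
    return bytes(out)


def unpack_nib_bases(packed, base_count):
    """Unpack base_count bases, lower-casing the masked ones."""
    out = []
    for byte in packed:
        for nibble in (byte >> 4, byte & 0xF):
            base = NIB_BASES[nibble & 0x7]
            out.append(base.lower() if nibble & MASK_BIT else base)
    return "".join(out[:base_count])


print(pack_nib_bases("TCAG").hex())
```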



  GenePred table format

genePred is a table format commonly used for gene prediction tracks in the Genome Browser. Variations of the genePred format are listed below.

If you would like to obtain browser data in GFF (GTF) format, please refer to Genes in gtf or gff format on the Wiki.

Gene Predictions

The following definition is used for gene prediction tables. In alternative-splicing situations, each transcript has a row in this table.

table genePred
"A gene prediction."
    (
    string  name;               "Name of gene"
    string  chrom;              "Chromosome name"
    char[1] strand;             "+ or - for strand"
    uint    txStart;            "Transcription start position"
    uint    txEnd;              "Transcription end position"
    uint    cdsStart;           "Coding region start"
    uint    cdsEnd;             "Coding region end"
    uint    exonCount;          "Number of exons"
    uint[exonCount] exonStarts; "Exon start positions"
    uint[exonCount] exonEnds;   "Exon end positions"
    )
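To make the field layout concrete, here is a minimal Python sketch that reads one genePred row into a dataclass. It assumes tab-separated columns and comma-separated exon lists that may carry a trailing comma; the example row is invented for illustration and does not describe a real gene.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenePred:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]


def _int_list(field):
    # "100,500," -> [100, 500]; empty pieces from a trailing comma are dropped.
    return [int(x) for x in field.split(",") if x]


def parse_gene_pred(line):
    f = line.rstrip("\n").split("\t")
    rec = GenePred(
        f[0], f[1], f[2],
        int(f[3]), int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        _int_list(f[8]), _int_list(f[9]),
    )
    if not (len(rec.exon_starts) == len(rec.exon_ends) == rec.exon_count):
        raise ValueError("exonCount does not match the exon lists")
    return rec


# Hypothetical row, not taken from any real table.
row = "geneA\tchr1\t+\t100\t900\t150\t800\t2\t100,500,\t300,900,"
rec = parse_gene_pred(row)
exon_lengths = [e - s for s, e in zip(rec.exon_starts, rec.exon_ends)]
print(exon_lengths)
```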

Gene Predictions (Extended)

The following definition is used for extended gene prediction tables. In alternative-splicing situations, each transcript has a row in this table. The refGene table is an example of the genePredExt format.

table genePredExt
"A gene prediction with some additional info."
    (
    string name;                "Name of gene (usually transcript_id from GTF)"
    string chrom;               "Chromosome name"
    char[1] strand;             "+ or - for strand"
    uint txStart;               "Transcription start position"
    uint txEnd;                 "Transcription end position"
    uint cdsStart;              "Coding region start"
    uint cdsEnd;                "Coding region end"
    uint exonCount;             "Number of exons"
    uint[exonCount] exonStarts; "Exon start positions"
    uint[exonCount] exonEnds;   "Exon end positions"
    uint id;                    "Unique identifier"
    string name2;               "Alternate name (e.g. gene_id from GTF)"
    string cdsStartStat;        "enum('none','unk','incmpl','cmpl')"
    string cdsEndStat;          "enum('none','unk','incmpl','cmpl')"
    lstring exonFrames;         "Exon frame offsets {0,1,2}"
    )

Gene Predictions and RefSeq Genes with Gene Names

A version of genePred that associates the gene name with the gene prediction information. In alternative-splicing situations, each transcript has a row in this table.

table refFlat
"A gene prediction with additional geneName field."
    (
    string  geneName;           "Name of gene as it appears in Genome Browser."
    string  name;               "Name of gene"
    string  chrom;              "Chromosome name"
    char[1] strand;             "+ or - for strand"
    uint    txStart;            "Transcription start position"
    uint    txEnd;              "Transcription end position"
    uint    cdsStart;           "Coding region start"
    uint    cdsEnd;             "Coding region end"
    uint    exonCount;          "Number of exons"
    uint[exonCount] exonStarts; "Exon start positions"
    uint[exonCount] exonEnds;   "Exon end positions"
    )
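Since refFlat is the ten genePred columns with geneName in front, a refFlat row can be reduced to a plain genePred row by splitting off the first column. The Python sketch below assumes tab-separated input; the function name and the example row are hypothetical.

```python
def ref_flat_to_gene_pred(line):
    """Split a refFlat row into (geneName, genePred row)."""
    gene_name, *gene_pred_fields = line.rstrip("\n").split("\t")
    if len(gene_pred_fields) != 10:
        raise ValueError("expected geneName followed by 10 genePred columns")
    return gene_name, "\t".join(gene_pred_fields)


# Hypothetical row: geneName "GENE1", transcript name "tx1".
row = "GENE1\ttx1\tchr1\t+\t100\t900\t150\t800\t2\t100,500,\t300,900,"
gene_name, gene_pred_row = ref_flat_to_gene_pred(row)
print(gene_name, gene_pred_row.split("\t")[0])
```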


  Personal Genome SNP format

This format is for displaying SNPs from personal genomes. It is the same as is used for the Genome Variants and Population Variants tracks.

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - The allele or alleles, consisting of one or more A, C, T, or G, optionally followed by one or more '/' and another allele (there can be more than two alleles). A '-' can be used in place of a base to denote an insertion or deletion; if the position given is zero bases wide, it is an insertion. The alleles are expected to be for the plus strand.
  5. alleleCount - The number of alleles listed in the name field.
  6. alleleFreq - A comma-separated list of the frequency of each allele, given in the same order as the name field. If unknown, a list of zeroes (matching the alleleCount) should be used.
  7. alleleScores - A comma-separated list of the quality score of each allele, given in the same order as the name field. If unknown, a list of zeroes (matching the alleleCount) should be used.
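The seven fields and the insertion/deletion rule can be sketched in Python as below. The sketch assumes tab-separated columns, checks that alleleCount agrees with the three lists, and labels a row with '-' as an insertion when chromStart equals chromEnd and as a deletion otherwise. The function name and the returned dictionary keys are choices made for this example; the two sample rows come from the example track for this format.

```python
def parse_pgsnp(line):
    """Parse one Personal Genome SNP row into a dictionary."""
    chrom, start, end, name, count, freqs, scores = line.rstrip("\n").split("\t")
    start, end, count = int(start), int(end), int(count)
    alleles = name.split("/")
    freq_list = [float(x) for x in freqs.split(",")]
    score_list = [float(x) for x in scores.split(",")]
    if not (len(alleles) == count == len(freq_list) == len(score_list)):
        raise ValueError("alleleCount does not match the allele lists")
    if "-" in alleles:
        # Zero-width position (chromStart == chromEnd) marks an insertion.
        kind = "insertion" if start == end else "deletion"
    else:
        kind = "substitution"
    return {
        "chrom": chrom, "start": start, "end": end, "width": end - start,
        "alleles": alleles, "freqs": freq_list, "scores": score_list,
        "kind": kind,
    }


ins = parse_pgsnp("chr21\t31812035\t31812035\t-/CGG\t2\t20,80\t0,0")
dele = parse_pgsnp("chr21\t31812088\t31812093\t-/CTCGG\t2\t30,70\t0,0")
print(ins["kind"], dele["kind"])
```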

In the Genome Browser, when viewing the forward strand of the reference genome (the normal case), the displayed alleles are relative to the forward strand. When viewing the reverse strand of the reference genome (via the "<--" or "reverse" button), the displayed alleles are reverse-complemented to match the reverse strand. If the allele frequencies are given, the coloring of the box will reflect the frequency for each allele.

The details pages for this track type will automatically compute amino acid changes for coding SNPs as well as give a chart of amino acid properties if there is a non-synonymous change. (The Sift and PolyPhen predictions that are in some of the Genome Variants subtracks are not available.)

Example:
Here is an example of an annotation track in Personal Genome SNP format. The first SNP using a '-' is an insertion; the second is a deletion. The last 3 SNPs are in a coding region.

track type=pgSnp visibility=3 db=hg19 name="pgSnp" description="Personal Genome SNP example"
browser position chr21:31811924-31812937
chr21	31812007	31812008	T/G	2	21,70	90,70
chr21	31812031	31812032	T/G/A	3	9,60,7	80,80,30
chr21	31812035	31812035	-/CGG	2	20,80	0,0
chr21	31812088	31812093	-/CTCGG	2	30,70	0,0
chr21	31812277	31812278	T	1	15	90
chr21	31812771	31812772	A	1	36	80
chr21	31812827	31812828	A/T	2	15,5	0,0
chr21	31812879	31812880	C	1	0	0


Source: http://seabass.mpipz.de/FAQ/FAQformat.html

Posted by: houstonallond.blogspot.com
