Reading GFF files
=================

The biojava3-genome library leverages the sequence relationships in biojava3-core to read gtf, gff2 and gff3 files and
to write gff3 files. The gtf, gff2 and gff3 file formats are well defined, but what gets written in the file is very
flexible. We currently support reading gff files generated by the open source gene prediction applications
GeneID, GeneMark and GlimmerHMM. Each prediction algorithm uses a different ontology to describe coding sequences,
exons, and start or stop codons, which makes it difficult to write a general purpose gff parser that can create
biologically meaningful objects. If the application is simply loading a gff file and drawing a colored glyph, then you
don't need to worry about the ontology used. When gene models are needed, it is easier to support the popular gene
prediction algorithms by writing a parser that is aware of each gene prediction application's ontology.
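
To make the "well defined format, flexible content" point concrete, here is a minimal sketch of splitting one gff3 line into its nine tab-separated columns using only the JDK. The class `GffLineSketch` and the sample line are hypothetical and are not part of BioJava; the sketch does no URL-unescaping and ignores comment and directive lines. Note that the `type` column and the attribute keys are exactly the parts whose vocabulary differs between prediction programs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; it is not part of BioJava.
final class GffLineSketch {
    final String seqname;
    final String source;
    final String type;      // feature name, e.g. "CDS" or "exon"; varies by program
    final int start;        // 1-based, inclusive
    final int end;          // 1-based, inclusive
    final char strand;      // '+', '-', '.' or '?'
    final Map<String, String> attributes;

    private GffLineSketch(String seqname, String source, String type,
                          int start, int end, char strand,
                          Map<String, String> attributes) {
        this.seqname = seqname;
        this.source = source;
        this.type = type;
        this.start = start;
        this.end = end;
        this.strand = strand;
        this.attributes = attributes;
    }

    // Parses a single gff3 feature line (columns 6 "score" and 8 "phase" are skipped here).
    static GffLineSketch parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("expected 9 columns, got " + cols.length);
        }
        if (cols[6].length() != 1) {
            throw new IllegalArgumentException("strand must be a single character");
        }
        Map<String, String> attrs = new LinkedHashMap<>();
        for (String pair : cols[8].split(";")) {
            String p = pair.trim();
            int eq = p.indexOf('=');
            if (p.isEmpty() || eq < 0) {
                continue; // skip empty or malformed key=value pairs
            }
            attrs.put(p.substring(0, eq), p.substring(eq + 1));
        }
        return new GffLineSketch(cols[0], cols[1], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]),
                cols[6].charAt(0), attrs);
    }

    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        GffLineSketch f = parse("scaffold1\tGeneMark.hmm\tCDS\t100\t250\t.\t+\t0\tID=cds1;Parent=mRNA1");
        System.out.println(f.type + " " + f.start + ".." + f.end + " " + f.attributes);
    }
}
```

A parser at this level is enough for drawing glyphs; turning the rows into gene, transcript and exon objects is where the per-application knowledge comes in.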


The following code example takes a 454Scaffolds fasta file that was used by GeneMark to predict genes and returns a
collection of ChromosomeSequences. Each chromosome sequence maps to a named entry in the fasta file and contains
N gene sequences. The gene sequences can be on the + or - strand, with frame shifts and multiple transcripts.

Passing the collection of ChromosomeSequences to GeneFeatureHelper.getProteinSequences returns all protein
sequences. You can then write the protein sequences to a fasta file.

```java
    // Load the scaffolds and attach the gene features predicted by GeneMark
    LinkedHashMap<String, ChromosomeSequence> chromosomeSequenceList =
            GeneFeatureHelper.loadFastaAddGeneFeaturesFromGeneMarkGTF(
                    new File("454Scaffolds.fna"), new File("genemark_hmm.gtf"));
    // Translate every gene model and write the proteins to a fasta file
    LinkedHashMap<String, ProteinSequence> proteinSequenceList =
            GeneFeatureHelper.getProteinSequences(chromosomeSequenceList.values());
    FastaWriterHelper.writeProteinSequence(new File("genemark_proteins.faa"), proteinSequenceList.values());
```

You can also output the gene sequences to a fasta file in which the coding regions are upper case and the non-coding regions are lower case.

```java
    LinkedHashMap<String, GeneSequence> geneSequenceHashMap = GeneFeatureHelper.getGeneSequences(chromosomeSequenceList.values());
    Collection<GeneSequence> geneSequences = geneSequenceHashMap.values();
    FastaWriterHelper.writeGeneSequence(new File("genemark_genes.fna"), geneSequences, true);
```
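
The upper/lower case convention can be illustrated without BioJava. The following is a minimal sketch, assuming exon coordinates are given as 1-based inclusive ranges relative to the gene sequence; `CaseMaskSketch` and `maskCase` are hypothetical names, and this is not how FastaWriterHelper is implemented internally.

```java
import java.util.Locale;

// Hypothetical illustration class; it is not part of BioJava.
final class CaseMaskSketch {

    // Lower-cases the whole gene, then upper-cases each exon range.
    // exons: array of {start, end}, 1-based and inclusive (an assumption of this sketch).
    static String maskCase(String gene, int[][] exons) {
        char[] out = gene.toLowerCase(Locale.ROOT).toCharArray();
        for (int[] exon : exons) {
            int from = Math.max(1, exon[0]);          // clamp to the sequence start
            int to = Math.min(out.length, exon[1]);   // clamp to the sequence end
            for (int i = from - 1; i < to; i++) {
                out[i] = Character.toUpperCase(out[i]);
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Two made-up exons at positions 1-3 and 7-8.
        System.out.println(maskCase("acgtacgtac", new int[][] {{1, 3}, {7, 8}}));
    }
}
```

With the made-up input above, positions 1-3 and 7-8 come out in upper case and everything else stays lower case, which is the same visual cue the gene fasta output gives for exons and introns.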

You can easily write out a gff3 view of a ChromosomeSequence with the following code.

```java
    // try-with-resources closes the stream even if write() throws
    try (FileOutputStream fo = new FileOutputStream("genemark.gff3")) {
        GFF3Writer gff3Writer = new GFF3Writer();
        gff3Writer.write(fo, chromosomeSequenceList);
    }
```

The chromosome sequence becomes the middle layer that represents the essence of what is mapped in a gtf, gff2 or
gff3 file. This makes it fairly easy to write code that converts from gtf to gff3 or from gff2 to gtf. The challenge
is picking the correct ontology when writing gtf or gff2. You could use the feature names of a specific gene
prediction program, or the features supported by your favorite genome browser. We would like to provide a
complete set of java classes for these conversions, with the list of supported gene prediction applications and
genome browsers growing based on end user requests.
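
One small, mechanical part of a gtf to gff3 conversion is the attribute column: gtf writes `key "value";` pairs, while gff3 writes `key=value` pairs separated by semicolons. The sketch below shows only that step, using a hypothetical class `GtfToGff3AttributesSketch` that is not part of BioJava. It assumes values contain no semicolons and does not apply the URL-escaping that gff3 requires for reserved characters, and it leaves the harder question of which feature names to emit untouched.

```java
// Hypothetical illustration class; it is not part of BioJava.
final class GtfToGff3AttributesSketch {

    // Converts a gtf attribute column, e.g. gene_id "g1"; transcript_id "t1";
    // into gff3 style, e.g. gene_id=g1;transcript_id=t1
    static String convert(String gtfAttributes) {
        StringBuilder sb = new StringBuilder();
        for (String field : gtfAttributes.split(";")) {
            String f = field.trim();
            int space = f.indexOf(' ');
            if (f.isEmpty() || space < 0) {
                continue; // skip empty fields and keys without a value
            }
            String key = f.substring(0, space);
            String value = f.substring(space + 1).trim();
            // Strip the surrounding double quotes that gtf uses for text values.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(key).append('=').append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up identifiers for illustration only.
        System.out.println(convert("gene_id \"1_g\"; transcript_id \"1_t\";"));
    }
}
```

Going through the chromosome sequence objects, as described above, avoids this kind of line-by-line rewriting, because the writer regenerates the attributes from the object model instead of translating text.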

