PetaGene glossary and file formats
This glossary page explains the terminology and file formats relating to the PetaSuite genomic data compression software for FASTQ and BAM files.
We also have a frequently asked questions (FAQ) page, which contains more information. If you cannot find the answer to your query here or on the FAQ page, please use the contact us form to send us a message.
By default, PetaSuite compression is completely lossless and does not change the underlying data. BayesCal mode can be used to perform Bayesian refinement of quality scores, which can improve genotyping accuracy. This modification of quality scores also leads to even better compression.
The bqfilt mode applies BayesCal and performs additional filtering for files that have been processed by GATK’s BQSR. This mode can further improve compression and genotyping for post-BQSR datasets. It is recommended in place of the BayesCal mode, and can safely be used on non-BQSR datasets as well.
An (optional) database of species-specific information that can be specified to help improve compression rate. The choice of corpus does not affect correctness — any corpus can safely be used to compress genomic data from any species. If a corpus is specified for compression, the same corpus needs to be installed for decompression. Each species has only one corpus — so the human corpus is the same regardless of which reference was used for aligning the data.
The PetaGene file format in which compressed FASTQ and FASTQ.gz files are stored. When using PetaLink, users see the FASTQ/FASTQ.gz representation, and do not need to interact directly with Fasterq files (except for copying).
PetaView is the old name for PetaLink. See PetaLink.
The PetaGene file format in which compressed BAM files are stored. When using PetaLink, users see the BAM representation, and do not need to interact directly with PGBAM files (except for copying).
A CRAM-compliant compression format that addresses deficiencies in regular CRAM compression. We recommend using PetaLink to access these files in their BAM representation. The files can also be accessed directly as CRAM; however, because of inherent limitations of the CRAM format, direct access bypasses the corrections that PetaGene applies to fields which CRAM cannot preserve.
When the PetaLink library is loaded, PetaGene compressed files (PGBAM, PetaGene CRAM, and Fasterq) are accessible via “virtual files” in BAM, FASTQ and/or FASTQ.gz format. Such virtual files are shown as symbolic links in the file system, and can be treated identically to regular files. Accessing a virtual file decompresses the underlying compressed file in memory, on the fly and on demand.
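Because virtual files appear as symbolic links and behave like regular files, ordinary file APIs should be enough to work with them. The following is a minimal sketch under that assumption: it uses only the Python standard library, the helper name `describe_file` and the example path are hypothetical, and it assumes the PetaLink library is already loaded for the process (how it is loaded is not covered here).

```python
import os


def describe_file(path, head_bytes=4):
    """Report whether `path` is a symbolic link and return its first bytes.

    The file is opened with the ordinary open() call, which follows
    symbolic links, so no special handling is needed for a link.
    """
    is_link = os.path.islink(path)      # True for symbolic links
    with open(path, "rb") as handle:    # same call as for any regular file
        head = handle.read(head_bytes)  # read only the first few bytes
    return {"path": path, "is_symlink": is_link, "head": head}


# Hypothetical usage (the path is illustrative only):
# info = describe_file("/data/sample.bam")
# print(info["is_symlink"], info["head"])
```

The point of the sketch is that nothing in it is specific to PetaGene: the same code works on an uncompressed file and, under the assumptions above, on a virtual file.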