PetaGene glossary and file formats

This glossary page gives details of the terminology and file formats relating to PetaSuite genomic data compression software for FASTQ and BAM files.

We also have a frequently asked questions (FAQ) page, which contains more information. If you cannot find the answer to your query here or on the FAQ page, please use the contact form to send us a message.

BayesCal
By default, PetaSuite compression is completely lossless and does not change the underlying data. BayesCal mode can be used to perform Bayesian refinement of quality scores, which can improve genotyping accuracy. This modification of quality scores also leads to even better compression.

bqfilt
The bqfilt mode applies BayesCal and performs additional filtering for files that have been processed by GATK’s BQSR. This mode can further improve compression and genotyping for post-BQSR datasets. It is recommended in place of the BayesCal mode, and can safely be used on non-BQSR datasets as well.

Corpus
An optional database of species-specific information that can be specified to help improve the compression rate. The choice of corpus does not affect correctness: any corpus can safely be used to compress genomic data from any species. If a corpus is specified for compression, the same corpus needs to be installed for decompression. Each species has only one corpus, so the human corpus is the same regardless of which reference was used for aligning the data.

Fasterq
The PetaGene file format in which compressed FASTQ and FASTQ.gz files are stored. When using PetaLink, users see the FASTQ/FASTQ.gz representation, and do not need to interact directly with Fasterq files (except for copying).

PetaLink
An LD_PRELOAD library that provides transparent access to uncompressed BAM and FASTQ files via virtual files for PetaGene compressed PGBAM, PetaGene CRAM, and Fasterq files in the filesystem. For the Cloud Edition, PetaLink additionally provides transparent access to cloud object storage.
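
As a rough illustration of what "an LD_PRELOAD library" means in practice, the Python sketch below starts a child process with a shared library preloaded through the standard `LD_PRELOAD` environment variable. The library path and the helper names `env_with_preload` and `run_with_petalink` are assumptions made for this example only; consult your installation for the actual library location and the supported way of enabling it.

```python
import os
import subprocess

# Hypothetical placeholder: the real library path depends on your installation.
PETALINK_LIB = "/path/to/petalink.so"


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts colon- or space-separated entries; colon is used here.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_with_petalink(cmd, lib_path=PETALINK_LIB):
    """Run cmd (a list of arguments) with the library preloaded in the child process."""
    return subprocess.run(cmd, env=env_with_preload(lib_path), check=True)
```

With a real library path in place, a call such as `run_with_petalink(["samtools", "view", "-H", "sample.bam"])` would run the tool in an environment where the preloaded library can intercept its file access (`sample.bam` is a made-up file name).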

PetaView
See PetaLink (PetaView is the old name).

PGBAM
The PetaGene file format in which compressed BAM files are stored. When using PetaLink, users see the BAM representation, and do not need to interact directly with PGBAM files (except for copying).

PetaGene CRAM
A CRAM-compliant compression format that addresses deficiencies in regular CRAM compression. We recommend that PetaLink be used to access these files in their BAM representation. The files can also be accessed directly as CRAM. However, because of inherent limitations of the CRAM format, direct access bypasses the corrections that PetaGene applies to fields which CRAM cannot preserve.

Virtual file
When the PetaLink library is loaded, PetaGene compressed files (PGBAM, PetaGene CRAM, and Fasterq) are accessible via “virtual files” in BAM, FASTQ and/or FASTQ.gz format. Such virtual files are shown as symbolic links in the file system, and can be treated identically to regular files. Accessing a virtual file decompresses the underlying compressed file in memory, on the fly and on demand.
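
Because virtual files are shown as symbolic links, a script can pick out candidates with ordinary file system calls. The sketch below is a minimal example that assumes virtual files carry the usual `.bam`, `.fastq` or `.fastq.gz` suffix. The function name `list_candidate_virtual_files` is made up for this illustration, and the check cannot tell a virtual file from an ordinary symbolic link with the same suffix.

```python
import os

# Assumed suffixes for the BAM, FASTQ and FASTQ.gz representations.
VIRTUAL_SUFFIXES = (".bam", ".fastq", ".fastq.gz")


def list_candidate_virtual_files(directory):
    """Return the sorted names of symbolic links in directory with a BAM/FASTQ suffix."""
    names = []
    for entry in os.scandir(directory):
        # Only symbolic links are considered; regular files are skipped.
        if entry.is_symlink() and entry.name.lower().endswith(VIRTUAL_SUFFIXES):
            names.append(entry.name)
    return sorted(names)
```

Any entry it returns can then be opened and read like a regular file, as the definition above describes.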

