PetaGene’s compression software addresses the challenges created by growing volumes of genomics data. It reduces both storage costs and data transfer times by up to 10x compared with BAM and gzipped FASTQ files, which is a 96% reduction compared with raw FASTQ files. It integrates transparently with existing storage infrastructure and bioinformatics pipelines. PetaSuite is a set of scalable, complementary software tools that significantly reduce the size and cost of NGS data for storage and transfer.
PetaSuite Cloud Edition
PetaSuite Cloud Edition (CE) does everything that standard PetaSuite does, with the additional innovation of enabling a user’s software tools and pipelines to seamlessly integrate with a wide variety of cloud platforms without modification. AWS, Azure, GCP, private cloud and hybrid cloud are all supported transparently.
Our robust, high-performance FASTQ.gz and BAM compression decompresses to exactly match the original file content. There is full validation and MD5 matching, meaning that not only is the internal content of FASTQ.gz and BAM files preserved, but the gzip wrappers also match exactly, allowing simpler archiving procedures to be used.
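The MD5 matching described above can also be checked independently with standard tools. The following is a minimal Python sketch, not PetaSuite's own validation step: the function names are hypothetical, and it assumes the MD5 of the original file was recorded before compression and that the file has since been restored to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_matches(original_md5, restored_path):
    """Hypothetical helper: True if the restored file is byte-identical
    (by MD5) to the original, including its gzip wrapper."""
    return md5_of_file(restored_path) == original_md5.lower()
```

Because the digest is taken over the whole file rather than the decoded records, a match covers the gzip wrapper as well as the internal content.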
PetaLink is a powerful virtual file access system. It enables migration of BAM and FASTQ.gz data to more efficient compression formats. For example, after the PetaSuite binary has been used to losslessly compress a BAM file, validate that all data in the BAM has been preserved, and remove the original BAM file, PetaLink makes available a high performance virtual BAM file view of the compressed file, with the filename of the original file, in the same location. This virtual file can then be used just like the original BAM file by Linux toolchains, pipelines and genome browsers transparently.
The Cloud Edition of PetaLink also allows files stored remotely in the cloud to be accessed as if they are local, without downloading them first!
BayesCal Quality Score Refinement
BayesCal uses a Bayesian approach to calculate a more complete posterior estimate of sequencer error. Genotyping accuracy is preserved across the ROC curve, with a net increase in accuracy. Improved compression is a side effect: compression ratios increase by a further 30–70% compared with lossless compression alone.
PetaGene lossless compression ratios, compared with CRAM
(human 30x WGS)
|Data type, sequencer|Aligner / pipeline|PetaGene compression ratio|Size reduction|CRAM compression ratio|
|---|---|---|---|---|
|FASTQ.gz, HiSeq X||3.0|67%|Not applicable|
|FASTQ.gz, NovaSeq||4.3|77%|Not applicable|
|BAM, HiSeq X|BWA-mem only|2.2|55%|1.9|
|BAM, HiSeq X|GATK|5.2|81%|1.5|
|BAM, NovaSeq|Isaac only|2.8|64%|2.3|
|BAM, NovaSeq|BWA-mem only|3.2|69%|2.4|
Note: using PetaGene’s optional BayesCal quality score refinement increases the compression ratio by a further 30%–70%.
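The ratio and percentage figures in the table are related by simple arithmetic: a compression ratio of r leaves 1/r of the original size, so the reduction is 1 - 1/r. The Python sketch below illustrates this. The function names are hypothetical, and the BayesCal helper assumes that "a further 30%–70%" means the lossless ratio is multiplied by 1.3 to 1.7, which is one reading of the note rather than a confirmed definition.

```python
def size_reduction_percent(ratio):
    """Percentage size reduction implied by a compression ratio (original / compressed)."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return (1 - 1 / ratio) * 100


def ratio_with_bayescal(lossless_ratio, further_increase):
    """Assumed interpretation: BayesCal scales the lossless ratio by (1 + further_increase),
    where further_increase is a fraction such as 0.3 or 0.7."""
    return lossless_ratio * (1 + further_increase)


if __name__ == "__main__":
    # Ratios taken from the table above (human 30x WGS).
    for label, ratio in [("FASTQ.gz, HiSeq X", 3.0), ("BAM, HiSeq X, GATK", 5.2)]:
        print(f"{label}: {ratio}x -> {round(size_reduction_percent(ratio))}% smaller")
```

Under the same arithmetic, the 10x figure quoted for BAM and gzipped FASTQ corresponds to a 90% reduction relative to those already-compressed files.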