Frequently Asked Questions

Got a question about PetaSuite?

We’ve created this FAQ page to help you find answers to your questions about PetaSuite genomic data compression software for FASTQ and BAM files.

We also have a glossary of terms and file formats that you will encounter when using PetaSuite compression software.

If you cannot find the answer to your query here or on the glossary page, please use the contact us form to send us a message.

Q: What does PetaSuite consist of?

A: PetaSuite consists of two components: PetaSuite itself, a command line tool for explicit conversion to and from the PetaGene formats, and PetaLink, a user mode library that instantly extends existing applications to handle the PetaGene formats.

Q: How does PetaSuite manage licensing?

A: License checking requires HTTPS access to a specific domain, so all license compliance checks occur over encrypted TLS connections. Alternative arrangements are possible, according to client needs.

Q: Which operating systems can run PetaSuite?

A: PetaSuite can currently be installed on Debian-based or Red Hat-based operating systems. We also support Integrative Genomics Viewer (IGV) for Windows and Mac.

Q: Can I install PetaSuite without admin privileges?

A: Yes, admin privileges are not necessary for installing PetaSuite.

Q: Besides the command line tool and library, are any other files required to run PetaSuite?

A: PetaSuite uses a corpus to help maximise compression and decompression. We recommend that you install at least the human corpus. Corpuses for seventy other species are also available. These corpuses work independently of the reference used for alignment, so it does not matter which reference was used to align your data, and they even work for compressing de-novo aligned data.

Q: Does PetaSuite work with data from any species?

A: Yes, PetaSuite works with data from any species, even if no specific corpus for the target species is available. That includes de-novo aligned data. PetaSuite can also auto-detect the closest matching corpus for optimising compression.

Q: Does PetaSuite fully preserve the data files?

A: Yes, if md5match lossless compression is selected, the compressed data can be restored as bit-for-bit identical to the original BAM and FASTQ.gz files. Therefore, we recommend that the original data is deleted once verification is complete.

Q: Does PetaSuite validate the compressed files against the original FASTQ/BAM files?

A: Yes, quick validation of the first one million reads is enabled by default. You can also choose to directly check MD5 checksums.
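For readers who want to see what an MD5 check involves in general, here is a minimal Python sketch. It is not PetaSuite's own validation code. The function names `md5_of_file` and `files_match` are hypothetical and chosen only for this illustration. It uses the standard `hashlib` module and assumes you have both the original file and a restored copy on disk.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that large BAM or FASTQ.gz files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(original_path, restored_path):
    """True when both files have the same MD5 digest, which is the
    usual evidence that they are bit-for-bit identical."""
    return md5_of_file(original_path) == md5_of_file(restored_path)
```

Calling `files_match("sample.bam", "restored.bam")` (file names are placeholders) returns `True` only if the two files hash to the same value.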

Q: Do I need to change paths in my pipelines to work with the compressed data?

A: No, the PetaLink user mode library ensures that your pipelines will access the compressed virtual versions, with no modifications.

Q: Can multiple input files be merged to a single file on output?

A: Yes, this is the equivalent of concatenating FASTQ.gz files.
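As background on why concatenating FASTQ.gz files is a valid merge, the gzip format allows several compressed members to be joined byte-for-byte, and a reader sees the joined file as the concatenation of their contents. The Python sketch below demonstrates this with the standard library only. It does not use or describe PetaSuite's merge option, and the function names are hypothetical.

```python
import gzip
import shutil


def concatenate_gzip_files(input_paths, output_path):
    """Join gzip files by appending their raw bytes in order.
    No decompression or recompression takes place."""
    with open(output_path, "wb") as merged:
        for path in input_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)


def read_gzip_text(path):
    """Decompress a (possibly multi-member) gzip file to text."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

Reading the merged file with `read_gzip_text` yields the reads of the first input followed by the reads of the second, in the order the inputs were given.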

Q: Does PetaSuite support load balancing and distributed tasks?

A: PetaSuite supports load balancing using Slurm and similar compatible utilities when processing multiple files.

Q: Which cloud platforms does PetaSuite support?

A: AWS, Google Cloud, Azure, and S3-compatible platforms (e.g. Alibaba, Oracle), as well as hybrid and private cloud platforms, are supported transparently. PetaSuite CE treats cloud destinations as though they were regular directories. A compression operation streams the file from the cloud, compresses it locally, and streams the result back to the cloud, writing to the same or a different destination. PetaLink also streams decompression operations from cloud platforms.

Q: I know that PetaLink supports all analysis tools out of the box because the files are presented back bit for bit identical after compression. Have you done any formal testing of popular applications?

A: We have an ongoing program of formal testing for popular analysis applications. This list shows the applications we have successfully tested so far. If you do not see an application you use, please contact us to arrange an evaluation of our software.

Tool        Version(s) tested    Application type
samtools    1.0–1.9              Toolkit
bamtools    2.4.0–2.5.1          Toolkit
bcftools    1.4–1.9              Variant caller
bedtools    2.18.2–2.28.0        Toolkit
BWA-MEM     0.7.13–0.7.17        Mapper
bwa-mem2    2.0pre1              Mapper
GATK 3      3.8-1-0              Pipeline
GATK 4      4.1.2.0              Pipeline
Manta       1.5.0                Variant caller
Picard      2.9.5–2.20.3         Toolkit
PySAM       0.15.3               Toolkit
Sambamba    0.5.1–0.7.0          Toolkit
seqtk       1.1–1.3              Toolkit
Strelka2    2.9.10               Variant caller

 

 
