January 2021 - PetaGene

HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts 2) is a graph-based read mapping tool for both DNA and RNA sequences.

HISAT2 enables a fast search through its graph index, mapping reads to the entire human genome along with a large number of variants. Since it is a widely used tool, at PetaGene we have added it to our test suite to ensure our compressed files and transparent decompression technology work optimally with HISAT2.

In this blog post we demonstrate the compatibility of PetaGene software with HISAT2, showing benefits such as faster transfer and shorter analysis times as a result of working with compressed data.

PetaGene’s compression software, PetaSuite, is used to losslessly compress FASTQ.gz and BAM files. The compressed files are between 60%–90% smaller, helping our customers save on storage costs and store more samples per petabyte of storage space.

In addition, PetaSuite comes with a user-mode library called PetaLink that transparently handles file decompression on-the-fly. This allows tools to interact with the compressed files via a virtual file so they see only the original, uncompressed data. There is, therefore, no need to modify any tools or pipelines, and there is also no need to decompress the files before use. PetaLink allows for fast random access into the file while performing transparent just-in-time decompression. Thus, PetaLink ensures that PetaGene compressed files are interoperable with existing tools and workflows.

Furthermore, users can connect their cloud/object storage buckets using the pgman utility, and PetaLink reads these cloud paths as if they are local filesystem paths. This means that, thanks to PetaLink and pgman, all tools and workflows work seamlessly with cloud paths for both reading and writing.

To test HISAT2 with PetaGene’s software, we losslessly compressed paired, FASTQ.gz reads from an RNA-seq experiment using PetaSuite Cloud Edition (CE). PetaSuite CE is able to write files directly to cloud storage. In this example, the files are compressed and streamed directly to an AWS S3 bucket created for this study.

$ time petasuite --compress --md5match --validation md5full \
--species auto ERR1138635_1.fastq.gz ERR1138635_2.fastq.gz \
--dstpath s3://demo_hisat/hisat2_testfiles/compressed

Comparison of file sizes before and after compression — Figure 1. Comparison of file sizes after gzip compression (grey) and after PetaSuite compression (blue) for paired FASTQ files. The PetaSuite-compressed files are 60% smaller than the original gzipped files.

The PetaSuite compressed data are used as the input into HISAT2. To do this the PetaLink decompression library needs to be loaded. PetaLink transparently handles decompression of the compressed files, serving original uncompressed data to HISAT2. With the PetaLink library loaded, virtual, uncompressed files are inserted into the filesystem (shown below in blue). These virtual files look and behave exactly as the original uncompressed file would, but the virtual files do not consume any inode resources.

$ LD_PRELOAD=/usr/lib/petalink.so bash # load petalink
$ ls -lh s3://demo_hisat/hisat2_testfiles/compressed
ERR1138635_1.fasterq
ERR1138635_1.fastq.gz -> ERR1138635_1.fasterq.#.fastq.gz
ERR1138635_2.fasterq
ERR1138635_2.fastq.gz -> ERR1138635_2.fasterq.#.fastq.gz

By default, we present virtual files as FASTQ.gz files. However, HISAT2 takes unzipped FASTQ files as the input, normally performing an unzip step if a FASTQ.gz file is used. With PetaSuite-compressed files, it is possible to have each compressed file presented back as multiple different virtual files by simply setting an environment variable. Below, we set the environment variable to generate a virtual FASTQ file. Thus one FASTERQ file can be presented as multiple different virtual files.

$ export PetaLinkMode=+fastq
$ ls -lh s3://demo_hisat/hisat2_testfiles/compressed
ERR1138635_1.fasterq
ERR1138635_1.fastq -> ERR1138635_1.fasterq.#.fastq
ERR1138635_1.fastq.gz -> ERR1138635_1.fasterq.#.fastq.gz
ERR1138635_2.fasterq
ERR1138635_2.fastq -> ERR1138635_2.fasterq.#.fastq
ERR1138635_2.fastq.gz -> ERR1138635_2.fasterq.#.fastq.gz

These virtual files can be used directly with HISAT2, e.g.:

$ export PetaLinkMode=+fastq
$ hisat2 -x grch38/genome \
-1 s3://demo_hisat/hisat2_testfiles/compressed/ERR1138635_1.fastq \
-2 s3://demo_hisat/hisat2_testfiles/compressed/ERR1138635_2.fastq

Compressing files using PetaGene’s technology offers a number of benefits. The smaller size means the cost of storing the data is lower. Additionally, transferring the data between storage and computing nodes is faster, reducing I/O bottlenecks. Furthermore, PetaLink ensures transparent, fast access of compressed data to existing tools and workflows without any modifications.

Here, we carried out a series of benchmarks showing that HISAT2 works with our compressed files and the transparent decompression library, giving performance improvements in the speed of transfer and analysis.

To test the performance gains when using compressed vs uncompressed data we used the pipeline shown in Figure 2.

Results

1. PetaGene-compressed files transparently work with HISAT2.

This demo shows that PetaGene compressed files can be used as input into HISAT2 using the PetaLink library to perform transparent decompression to serve original, uncompressed data to HISAT2. The subsequent output data is in the SAM format, and so, to generate a BAM file, we piped the output to samtools (samtools sort -@ 8 -O BAM -o output.bam) as shown in Figure 2.

2. Runtime is shorter when using PetaGene compressed files.

When executing the pipeline, we measured the time it took to process the samples when the data was stored locally on the instance or by directly streaming the data from S3 (Figure 3). In both scenarios, using compressed files sped up the analysis with shorter runtimes recorded. Importantly, PetaLink Cloud Edition carries out transparent, just-in-time decompression and allows streaming data directly from S3.

3. Streaming directly from the cloud removes the need for local storage on the instance.

By streaming the data directly from S3 using PetaLink, we didn’t need to attach large storage to the compute instance.

4. PetaLink allows HISAT2 to take input files that are stored in the cloud.

Furthermore, PetaLink enables HISAT2 to work with data stored in cloud/object storage (AWS, GCP, Azure and any S3-compliant object storage) without any need to modify HISAT2.

Figure 3. Compressed files have a shorter runtime than the original data independent of whether the files are stored locally on the instance or in cloud/object storage (S3).

PetaLink technology enables:

Cloud-enabling technology — allowing tools to natively work with data stored in the cloud.

With the PetaLink library loaded, standard cp can be used to copy files directly from cloud storage to the instance. Here we copied the paired, FASTQ.qz files (7.62 GB) from S3 to the C4.8x large instance, which has a 10 gigabit network connection. We used the standard Linux cp binary and the aws s3 cp command-line tool.

High-speed transfer of data

Using cp + PetaLink saw a speedup of 40% on average, with the average speed of transfer 1.85 Gbps compared to 1.32 Gbps for aws s3 cp.

Figure 4. PetaLink enables the standard Linux cp to download the paired FASTQ.gz files (7.62 GB) directly from cloud storage and is faster than AWS S3 cp.

Summary

PetaGene offers bioinformatics users lossless compression software that helps to reduce storage costs and accelerate data transfer and analysis.

- PetaGene reduced the storage costs by 60%, reduced the transfer time for uncompressed data by 40%, reduced the transfer time for compressed data by 76%, and reduced the analysis time by about 7%.

- Using our PetaLink library, decompression of the files is handled transparently, meaning that compressed files can be used for day-to-day analysis as PetaLink ensures that all tools and applications see original, uncompressed data. The analysis runs faster when the compressed data is used as input in this way.

- PetaLink Cloud has additional features: users are able to pair public as well as private cloud storage buckets, and access cloud paths as if they were local filesystem paths.

- Standard Linux tools and other binaries are able to work with cloud paths when PetaLink is enabled. Here we described how the standard Linux cp command works with cloud paths, and that the transfer is accelerated when compared with the AWS CLI copy tool.

- Furthermore, HISAT2 is able to read input files located in the cloud. This means users avoid the need to first download the data to an instance. It also reduces the amount of storage space required on an instance and allows users to directly stream data from the cloud.

Overall, at PetaGene we have architected our compression solution to ensure data integrity, interoperability with all tools and pipelines through transparent decompression, and support for cloud storage to help reduce costs of cloud computing.

Sample files were downloaded from ENA, and the HISAT2 genome references were copied from the AWS Registry for Open Data.

Some additional information about the samples used in this analysis:

- - Illumina HiSeq 2500 paired-end sequencing;
  - RNA-seq of liver tissue and liver cancer cell lines

Datasets - https://www.ebi.ac.uk/ena/browser/view/ERR1138635

Reference - https://registry.opendata.aws/jhu-indexes/

Month: January 2021

HISAT2 benchmarked with PetaGene’s compression and transparent readback tools

Results

PetaLink technology enables:

Summary

PetaGene’s customers have compressed over 3 million genome files