# PetaLink

PetaLink is a powerful virtual file access system. It enables migration of BAM and FASTQ.gz data to more efficient compression formats. If you have received files in the compressed .pgbam or .fasterq formats, PetaLink lets you access them transparently as the original BAM or FASTQ.gz files.

Access is provided through a high-performance virtual BAM or FASTQ.gz view of the compressed file. The virtual file carries the filename of the original file and is presented as a symbolic link alongside the .pgbam or .fasterq file. PetaLink performs on-the-fly, random-access decompression through this virtual file, without physically decompressing the file. Linux toolchains, pipelines and genome browsers can therefore use the virtual file just like the original BAM or FASTQ.gz file. This on-the-fly decompression does not slow down your analysis; it speeds it up because of the I/O savings.

## PetaLink installation

After downloading the PetaLink package, make it executable:

    chmod +x petalink_X.X.X.run-XXXX

Then execute the self-extracting package by running:

    ./petalink_X.X.X.run-XXXX

After you read and accept the End User License Agreement, a directory called 'petalink_X.X.X' is created.

## Usage

The extracted package directory ('petalink_X.X.X') contains the 'petalink.so' library.

PetaLink can be started manually or automatically.

For manual startup, specify LD_PRELOAD= before starting a command. For example, to start a bash instance with PetaLink loaded, run:

    LD_PRELOAD= bash

This bash instance, and any commands executed from within it, will run with the PetaLink library loaded. Other instances of bash and other processes are not affected, so it can be useful to start PetaLink automatically instead.

For automatic startup, the easiest method is to define this environment variable in a startup script.
For example, adding the following line to .bashrc:

    export LD_PRELOAD=

ensures that PetaLink is loaded whenever bash is started.

The PetaLinkMode environment variable further configures PetaLink, and can be changed while PetaLink is already running. The available PetaLinkMode options are:

| Option | Description |
| --- | --- |
| `off` | Disables PetaLink |
| `+mt` | Enables multi-threaded decompression of virtual BAM and .fastq files (default) |
| `+mt[=numthread]` | Sets a limit on the number of threads for decompression of virtual BAM and .fastq files |
| `-mt` | Disables multi-threaded decompression of virtual BAM and .fastq files |
| `+fastq` | A .fastq virtual file is made available for each FasterQ file |
| `+fq` | A .fq virtual file is made available for each FasterQ file |
| `+fqgz` | A .fq.gz virtual file is made available for each FasterQ file |
| `+fastqgz` | A .fastq.gz virtual file is made available for each FasterQ file (default) |
| `-fastqgz` | The default .fastq.gz virtual file is not made available for each FasterQ file |
| `+pgbam` | A .bam virtual file is made available for each PetaGene-compressed BAM file (default) |
| `-pgbam` | The default .bam virtual file is not made available for each PetaGene-compressed BAM file |
| `+cram` | A .bam virtual file is made available for each CRAM file (default) |
| `-cram` | The default .bam virtual file is not made available for each CRAM file |
| `md5match` | The virtual file should be recompressed to match the original file's MD5 checksum (slow) |
| `+static` | Enables interception of statically compiled binaries |
| `-static` | Disables interception of statically compiled binaries (default) |
| `+closekeepcache[=N]` | Keeps recently closed virtual files cached for N*0.1 seconds. This helps the performance of applications such as Apache, which frequently open, close and re-open files. Disabled by default; if N is not specified, N=10 is used, i.e. 1 second. |

For example, to replace the .fastq.gz virtual files with .fq.gz and .fq ones, use:

    export PetaLinkMode="+fqgz -fastqgz +fq"

Note that quotation characters (i.e. ") are needed to enclose the arguments.
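
The quoting requirement above comes from ordinary shell word splitting: without the quotes, the shell would pass `-fastqgz` and `+fq` to `export` as separate arguments instead of making them part of the value. The following sketch shows only generic shell behaviour and does not depend on PetaLink being installed. It also shows a prefix assignment, which is an assumption about how you might want to scope a setting to one command, not something this README prescribes.

```shell
# Quoted: the whole option string becomes a single value.
export PetaLinkMode="+fqgz -fastqgz +fq"

# A child process inherits the value intact, spaces included.
sh -c 'printf "%s\n" "$PetaLinkMode"'

# A prefix assignment changes the value for one command only;
# the exported value in the current shell is left unchanged.
PetaLinkMode="-mt" sh -c 'printf "%s\n" "$PetaLinkMode"'
```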

If the original file was compressed in PetaSuite with the --md5match option enabled, then when the PetaLinkMode option md5match is selected (with no plus or minus character before it), the virtual file representation should match the original compressed BAM or FASTQ.gz file. Note that enabling this option can be very CPU-intensive and is generally discouraged unless exact MD5 matching is absolutely needed by the tool. Pipelines using virtual FASTQ.gz and BAM files with this option disabled will still see the same raw data and will run faster. This option is disabled by default for high-throughput performance.

When high-throughput performance is required, we recommend enabling multi-threaded decompression in PetaLink and disabling md5match. Multi-threaded mode is enabled by default, with the number of threads scaling according to demand. The explicit option for this is:

    export PetaLinkMode="+mt"

You can also explicitly limit the number of threads by specifying a number, such as four:

    export PetaLinkMode="+mt=4"

If no parameter is given, the number of threads is scaled automatically according to usage. To disable multi-threaded mode, use:

    export PetaLinkMode="-mt"

## Decompressing PetaGene-compressed files

You can easily recover the original BAM or FASTQ.gz from a PetaGene-compressed file by copying the virtual BAM or FASTQ.gz file generated by the PetaLink library to a new location. For example, the following sequence of commands generates the original BAM from the PetaGene-compressed 'sample.pgbam':

    export LD_PRELOAD=""              # load the PetaLink library
    export PetaLinkMode="md5match"    # match the original file's MD5 checksum
    mkdir -p original                 # create a new directory to store the original file
    cp sample.bam original/.          # decompress the original file
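
If you have the MD5 checksum of the original file on record, the copy step can be combined with a check that the restored file matches it. The helper below is a minimal sketch, not part of PetaLink: the function name `copy_and_check_md5` and its arguments are hypothetical, and it assumes the GNU `md5sum` utility is available. It works on any file, so it only verifies the original's checksum when run on the virtual file with PetaLink loaded and md5match selected.

```shell
# Hypothetical helper, not part of PetaLink: copy a file into a directory
# and succeed only if the copy's MD5 equals the expected checksum.
copy_and_check_md5() {
    src=$1
    dest_dir=$2
    expected_md5=$3

    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1

    # Reading from stdin makes md5sum print "<checksum>  -";
    # cut keeps only the checksum field.
    actual_md5=$(md5sum < "$dest_dir/$(basename "$src")" | cut -d ' ' -f 1)
    [ "$actual_md5" = "$expected_md5" ]
}

# Example call (sample.bam and the checksum value are placeholders):
#   copy_and_check_md5 sample.bam original "<expected md5>" && echo "MD5 matches"
```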