PetaLink
********

PetaLink is a powerful virtual file access system. It enables
migration of BAM and FASTQ.GZ data to more efficient compression
formats.

If you have received files in compressed ".pgbam" or ".fasterq"
formats, PetaLink allows you to transparently access these as the
original BAM or FASTQ.GZ files. This access is made available via a
high performance virtual BAM or FASTQ.GZ file view of the compressed
file, with the filename of the original file, presented as a symbolic
link alongside the ".pgbam" or ".fasterq" file.

PetaLink performs on-the-fly, random access decompression via this
virtual file, without physically decompressing the file. This virtual
file can therefore be used just like the original BAM or FASTQ.GZ file
by Linux toolchains, pipelines and genome browsers. This on-the-fly
decompression does not slow down your analysis - it speeds it up
because of the I/O savings.

PetaLink also allows you to physically decompress the files if you
prefer, but that would be denying yourself the benefit of the smaller
file footprint and the faster analysis speed.


Installing PetaLink
===================

After downloading the PetaLink package, assign to it an execute
permission "+x":

   chmod +x petalink_X.X.X.run

Then you can execute the self-extracting package by running:

   ./petalink_X.X.X.run

A directory called "petalink_X.X.X" will be created after reading and
accepting the End User License Agreement. The contents of this
directory will be as follows:

   petalink-eula.txt
   PetaLink-README.txt
   petalink_install_corpus
   bin/petalink.so

We recommend that you move this directory to the most appropriate path
in your file system and create a symbolic link to the "petalink.so"
library for convenience.
For example, in the case where you have the necessary privileges:

   mkdir /opt/petagene
   mv petalink_1.2.6 -t /opt/petagene
   ln -s /opt/petagene/petalink_1.2.6/bin/petalink.so /usr/lib/


Determine if a corpus is needed
===============================

The organization that created the compressed ".pgbam" or ".fasterq"
files that you have received might have used a corpus to aid
compression. If so, you will need to install a subset of the corpus
(the decompression corpus) of the relevant species to allow PetaLink
to decompress the files when you access them. On the other hand, if
the ".pgbam" or ".fasterq" files were created without a corpus, then
PetaLink does not need one.


Installing a corpus
===================

The "petalink_install_corpus" script can be used to install the
decompression corpus for a particular species, or for all 71 currently
available species, which add up to a total of 32 GB, by running:

   cd # change to PetaLink's directory
   ./petalink_install_corpus [species | all] [optional:installation-path] # run the installation script

If you need to download and install only the human decompression
corpus (682 MB), replace "species" with "human" in the command above.
You can also specify any other species that you need (see the list at
the end of this document). Please note that the actual species of the
sample does not matter; what matters is which corpus was used for
compression.

By default the corpuses are installed in:

   ./species

To use corpuses installed in an alternative location, set the
"PETASUITE_REFPATH" environment variable to point to the installation
path. Multiple paths can be searched, separated by a colon ":". For
example:

   export PETASUITE_REFPATH=/home/user/path:/species


PetaLink Usage
==============

In the extracted contents of the package ("petalink_X.X.X" directory)
you can find the "petalink.so" library in the "bin" sub-directory.
PetaLink can be started manually as well as automatically.
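
Both startup methods load the library by its file path, and the
dynamic loader generally skips a preload entry that it cannot open, in
which case commands would run without PetaLink. A minimal sketch of a
pre-flight check follows. The function name "petalink_lib_ok" is
hypothetical and not part of the PetaLink package, and the path passed
to it is whatever location you installed the library to:

```shell
# Hypothetical helper (not part of PetaLink): succeed only if the given
# path is a regular, readable file. Symbolic links are followed.
petalink_lib_ok() {
    if [ -f "$1" ] && [ -r "$1" ]; then
        return 0
    fi
    echo "PetaLink library not readable: $1" >&2
    return 1
}
```

For example, "petalink_lib_ok /usr/lib/petalink.so" succeeds only when
the symbolic link from the installation example resolves to a readable
file.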

For manual startup, specify "LD_PRELOAD=/bin/petalink.so" before
starting a command. For example, to start a "bash" instance with
PetaLink loaded run:

   LD_PRELOAD=/bin/petalink.so bash

or if the symbolic link was created following the example given in the
'Installing PetaLink' section above, you can instead run:

   LD_PRELOAD=/usr/lib/petalink.so bash

This instance of "bash" and any commands executed from within it will
run with the PetaLink library loaded. Other instances of "bash" and
other processes are not affected, so it can be useful to load PetaLink
automatically instead.

For automatic startup, the easiest method is to define this
environment variable in a startup script. For example, adding the
following line to ".bashrc":

   export LD_PRELOAD=/bin/petalink.so

ensures that PetaLink is loaded whenever "bash" is started.

With PetaLink loaded, the data within a compressed file can be
accessed via a high performance virtual BAM or FASTQ.GZ file view of
the compressed file, with the filename of the original file, which is
presented as a symbolic link alongside the ".pgbam" or ".fasterq"
file.


Configuring PetaLink
====================

The environment variable "PetaLinkMode" can be set to further
configure PetaLink, including while it is already running. Here are
the options available for "PetaLinkMode" from this version onwards:

+---------------------+----------------------------------------------------+
| off                 | Disables PetaLink.                                 |
+---------------------+----------------------------------------------------+
| +static             | Enables interception of statically compiled        |
|                     | binaries.                                          |
+---------------------+----------------------------------------------------+
| -static             | Disables interception of statically compiled       |
|                     | binaries. (default)                                |
+---------------------+----------------------------------------------------+
| +closekeepcache[=N] | Keep recently closed virtual files cached for      |
|                     | N*0.1 seconds. This helps the performance of       |
|                     | applications like Apache which frequently          |
|                     | open/close/reopen files. (Disabled by default; if  |
|                     | enabled without "N", the default is "N=10", i.e.   |
|                     | 1 second.)                                         |
+---------------------+----------------------------------------------------+
| md5match            | Virtual file should be recompressed to match the   |
|                     | original file's MD5 checksum (slow). Requires      |
|                     | original BAI file to be present for indexing.      |
+---------------------+----------------------------------------------------+
| +mt                 | Enables multi-threaded decompression of virtual    |
|                     | BAM and ".fastq" files. (default)                  |
+---------------------+----------------------------------------------------+
| +mt[=numthread]     | Limits the number of threads used for              |
|                     | decompression of virtual BAM and ".fastq" files    |
|                     | to "numthread".                                    |
+---------------------+----------------------------------------------------+
| -mt                 | Disables multi-threaded decompression of virtual   |
|                     | BAM and ".fastq" files.                            |
+---------------------+----------------------------------------------------+
| +fastq              | ".fastq" virtual file is made available for each   |
|                     | FASTERQ file.                                      |
+---------------------+----------------------------------------------------+
| +fq                 | ".fq" virtual file is made available for each      |
|                     | FASTERQ file.                                      |
+---------------------+----------------------------------------------------+
| +fqgz               | ".fq.gz" virtual file is made available for each   |
|                     | FASTERQ file.                                      |
+---------------------+----------------------------------------------------+
| +fastqgz            | ".fastq.gz" virtual file is made available for     |
|                     | each FASTERQ file. (default)                       |
+---------------------+----------------------------------------------------+
| -fastqgz            | The default ".fastq.gz" virtual file is not made   |
|                     | available for each FASTERQ file.                   |
+---------------------+----------------------------------------------------+
| +pgbam              | ".bam" virtual file is made available for each     |
|                     | PetaGene-compressed BAM file. (default)            |
+---------------------+----------------------------------------------------+
| -pgbam              | The default ".bam" virtual file is not made        |
|                     | available for each PetaGene-compressed BAM file.   |
+---------------------+----------------------------------------------------+
| +cram               | ".bam" virtual file is made available for each     |
|                     | CRAM file. (default)                               |
+---------------------+----------------------------------------------------+
| -cram               | The default ".bam" virtual file is not made        |
|                     | available for each CRAM file.                      |
+---------------------+----------------------------------------------------+
| +autoshowBAI        | Directory listings automatically remap             |
|                     | "sample.pgbai" (or "sample.cram.pgbai") to         |
|                     | "sample.bai" in addition to "sample.bam.bai" if    |
|                     | the original, uncompressed BAM file had an         |
|                     | accompanying "sample.bai" index file, and MD5      |
|                     | match mode is disabled. (default)                  |
+---------------------+----------------------------------------------------+
| -autoshowBAI        | Disable showing a "sample.bai" file for virtual    |
|                     | index files, even if such a file existed before    |
|                     | compression.                                       |
+---------------------+----------------------------------------------------+
| +allowBAI           | Even when no virtual "sample.bai" file is shown in |
|                     | file listings, permit access to such a file for    |
|                     | matching virtual "sample.bam" BAM files. (default) |
+---------------------+----------------------------------------------------+
| -allowBAI           | Disable all virtual "sample.bai" file access.      |
+---------------------+----------------------------------------------------+
| +write[=N]          | Enable write interception. When active, ".bam" and |
|                     | ".fastq(.gz)" files that are written will be       |
|                     | automatically compressed and saved as PGBAM and    |
|                     | FASTERQ files, respectively. By default, up to     |
|                     | "N=20" simultaneously opened files are supported.  |
|                     | Any further files opened for writing will not be   |
|                     | intercepted, and will be saved uncompressed.       |
+---------------------+----------------------------------------------------+

For example, to replace the default ".fastq.gz" virtual files with
".fq.gz" and ".fq" ones:

   export PetaLinkMode="+fqgz -fastqgz +fq"

Note that quotation characters (i.e. ") are needed to enclose the
options in this example because the options are separated by spaces.

If the original file was compressed in PetaSuite with option
"--md5match" enabled, then when "PetaLinkMode" option "md5match" is
selected (no plus or minus character before it), the virtual file
representation should match the original compressed BAM or FASTQ.GZ
file. Note that enabling this option can be very CPU-intensive and is
generally discouraged unless exact MD5 matching is absolutely needed
by the tool. Pipelines using virtual FASTQ.GZ and BAM files with this
option disabled will still see the same raw data and will run faster.
This option is disabled by default for high throughput performance.
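
Because "md5match" is CPU-intensive, one option is to select it only
for the single command that needs an exact match, leaving the rest of
the session in its existing mode. The sketch below relies only on the
standard "env" utility. The function name "run_with_md5match" is
hypothetical, and the sketch assumes that PetaLink reads
"PetaLinkMode" from the environment of the process it is loaded into:

```shell
# Hypothetical wrapper (not part of PetaLink): run one command with
# PetaLinkMode set to "md5match" for that command only. Any other
# options in the session's PetaLinkMode are replaced for that command.
run_with_md5match() {
    env PetaLinkMode="md5match" "$@"
}
```

For example, "run_with_md5match md5sum sample.bam" would checksum the
virtual file in "md5match" mode, while later commands keep the
session's previous "PetaLinkMode" value.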

When high throughput performance is required, we recommend that
multi-threaded decompression be enabled in PetaLink and that
"md5match" be disabled. Multi-threaded mode is enabled by default,
with the number of threads scaling according to demand, and the
explicit option for this is:

   export PetaLinkMode="+mt"

You can also explicitly limit the number of threads by specifying a
number, such as four:

   export PetaLinkMode="+mt=4"

If left without a parameter, the number of threads is automatically
scaled according to usage. To disable multi-threaded mode use:

   export PetaLinkMode="-mt"


Decompressing PetaGene-compressed files
=======================================

You can get the original BAM or FASTQ.GZ from a PetaGene-compressed
file easily by ensuring "md5match" mode is selected and copying the
virtual BAM or FASTQ.GZ generated by the PetaLink library to a new
location. For example, the following sequence of commands generates
the original BAM from the PetaGene-compressed "sample.pgbam":

   export LD_PRELOAD= # load the PetaLink library
   export PetaLinkMode="md5match" # match the original file's MD5 checksum
   mkdir -p original # create a new directory to store the original file
   cp sample.bam original # copy the virtual file, writing out the original BAM

Note that there is usually no need to recreate the original file like
this. PetaLink allows the data to be accessed directly via a high
performance virtual BAM or FASTQ.GZ file view of the compressed file,
with the filename of the original file, which is presented as a
symbolic link alongside the ".pgbam" or ".fasterq" file. If you
physically recreate the original file, you would be denying yourself
the benefit of the smaller file footprint and the faster analysis
speed.
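
If you do recreate a file in "md5match" mode, the copy can be checked
against a checksum recorded before compression. A minimal sketch,
assuming the "md5sum" utility from GNU coreutils is installed and that
you have the original checksum to hand. The function name "verify_md5"
is hypothetical and not part of the PetaLink package:

```shell
# Hypothetical helper (not part of PetaLink): compare a file's MD5
# checksum with an expected value recorded before compression.
verify_md5() {
    file=$1
    expected=$2
    # Reading from stdin makes md5sum print "<checksum>  -";
    # keep only the checksum field.
    actual=$(md5sum < "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "MD5 match: $file"
    else
        echo "MD5 mismatch: $file (got $actual, expected $expected)" >&2
        return 1
    fi
}
```

For example, "verify_md5 original/sample.bam <checksum>" returns
success only when the recreated file has the recorded checksum.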


Species
=======

Here is the list of the names of the 71 species for which corpuses are
available:

+-------------------------+----------------------------+--------------------------+
| human                   | gallus_gallus              | ovis_aries               |
+-------------------------+----------------------------+--------------------------+
| ailuropoda_melanoleuca  | gasterosteus_aculeatus     | pan_troglodytes          |
+-------------------------+----------------------------+--------------------------+
| anas_platyrhynchos      | gorilla_gorilla            | papio_anubis             |
+-------------------------+----------------------------+--------------------------+
| anolis_carolinensis     | heliconius_melpomene       | pelodiscus_sinensis      |
+-------------------------+----------------------------+--------------------------+
| astyanax_mexicanus      | ictidomys_tridecemlineatus | petromyzon_marinus       |
+-------------------------+----------------------------+--------------------------+
| bos_taurus              | latimeria_chalumnae        | poecilia_formosa         |
+-------------------------+----------------------------+--------------------------+
| caenorhabditis_elegans  | lepisosteus_oculatus       | pongo_abelii             |
+-------------------------+----------------------------+--------------------------+
| callithrix_jacchus      | loxodonta_africana         | procavia_capensis        |
+-------------------------+----------------------------+--------------------------+
| canis_familiaris        | macaca_mulatta             | pteropus_vampyrus        |
+-------------------------+----------------------------+--------------------------+
| cavia_porcellus         | macropus_eugenii           | rattus_norvegicus        |
+-------------------------+----------------------------+--------------------------+
| chlorocebus_sabaeus     | meleagris_gallopavo        | saccharomyces_cerevisiae |
+-------------------------+----------------------------+--------------------------+
| choloepus_hoffmanni     | microcebus_murinus         | sarcophilus_harrisii     |
+-------------------------+----------------------------+--------------------------+
| ciona_intestinalis      | monodelphis_domestica      | sorex_araneus            |
+-------------------------+----------------------------+--------------------------+
| ciona_savignyi          | mus_musculus               | sus_scrofa               |
+-------------------------+----------------------------+--------------------------+
| danio_rerio             | mus_spretus_spreteij       | taeniopygia_guttata      |
+-------------------------+----------------------------+--------------------------+
| dasypus_novemcinctus    | mustela_putorius_furo      | takifugu_rubripes        |
+-------------------------+----------------------------+--------------------------+
| dipodomys_ordii         | myotis_lucifugus           | tarsius_syrichta         |
+-------------------------+----------------------------+--------------------------+
| drosophila_melanogaster | nomascus_leucogenys        | tetraodon_nigroviridis   |
+-------------------------+----------------------------+--------------------------+
| echinops_telfairi       | ochotona_princeps          | tupaia_belangeri         |
+-------------------------+----------------------------+--------------------------+
| equus_caballus          | oreochromis_niloticus      | tursiops_truncatus       |
+-------------------------+----------------------------+--------------------------+
| erinaceus_europaeus     | ornithorhynchus_anatinus   | vicugna_pacos            |
+-------------------------+----------------------------+--------------------------+
| felis_catus             | oryctolagus_cuniculus      | xenopus_tropicalis       |
+-------------------------+----------------------------+--------------------------+
| ficedula_albicollis     | oryzias_latipes            | xiphophorus_maculatus    |
+-------------------------+----------------------------+--------------------------+
| gadus_morhua            | otolemur_garnettii         |                          |
+-------------------------+----------------------------+--------------------------+


License
=======

This Software is licensed only to the registered organization that
registered to download the Software, and according to the terms of the
End User License Agreement (EULA) that was distributed along with the
Software and this document. The license is not transferrable.


Author and Licenser
===================

PetaGene Ltd
Betjeman House
104 Hills Road
Cambridge
CB2 1LQ
United Kingdom

petalink@petagene.com