Comment on GIGGLE, a search engine for large-scale integrated genome analysis
I just devoured the article presenting GIGGLE, a search engine for large-scale integrated genome analysis [1]. It has been the most exciting reading so far in 2018 and I bet the GIGGLE tool will be widely used by the end of the year. My arguments follow below (the order does not denote priority).
First, the development of GIGGLE is lead by two must-follow researchers in genomics: Ryan Layer and Aaron Quinlan. Aaron Quinlan is the developer of BEDTOOLS [2]. Ryan Layer has also contributed to BEDTOOLS and gave a fantastic talk at the Genome Informatics meeting last fall.
Most of my excitement about GIGGLE is because it tackles one of the most common tasks in bioinformatics: to identify shared genomic coordinates from two or more lists of genomic regions (e.g. genes, repeats). Common as it is, this procedure suffers from the increasing number of datasets that one may want to compare. As the authors write in the paper [1] “While existing software such as BEDTOOLS [2] and TABIX [3] identify regions that are common to genome interval files, these methods were designed to investigate a limited number of files” (the bold is mine). The authors of GIGGLE address this issue and are therefore able to compare thousands of datasets, and do it very fast (Table 1).
Table 1. Speed comparison with existing tools. Figures extracted from the GIGGLE article [1].
Comparison | # query intervals (in millions) | # reference intervals (in millions) | Speed (relative to TABIX) | Speed (relative to BEDTOOLS) |
---|---|---|---|---|
#1 | 1 | 55 | 2,336x | 345x |
#2 | 1 | 6,900 | 25x | 8x |
My interest in GIGGLE grew even more when I saw that the authors also dealt with a key aspect associated with comparing the overlap between two or more lists of genomic features. Imagine you have a list of 100 genes and 100 repeats, and that the genomic coordinates of 75 of them overlap. Is this similarity between the two datasets relevant? A common approach to answer this question is randomization. The genomic coordinates of the lists are shuffled and the degree of overlap is measured. Doing this thousands of times generates a null distribution of the degree of overlap expected by chance, which can be then compared against the observed number of intersections. An empirical p-value can be calculated as the proportion of randomized datasets which show a degree of overlap higher than the true dataset. However, this approach does not scale well with the number of datasets. Instead, GIGGLE uses a Fisher’s Exact two-tailed test, which is extremely quick to compute and gives metrics (p-value and odds ratio) that are well correlated with those obtained with the randomization approach.
The randomization approach has another caveat. Due to the nature of the datasets, the degree of overlap between two datasets may be marginal and yet the randomization test produce significant results. On the other side, some interesting comparisons may not reach the significance threshold. To make it even better, GIGGLE goes a step further and scores the overlap between a pair of genomic features based on a combination of both the p-value and the odds ratio derived from the Fisher’s Exact two-tailed test.
As I said, reading the article was a pleasure. The text is neat and the figures are very illustrative. In addition, in the online version there is a side panel in the right which displays the elements of the article (e.g. references, figures) as they are clicked in the main central text. I do not know for how long Nature Methods has been doing it and whether other journals pioneered this display but it incredibly enhances the reading.
Nice article but… what happens when one tries to use the tool? As a honorable successor of BEDTOOLS, GIGGLE is well documented in a GitHub repo as well as easy to install and use via the command line in a Linux system.
“Command line? Linux system? Not for me”, some may think. Fortunately, the authors also developed a web interface for users to explore their own datasets. I think this broadens the range of people that may use GIGGLE as it empowers wet-lab researchers, which tend to feel less comfortable with command-line tools, to perform basic analyses. Moreover, building interfaces for standard analyses is a way to free bioinformaticians to focus on the most technical parts of the analysis [4] (I apologize for the self-citation). As feedback, I think the link to the interactive tool might have been located in the article rather than buried in the GitHub repo.
To my wish list I would add the possibility to build the index on the fly. Right now, GIGGLE works in 2 steps: the index and the search itself. So imagine that you want to measure the overlap of a list of 100 genes with each of 50 lists of binding regions. First, one needs to create the index of the 50 datasets and then search for the overlap between genes and binding regions. This implies that the index needs to be pre-computed before the search is performed. What if we want to repeat the overlap with 1 additional set of binding regions? What if we need to perform several comparisons each with varying sets of target regions? Of course, building the index on the fly will increase the overall running time but I think that the benefit will be greater for datasets of moderate size. Even for large datasets (in the article, indexing one of the datasets requires about 4.5 hours), the penalty in the overall running time of building the index on the fly could be alleviated by using multiple CPUs –which also goes to my wish list.
Bibliography
[1] Layer RM, Pedersen BS, DiSera T, Marth GT, Gertz J, Quinlan AR. GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods. 2018. doi:10.1038/nmeth.4556.
[2] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2. doi:10.1093/bioinformatics/btq033.
[3] Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27:718–9. doi:10.1093/bioinformatics/btq671.
[4] Quilez J, Vidal E, Dily FL, Serra F, Cuartero Y, Stadhouders R, et al. Parallel sequencing lives, or what makes large sequencing projects successful. Gigascience. 2017;6:1–6. doi:10.1093/gigascience/gix100.