Software and Computing Resources
Our Software Contributions
We contribute to several open-source projects, including R and Bioconductor, and core member Henrik Bengtsson is the creator of the Aroma Project and the Futureverse. Being part of these projects helps us keep up to date with the field and get invaluable feedback on our own software and work. Please contact Henrik Bengtsson to discuss software projects.
Here are some of the software tools that we have developed ourselves or contributed to:
affxparser: Software for parsing Affymetrix microarray files.
aroma: Preprocessing of spotted microarray data.
aroma.affymetrix: Analysis of small to extremely large Affymetrix microarray data sets.
aroma.seq: High-throughput sequence (HT-Seq) analysis in the Aroma framework. (Pre-release.)
babel: Ribosome profiling data analysis using the statistical method of the same name.
DNAcopy: Circular Binary Segmentation (CBS) method for aCGH copy number analysis.
EGAN: Exploratory Gene Association Networks.
future: Asynchronous (parallel/distributed) processing in R on single machines, on large compute clusters, and in the cloud.
illuminaio: Software for parsing Illumina microarray files.
matrixStats: Fast and memory-efficient mathematical operations on matrices.
partDSA: Piecewise constant estimation of increasingly complex predictors.
PSCBS: Parent-specific copy number segmentation using CBS.
QDNAseq: Quantitative DNA sequencing for chromosomal aberrations using shallow DNA-Seq.
R.matlab: R-to-MATLAB connectivity and methods for reading and writing MAT files.
R.rsp: Dynamic generation of scientific reports for reproducible research.
sfit: Multidimensional simplex fitting.
Software and Reproducible Research
One of our priorities is to provide scientifically sound and reproducible research results. To achieve this, we use a large number of high-quality computational software tools provided by industry and academia. We use open-source software as much as possible, particularly because it is key to reproducible research.
Large-Scale Computing
The amount of data being collected in genomic research has grown dramatically. Less than a decade ago, Affymetrix SNP array data (~60 MiB/sample) were considered large, and many software tools could handle only 10-20 arrays in multi-sample studies. This was one of the reasons Henrik Bengtsson developed the Aroma Project, which handles tens of thousands of arrays even on systems with limited memory resources. When high-throughput sequencing (HT-Seq) entered the arena, there was a paradigm shift in the amount of data that needed to be processed. Sequencing the DNA of a single human genome at 50-fold coverage produces a ~250 GiB data file of aligned reads. Yes, that is a file ~4000 times larger than what we get with microarray technologies. (This does not mean that we get 4000 times more “information” from HT-Seq data, but that is a different story.) We are now extending the Aroma Project to support HT-Seq analysis as well.
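The "~4000 times" figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the approximate per-sample sizes quoted above (~60 MiB for a SNP array, ~250 GiB for aligned reads), and the names `MIB`, `GIB`, and `size_ratio` are hypothetical helpers, not part of any of our software.

```python
# Binary size units, matching the MiB/GiB notation used in the text.
MIB = 1024 ** 2
GIB = 1024 ** 3


def size_ratio(large_bytes, small_bytes):
    """Return how many times larger `large_bytes` is than `small_bytes`."""
    return large_bytes / small_bytes


# Approximate per-sample sizes quoted in the text (assumed, not measured).
array_bytes = 60 * MIB    # one Affymetrix SNP array sample
htseq_bytes = 250 * GIB   # aligned reads for one genome at 50-fold coverage

ratio = size_ratio(htseq_bytes, array_bytes)
print(f"HT-Seq file is about {ratio:.0f} times larger")
```

With these rounded inputs the ratio comes out slightly above 4000, which is consistent with the order-of-magnitude statement in the text.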
Programming Languages
We are experienced in programming languages such as C, C++, Java, Perl, Python, R, and Ruby, to name a few.