Control-FREEC: Prediction of copy number alterations and loss of heterozygosity using deep-sequencing data

Manual

Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data originally developed by the Bioinformatics Laboratory of Institut Curie (Paris). Nowdays, Control-FREEC is supported by the team of Valentina Boeva at Institut Cochin, Inserm (Paris).

Control-FREEC automatically computes, normalizes, segments copy number and beta allele frequency (BAF) profiles, then calls copy number alterations and LOH. The control (matched normal) sample is optional for whole genome sequencing data but mandatory for whole exome or targeted sequencing data. For whole genome sequencing data analysis, the program can also use mappability data (files created by GEM).

Starting from version v8.0, we provide a possibility to detect subclonal gains and losses and evaluate the likeliest average ploidy of the sample. Also, the procedure for evaluation of tumor purity has been improved.

Starting from Control-FREEC v5.0, the program can be used on exome-sequencing data. Starting from version v8.0, read counts are calculated by exon and not per window (set "window=0").

Starting from Control-FREEC v6.0, the user can use multiple threads to run Control-FREEC. 30x coverage WGS data with a control (i.e., two pileup.gz files) will be fully processed (CNA and LOH info) in one hour using 6 threads.

Input for CNA detection: aligned single-end, paired-end or mate-pair data in SAM, BAM, SAMtools pileup.
Control-FREEC accepts .GZ files. Support of Eland, BED, SOAP, arachne, psl (BLAT) and Bowtie formats has been discontinued starting from version v8.0.
Input for CNA+LOH detection: There are two options: (a) provide aligned reads in SAMtools pileup format. Files can be GZipped; (b) provide BAM files together with options "makePileup" and "fastaFile" (see How to create a config file?)
Output: Regions of gain, loss and LOH, normalized copy number and BAF profiles.

Control-FREEC publications

Control-FREEC: a tool for assessing copy number and allelic content using next generation sequencing data. V. Boeva, T. Popova, K. Bleakley, P. Chiche, I. Janoueix-Lerosey, O. Delattre and E. Barillot. Bioinformatics, 2012, 28(3):423-5. PMID: 22155870.
CNA detection part of Control-FREEC (simply FREEC)
Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. V. Boeva, A. Zinovyev, K. Bleakley, J.-P. Vert, I. Janoueix-Lerosey, O. Delattre and E. Barillot. Bioinformatics, 2011, 27(2):268-9. PMID: 21081509.
LOH detection part of Control-FREEC

Control-FREEC HOWTOs:

Install the Control-FREEC package
Run Control-FREEC on a test data
Create a Config file
Read Control-FREEC's output
Calculate significance of Control-FREEC predictions with an R script (new!)
Visualize Control-FREEC's output
Translate Control-FREEC's output into Bed or Circos formats
Generate a GC-content profile without running Control-FREEC
Run Control-FREEC on something which is not Human:

Contact

Valentina Boeva: valentina.boeva [AT] inserm.fr

Phone: +33.1.44.41.23.89

Distributions:

Control-FREEC is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation (www.fsf.org); either version 2 of the License, or (at your option) any later version.

Test data

Human data (hg18) for HCC1143 and HCC1143-BL (from Chiang et al., 2009) [ download, 146 Mb ]
Human data (hg19) for a tumor chromosome 19 (unpublished, to use to test the LOH detection component of Control-FREEC) [ download, 1334 Mb ]

Installation:

Requirements:
- 10Gb of RAM
- Linux or MACS OS; only older versions of Control-FREEC (up to v6.2) are available for Windows
- R installed
- samtools installed if the input files are in .BAM format
- bedtools installed if you wish to create minipileup files from .BAM
- sambamba installed if you wish to speed up reading of .BAM files
Download the latest release of Control-FREEC: Linux 64-bit version.

[Starting from Control-FREEC v5.7 Windows is no longer supported. But just in case, we compile Control-FREEC v6.2 for you: Control-FREEC v6.2 for Windows 32-bit . Also, you can still download Control-FREEC v5.6 for Windows 32-bit or contact me for support].
Unpack Control-FREEC:
type unzip YOUR_FILE.zip or
tar -zxvf YOUR_FILE.tar.gz
The Win32 and Linux64 archives include binary versions of Control-FREEC. If you are using Linux, type make from the Control-FREEC directory to compile the program (change Makefile by removing 64bit-tags if you are running 32bit version of Linux).
Download mappability tracks if you want to include mappability information:
Do not forget to extract files from the archive! You can also generate a mappability track for other genomes using GEM.
Download files with SNPs (only if you have high coverage data and you want to detect allelic status; then, you must transform read files into pileup format)
- dbSNP151.hg38-commonSNP_minFreq5Perc_with_CHR.vcf.gz
- hg19_dbsnp142_with_CHR_minFreq05.vcf.gz
- hg19_snp131.SingleDiNucl.1based.txt
- hg19_snp137.SingleDiNucl.1based.txt.gz (created by Niklas Malmqvist) Unzip it before the use!!
- hg19_snp142.SingleDiNucl.1based.txt.gz Unzip it before the use!!
- hg18_snp130.SNP.1based.txt
- mm10_snp137 (created by Kim Wong)
Starting from Control-FREEC v9.3, .txt.gz, .vcf and .vcf.gz files are also accepted! For the .txt files with SNPs, please refer to FREEC FAQ Q19 to understand how these files are generated.
Starting from Control-FREEC v10.6, Control-FREEC can work on exome-seq data without a control sample.

Running Control-FREEC on a test data

To run Control-FREEC download and unzip:
- a test dataset for HCC1143 and HCC1143-BL (from Chiang et al., 2009): test.zip (146 M) or
- a test dataset for tumor chromosome 19 (unpublished, use it to test LOH predictions by Control-FREEC): testChr19.zip (1334 M)
Follow the instructions from the README file.

If you are not in the mood to download a test example, you can simply run Control-FREEC with the following command line after compiling (please, check how to download/create a config file below):

PathToFREEC/freec -conf myConfig.txt -sample sample.bam -control control.bam

or

PathToFREEC/freec -conf myConfig.txt (if BAM files are provided directly in the config file)

Creating a Config file

There should be 2-5 groups of parameters in the config file: [general], [sample], [control], [BAF] and [target]. The last three can be empty if there is no control dataset available, if you do not intend to calculate BAF profiles and call genotypes, or if you are not running the program on targeted data (e.g. Whole Exome Sequencing).

[general]: here you describe all general parameters. Most of them are optional; their default value is indicated below.
[sample]: here you provide a path to the sample input file and indicate its type (e.g., bam, sam, pileup).
[control]: here you provide a path to the contol input file and indicate its type. You may have this group empty if you do not have a control dataset.
[BAF]: here you provide paths to files and parameters to calculate B allele frequency (BAF) profiles and detect LOH regions.
[target]: here you can provide a .bed file with coordinates of probes, exons or amplicons when you run Control-FREEC on exome sequencing data or other targeted sequencing data. Set "window=0" in the [general] to use read count "per exon" instead of "per window".

[general] group of parameters:

parameter	description	possible values
BedGraphOutput	set "BedGraphOutput=TRUE" if you want an additional output in BedGraph format for the UCSC genome browser	Default: FALSE
bedtools	path to bedtools (used only to create .pileup files for WES data)	Default: bedtools
breakPointThreshold	positive value of threshold for segmentation of normalized profiles	Default: 0.8 use something like 0.6 to get more segments (and thus more predicted CNVs)
breakPointType	desired behavior in the ambiguous regions (poly-N or low mappability regions between two different copy number values)	0: the "unknown" region is attached to the "known" region on the right 1: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region choosing the longer one 2: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region but the “ploidy” copy number has a priority 3: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region choosing the longer one but this “known” region should make at least half-size of the “unknown” region 4: make a separate fragment of this “unknown” region and do not assign any copy number to this region at all Default: 2
chrFiles	path to the directory with chromosomes fasta files (necessary to calculate a GC-content profile if a control dataset and GC-content profile are not available)	Ex: path/hg19/chromosomes/
chrLenFile	file with chromosome lengths chromosomes that are not in this list won't be considered by Control-FREEC!	Ex: path/hg18.len or path/hg19.len (!) files of type hg19.fa.fai are also accepted starting from v9.3
coefficientOfVariation	coefficient of variation to evaluate necessary window size	Default: 0.05
contamination	a priori known value of tumor sample contamination by normal cells	Ex: contamination=0.25 set "contaminationAdjustment=TRUE" to correct for the contamination Default: contamination=0
contaminationAdjustment	set TRUE to correct for contamination by normal cells. If "contamination" is not provided, it will automatically evaluate the level of contamination	Default: FALSE
degree	degree of polynomial	Default: 3&4 (GC-content based normalization, WGS) or 1 (control-read-count-based normalization, WES)
forceGCcontentNormalization	set to 1 or 2 to correct the Read Count (RC) for GC-content bias and low mappability even when you have a control sample	0: simply model "sample RC ~ Control RC" 1: normalize the sample and the control RC using GC-content and then calculate the ratio "Sample RC/contol RC" 2: model "sample RC ~ Control RC" bias, and then normalize for GC-content Default (WGS): 0 Default (WES): 1 (≥ v9.5) and 0 (< v9.5)
GCcontentProfile	GC-content profile for a given window-size (higher priority than chrFiles) Optional! Necessary only if both a control dataset or chromosome sequences (.fasta for the genome of interest) are not available	Ex: path/GC_profile.cnp
gemMappabilityFile	.gem file with information about mappable positions (GEM output)	Ex: out76.gem
intercept	intercept of polynomial	Default: 1 - with GC-content, 0 - with a control dataset
minCNAlength	minimal number of consecutive windows to call a CNA	Default: 3 for WES and 1 for WGS
minMappabilityPerWindow	only windows with fraction of mappable positions higher than or equal to this threshold will be considered (if "gemMappabilityFile" is not provided, one uses the percentage of non-N letters per window)	Default: 0.85
minExpectedGC	minimal expected value of the GC-content for the prior evaluation of "Read Count ~ GC-content" dependency	Default: 0.35 (change only if you run Control-FREEC on a bacterial genome)
maxExpectedGC	maximal expected value of the GC-content for the prior evaluation of "Read Count ~ GC-content" dependency	Default: 0.55 (change only if you run Control-FREEC on a bacterial genome)
minimalSubclonePresence	detects subclones present in x% of cell population	Default: 100 (meaning "do not look for subclones") Suggested: 20 (or 0.2) for WGS and 30 (or 0.3) for WES
maxThreads	number of threads (multi-threading mode)	Default: 1
noisyData	set TRUE for target resequencing data (e.g., exome-seq) to avoid false positive predictions due to non-uniform capture	Default: FALSE
outputDir	output directory	Default: "./" (should be an existing folder)
ploidy	genome ploidy; In case of doubt, you can set different values and Control-FREEC will select the one that explains most observed CNAs	Ex: ploidy=2 or ploidy=2,3,4
printNA	set FALSE to avoid printing "-1" to the _ratio.txt files Useful for exome-seq or targeted sequencing data	Default: TRUE
readCountThreshold	threshold on the minimal number of reads per window in the control sample Useful for exome-seq or targeted sequencing data	Default: 10 recommended value >=50 for for exome data
sambamba	path to sambamba (used only to read .BAM files)	Ex: "sambamba = /local/sambamba/sambamba"; Default: will use samtools
SambambaThreads	number of threads for Sambamba to use (instead of Samtools)	Defaut: the value of "maxThreads"
samtools	path to samtools (used only to read .BAM files or to create .pileup files when option "makepileup" is set)	Ex: "samtools = /local/SAMtools/samtools"; Default: samtools
sex	sample sex	"sex=XX" will exclude chr Y from the analysis "sex=XY" will not annotate one copy of chr X and Y as a loss.
step	step (used only when "window" is specified); do not use for exome sequencing (instead set "window=0")	Ex: 10000
telocentromeric (!there was a misspeling in all versions up to 6.3)	length of pre-telomeric and pre-centromeric regions: Control-FREEC will not output small CNAs and LOH found within these regions (they are likely to be false because of mappability/genome assembly issues) 50000 is OK for human/mouse genomes. Use smaller values for yeasts and flies.	Default: 50000
uniqueMatch	Use a mappability profile to correct read counts (in this case a mappability file must be provided with "gemMappabilityFile" )	Default: FALSE
window	explicit window size (higher priority than coefficientOfVariation);	Ex: for whole genome sequencing: "window=50000" for whole exome sequencing: "window=0"

The parameters chrLenFile and ploidy are required. Either chrFiles or GCcontentProfile must be specified too if no control dataset is available.

If you provide a path to chromosome files, Control-FREEC will look for the following fasta files in your directory (in this order): 1, 1.fa, 1.fasta, chr1.fa, chr1.fasta; 2, 2.fa, etc. Please ensure that you don't have other files but sequences having the listed names in this directory.

The parameter breakPointThreshold specifies the maximal slope of the slope of residual sum of squares. This should be a positive value. The closer it is to Zero, the more breakpoints will be called. Its recommended value is between 0.1 and 1.2.

If the parameter gemMappabilityFile is not specified, then the fraction of non-N nucleotides per window is used as Mappability. To see the effect of using a mappability profile, compare two resuted copy number profiles for NA18507 (a male Yoruba from Ibadan): without mappability information and with a mappability profile 35bp up to 2 mismatches, hg19.
At this point, gemMappabilityFile can be used only in the mode without control sample.

[sample] and [control] groups of parameters:

parameter	description	possible values
mateFile	file with mapped reads (can be single end reads, mate-pairs or paired-end reads)	Ex: path/sample.bam
mateCopyNumberFile	raw copy number profile for a given window-size (higher priority than mateFile) - OPTIONAL (don't need to provide a mateFile if mateCopyNumberFile is provided unless you want to have BAF profiles)	Ex: path/sample.cnp
miniPileup	miniPileup file created from the corresponding BAM file dring a previous run of Control-FREEC - OPTIONAL providing this file will significantly speed up the whole process	Ex: path/sample_minipileup.pileup
inputFormat	format of reads (in mateFile)	SAM, BAM, pileup, bowtie, eland, arachne, psl (BLAT), BED, Eland. All formated except BAM are also accepted GZipped.
mateOrientation	format of reads (in mateFile)	0 (for single ends), RF (Illumina mate-pairs), FR (Illumina paired-ends), FF (SOLiD mate-pairs)

Either mateFile or mateCopyNumberFile must be specified. If you already have read copy number (.cpn) files for mateCopyNumberFile, you can complement it with minipileup file miniPileup or provide mateFile to use it for the [BAF] option. Minipileups are created by Control-FREEC when you run it with the makePileup option of the [BAF] group of parameters.

The parameters inputFormat and mateOrientation are required if mateFile is specified.

If you specify orientation of your reads ("FR","RF", or "FF") then only reads mapping in the corresponding orientation will be used for calculation of copy number profiles.

[BAF] group of parameters:

parameter	description	possible values
makePileup	path to a BED or VCF file with SNP positions to create a mini pileup file from the initial BAM file provided in `mateFile`; if provided, a BAM file can be used for the calculation of BAF instead of .pileup.gz file	Optional; if not used, a pileup file should be provided as `mateFile`. Such .bed files can be created from hg19_snp142.SingleDiNucl.1based.txt.gz using a command line `gunzip -c hg19_snp142.SingleDiNucl.1based.txt.gz \| awk {'printf ("%s\t%s\t%s\n", $1,$2-1,$2)'} >hg19_snp142.SingleDiNucl.1based.bed`
fastaFile	one fasta file for the whole genome	Optional; need to be provided only when option "makePileup" is used to create a minipileup file from .BAM Ex: path/hg19.fa
minimalCoveragePerPosition	minimal read coverage for a position to be considered in BAF analysis	Default: 0
minimalQualityPerPosition	minimal sequencing quality for a position to be considered in BAF analysis	Default: 0; using this option can slow down reading of .pileup
shiftInQuality	basis for Phred quality	Default: 0; usually 33 or 64; see fastq quality
SNPfile	file with known SNPs	Ex: path/hg19_snp142.SingleDiNucl.1based.txt.gz .vcf or .vcf.gz files are also accepted in >v9.3

[target] group of parameters:

parameter	description	possible values
captureRegions	file with capture regions in .BED format	sorted .BED file should contain the following columns: chr 0-based start 1-based end

See an example of a config file for whole genome sequencing here and for whole exome sequencing here. Other examples are provided in test.zip and testChr19.zip.

Control-FREEC's output

_info.txt: parsable file with information about FREEC run (available from v10.3). Information included in the file:

Program_Version
Sample_Name
Control_Used - True if Control-FREEC used a matching control file for data analysis
CGcontent_Used - True if Control-FREEC used GC-content to normalize read counts
Mappability_Used - True if Control-FREEC used read mappability data to exclude low mappability regions
Looking_For_Subclones - True if Control-FREEC was looking for subclonal CNAs
Breakpoint_Threshold
Window - 0 if Control-FREEC did not use the window approach but counted reads per exon/amplicon
Number_Of_Reads|Pairs_In_Sample
Number_Of_Reads|Pairs_In_Control
Output_Ploidy
Sample_Purity - 1 if Control-FREEC did not find contamination or was nor looking for it
Good_Polynomial_Fit - True if all EM converged

_CNVs: file with coordinates of predicted copy number alterations. Information for each region:

chromosome
start position
end position
predicted copy number
type of alteration
genotype (if [BAF] option is used)
* percentage of uncertainty of the predicted genotype (max=100). (!) It is **not** the uncertainty of a CNA
*&** status (somatic or germline)
*&** percentage of germline CNA/LOH

_ratio.txt: file with ratios and predicted copy number alterations for each window. Information for each window:

chromosome
start position
ratio
median ratio for the whole fragment resulted from segmentation
predicted copy number
* median(abs(BAF-0.5)) per window
* estimated BAF
* genotype
* percentage of uncertainty of the predicted genotype, max=100
***** copy number in the subclone
***** subclone prevalence

*_BAF.txt: file B-allele frequencies for each possibly heterozygous SNP position. Information for each window:

chromosome
start position
BAF
fitted frequency of A-allele
fitted frequency of B-allele
inferred frequency of A-allele
inferred frequency of B-allele
percentage of uncertainty of the predicted genotype, max=100.

_sample.cnp and _control.cnp files: files with raw copy number profiles. Information for each window:

chromosome
start position
number of read starts

GC_profile.cnp: file with GC-content profile. Information for each window:

chromosome
start position
GC-content
percentage of ACGT-letter per window (1-poly(N)%)
*** percentage of uniquely mappable positions per window

****_ratio.BedGraph: file with ratios in BedGraph format for visualization in the UCSC genome browser. The file contains tracks for normal copy number, gains and losses, and copy neutral LOH (*). Information for each window:

chromosome
start position
end position
ratio*ploidy

* if [BAF] option is used
** if [control] option is used
*** if gemMappabilityFile is used
**** if [general] BedGraphOutput=TRUE (see config file options)
***** if minimalSubclonePresence is used

Calculate significance of Control-FREEC predictions with an R script

One can add the p-values to predicted CNVs by running assess_significance.R:

cat assess_significance.R | R --slave --args < *_CNVs > < *_ratio.txt >

R and package rtracklayer should be installed!

Running this script will add both Wilcoxon test and Kolmogorov-Smirnov test p-values to each line of _CNVs file.

It will also add headers to the columns of _CNVs.

Visualize Control-FREEC's output

If you work with Exome-seq data and you did not set printNA=FALSE, delete data points that are not in targeted regions from *_ratio.txt:

awk '$3!=-1 {print}' $outdir/$sample.pileup.gz_normal_ratio.txt > $outdir/$sample.pileup.gz_normal_ratio_noNA.txt

				awk '$3!=-1 {print}' $outdir/$sample.pileup.gz_ratio.txt > $outdir/$sample.pileup.gz_ratio_noNA.txt

then use *_ratio_noNA.txt files instead of *_ratio.txt in the example below.

You can visalize normalized copy number profile with predicted CNAs as well as BAF profile by running makeGraph2.0.R:

cat makeGraph2.0.R | R --slave --args <*_ratio.txt> [<*_BAF.txt>]

R should be installed!

Example of visualization:

Translate Control-FREEC's output into Bed or Circos formats

To translate an output of Control-FREEC to a .BED file (so to view it in the UCSC Genome Browser) or to the Circos format, use these scripts: freec2bed.pl and freec2circos.pl

Generate a GC-content profile without running Control-FREEC

If for some reason you want to separately generate a GC-content profile (GC_profile.cpn) and run Control-FREEC on your read file, you can use gccount (for Linux, 64bit) to generate "GC_profile.cpn". You can use the same config file as for the main application. No need to delete or comment anything. However, window size (option "window") is mandatory. "step" and "gemMappabilityFile" are optional.

Run FREEC on a something which is not Human

The crucial thing to successfully run FREEC on your data is to

Create a file with chromosome lengths (it can be, for example, mm9.fa.fai)
Change the "minExpectedGC" and "maxExpectedGC" parameters to fit your data
Change the value of the "telocentromeric" parameter
Change chromosome names in "makeGraph2.0.R"

Create a file with chromosome lengths:

If you do not know chromosome lengths for your genome of interest, you can calculate them using this script: get_fasta_lengths.pl

To run this script:

if you have all your chromosomes in one multifasta file "MyGenome.fa", then type
perl get_fasta_lengths.pl MyGenome.fa
if you have all the chromosomes in separate fasta files in the directory "MyGenome", then type
cat MyGenome/*.fa* > MyGenome.fa perl get_fasta_lengths.pl MyGenome.fa rm MyGenome.fa

Change the "minExpectedGC" and "maxExpectedGC" paramters:

For the Drosophila genome set:

minExpectedGC = 0.3 and maxExpectedGC = 0.45.

Also it seems to be necessary to remove all the "Het" chromosomes from the analysis by removing them from the chromosome lengths file (thanks to Joel McManus).

Change the value of the "telocentromeric" parameter:

telocentromeric=50000 is OK for human/mouse genomes. Use smaller values for yeasts and flies.

Do not set anything for exome-seq data.

Change chromosome names in "makeGraph2.0.R":

In the case your genome is not a human genome and you do not have chromosome names 1:22 and X and Y, open "makeGraph2.0.R" with a text editor and change chromosome names. Chromosome names in the "makeGraph2.0.R" script should correspond to the chromosome names in the .ratio.txt file.