Control-FREEC: Prediction of copy number alterations and loss of heterozygosity using deep-sequencing data


Manual


Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data originally developed by the Bioinformatics Laboratory of Institut Curie (Paris). Nowdays, Control-FREEC is supported by the team of Valentina Boeva at Institut Cochin, Inserm (Paris).

Control-FREEC automatically computes, normalizes, segments copy number and beta allele frequency (BAF) profiles, then calls copy number alterations and LOH. The control (matched normal) sample is optional for whole genome sequencing data but mandatory for whole exome or targeted sequencing data. For whole genome sequencing data analysis, the program can also use mappability data (files created by GEM).

Starting from version v8.0, we provide a possibility to detect subclonal gains and losses and evaluate the likeliest average ploidy of the sample. Also, the procedure for evaluation of tumor purity has been improved.

Starting from Control-FREEC v5.0, the program can be used on exome-sequencing data. Starting from version v8.0, read counts are calculated by exon and not per window (set "window=0").

Starting from Control-FREEC v6.0, the user can use multiple threads to run Control-FREEC. 30x coverage WGS data with a control (i.e., two pileup.gz files) will be fully processed (CNA and LOH info) in one hour using 6 threads.


Input for CNA detection: aligned single-end, paired-end or mate-pair data in SAM, BAM, SAMtools pileup.
Control-FREEC accepts .GZ files. Support of Eland, BED, SOAP, arachne, psl (BLAT) and Bowtie formats has been discontinued starting from version v8.0.
Input for CNA+LOH detection: There are two options: (a) provide aligned reads in SAMtools pileup format. Files can be GZipped; (b) provide BAM files together with options "makePileup" and "fastaFile" (see How to create a config file?)
Output: Regions of gain, loss and LOH, normalized copy number and BAF profiles.

Control-FREEC publications

  • Control-FREEC: a tool for assessing copy number and allelic content using next generation sequencing data. V. Boeva, T. Popova, K. Bleakley, P. Chiche, I. Janoueix-Lerosey, O. Delattre and E. Barillot. Bioinformatics, 2012, 28(3):423-5. PMID: 22155870.

    CNA detection part of Control-FREEC (simply FREEC)

  • Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. V. Boeva, A. Zinovyev, K. Bleakley, J.-P. Vert, I. Janoueix-Lerosey, O. Delattre and E. Barillot. Bioinformatics, 2011, 27(2):268-9. PMID: 21081509.
    LOH detection part of Control-FREEC

Control-FREEC HOWTOs:

  1. Install the Control-FREEC package

  2. Run Control-FREEC on a test data

  3. Create a Config file

  4. Read Control-FREEC's output

  5. Calculate significance of Control-FREEC predictions with an R script (new!)

  6. Visualize Control-FREEC's output

  7. Translate Control-FREEC's output into Bed or Circos formats

  8. Generate a GC-content profile without running Control-FREEC

  9. Run Control-FREEC on something which is not Human:

Contact

Valentina Boeva: valentina.boeva [AT] inserm.fr

Phone: +33.1.44.41.23.89

Distributions:

Control-FREEC is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation (www.fsf.org); either version 2 of the License, or (at your option) any later version.

 Test data

  • Human data (hg18) for HCC1143 and HCC1143-BL (from Chiang et al., 2009) [ download, 146 Mb ]

  • Human data (hg19) for a tumor chromosome 19 (unpublished, to use to test the LOH detection component of Control-FREEC) [ download, 1334 Mb ]

Installation:

  1. Requirements:
    • 10Gb of RAM
    • Linux or MACS OS; only older versions of Control-FREEC (up to v6.2) are available for Windows
    • R installed
    • samtools installed if the input files are in .BAM format
    • bedtools installed if you wish to create minipileup files from .BAM
    • sambamba installed if you wish to speed up reading of .BAM files
  2. Download the latest release of Control-FREEC: Linux 64-bit version.

    [Starting from Control-FREEC v5.7 Windows is no longer supported. But just in case, we compile Control-FREEC v6.2 for you: Control-FREEC v6.2 for Windows 32-bit . Also, you can still download Control-FREEC v5.6 for Windows 32-bit or contact me for support].

  3. Unpack Control-FREEC:

    type unzip YOUR_FILE.zip    or

    tar -zxvf YOUR_FILE.tar.gz

  4. The Win32 and Linux64 archives include binary versions of Control-FREEC. If you are using Linux, type make from the Control-FREEC directory to compile the program (change Makefile by removing 64bit-tags if you are running 32bit version of Linux).
  5. Download mappability tracks if you want to include mappability information:
    Do not forget to extract files from the archive! You can also generate a mappability track for other genomes using GEM.
  6. Download files with SNPs (only if you have high coverage data and you want to detect allelic status; then, you must transform read files into pileup format)
    Starting from Control-FREEC v9.3, .txt.gz, .vcf and .vcf.gz files are also accepted! For the .txt files with SNPs, please refer to FREEC FAQ Q19 to understand how these files are generated.
    Starting from Control-FREEC v10.6, Control-FREEC can work on exome-seq data without a control sample.

Running Control-FREEC on a test data

  1. To run Control-FREEC download and unzip:
    • a test dataset for HCC1143 and HCC1143-BL (from Chiang et al., 2009): test.zip (146 M) or
    • a test dataset for tumor chromosome 19 (unpublished, use it to test LOH predictions by Control-FREEC): testChr19.zip (1334 M)
  2. Follow the instructions from the README file.

Creating a Config file

There should be 2-5 groups of parameters in the config file: [general], [sample], [control], [BAF] and [target]. The last three can be empty if there is no control dataset available, if you do not intend to calculate BAF profiles and call genotypes, or if you are not running the program on targeted data (e.g. Whole Exome Sequencing).
  • [general]: here you describe all general parameters. Most of them are optional; their default value is indicated below.
  • [sample]: here you provide a path to the sample input file and indicate its type (e.g., bam, sam, pileup).
  • [control]: here you provide a path to the contol input file and indicate its type. You may have this group empty if you do not have a control dataset.
  • [BAF]: here you provide paths to files and parameters to calculate B allele frequency (BAF) profiles and detect LOH regions.
  • [target]: here you can provide a .bed file with coordinates of probes, exons or amplicons when you run Control-FREEC on exome sequencing data or other targeted sequencing data. Set "window=0" in the [general] to use read count "per exon" instead of "per window".

[general] group of parameters:

parameterdescriptionpossible values
BedGraphOutputset "BedGraphOutput=TRUE" if you want an additional output in BedGraph format for the UCSC genome browserDefault: FALSE
bedtoolspath to bedtools (used only to create .pileup files for WES data)Default: bedtools
breakPointThresholdpositive value of threshold for segmentation of normalized profilesDefault: 0.8
use something like 0.6 to get more segments (and thus more predicted CNVs)
breakPointTypedesired behavior in the ambiguous regions (poly-N or low mappability regions between two different copy number values) 0: the "unknown" region is attached to the "known" region on the right
1: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region choosing the longer one
2: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region but the “ploidy” copy number has a priority
3: make a separate fragment of this “unknown” region and then attaches it to the left or to the right region choosing the longer one but this “known” region should make at least half-size of the “unknown” region
4: make a separate fragment of this “unknown” region and do not assign any copy number to this region at all
Default: 2
chrFilespath to the directory with chromosomes fasta files
(necessary to calculate a GC-content profile if a control dataset and GC-content profile are not available)
Ex: path/hg19/chromosomes/
chrLenFilefile with chromosome lengths
chromosomes that are not in this list won't be considered by Control-FREEC!
Ex: path/hg18.len or path/hg19.len
(!) files of type hg19.fa.fai are also accepted starting from v9.3
coefficientOfVariationcoefficient of variation to evaluate necessary window sizeDefault: 0.05
contaminationa priori known value of tumor sample contamination by normal cells Ex: contamination=0.25
set "contaminationAdjustment=TRUE" to correct for the contamination
Default: contamination=0
contaminationAdjustment
set TRUE to correct for contamination by normal cells.
If "contamination" is not provided, it will automatically evaluate the level of contamination
Default: FALSE
degreedegree of polynomialDefault: 3&4 (GC-content based normalization, WGS) or 1 (control-read-count-based normalization, WES)
forceGCcontentNormalizationset to 1 or 2 to correct the Read Count (RC) for GC-content bias and low mappability even when you have a control sample0: simply model "sample RC ~ Control RC"
1: normalize the sample and the control RC using GC-content and then calculate the ratio "Sample RC/contol RC"
2: model "sample RC ~ Control RC" bias, and then normalize for GC-content

Default (WGS): 0
Default (WES): 1 (≥ v9.5) and 0 (< v9.5)
GCcontentProfileGC-content profile for a given window-size (higher priority than chrFiles)
Optional! Necessary only if both a control dataset or chromosome sequences (.fasta for the genome of interest) are not available
Ex: path/GC_profile.cnp
gemMappabilityFile.gem file with information about mappable positions (GEM output)Ex: out76.gem
interceptintercept of polynomialDefault: 1 - with GC-content,
0 - with a control dataset
minCNAlengthminimal number of consecutive windows to call a CNADefault: 3 for WES and 1 for WGS
minMappabilityPerWindowonly windows with fraction of mappable positions higher than or equal to this threshold will be considered
(if "gemMappabilityFile" is not provided, one uses the percentage of non-N letters per window)
Default: 0.85
minExpectedGCminimal expected value of the GC-content for the prior evaluation of "Read Count ~ GC-content" dependency Default: 0.35 (change only if you run Control-FREEC on a bacterial genome)
maxExpectedGCmaximal expected value of the GC-content for the prior evaluation of "Read Count ~ GC-content" dependency Default: 0.55 (change only if you run Control-FREEC on a bacterial genome)
minimalSubclonePresencedetects subclones present in x% of cell populationDefault: 100 (meaning "do not look for subclones")
Suggested: 20 (or 0.2) for WGS and 30 (or 0.3) for WES
maxThreadsnumber of threads (multi-threading mode)Default: 1
noisyDataset TRUE for target resequencing data (e.g., exome-seq) to avoid false positive predictions due to non-uniform captureDefault: FALSE
outputDiroutput directoryDefault: "./" (should be an existing folder)
ploidygenome ploidy; In case of doubt, you can set different values and Control-FREEC will select the one that explains most observed CNAsEx: ploidy=2 or ploidy=2,3,4
printNAset FALSE to avoid printing "-1" to the _ratio.txt files
Useful for exome-seq or targeted sequencing data
Default: TRUE
readCountThresholdthreshold on the minimal number of reads per window in the control sample
Useful for exome-seq or targeted sequencing data
Default: 10
recommended value >=50 for for exome data
sambambapath to sambamba (used only to read .BAM files)Ex: "sambamba = /local/sambamba/sambamba"; Default: will use samtools
SambambaThreadsnumber of threads for Sambamba to use (instead of Samtools)Defaut: the value of "maxThreads"
samtoolspath to samtools (used only to read .BAM files or to create .pileup files when option "makepileup" is set)Ex: "samtools = /local/SAMtools/samtools"; Default: samtools
sexsample sex"sex=XX" will exclude chr Y from the analysis
"sex=XY" will not annotate one copy of chr X and Y as a loss.
stepstep (used only when "window" is specified); do not use for exome sequencing (instead set "window=0")Ex: 10000
telocentromeric
(!there was a misspeling in all versions up to 6.3)
length of pre-telomeric and pre-centromeric regions: Control-FREEC will not output small CNAs and LOH found within these regions (they are likely to be false because of mappability/genome assembly issues)
50000 is OK for human/mouse genomes. Use smaller values for yeasts and flies.
Default: 50000
uniqueMatchUse a mappability profile to correct read counts (in this case a mappability file must be provided with "gemMappabilityFile" )Default: FALSE
windowexplicit window size (higher priority than coefficientOfVariation); Ex:
for whole genome sequencing: "window=50000"
for whole exome sequencing: "window=0"


The parameters chrLenFile and ploidy are required. Either chrFiles or GCcontentProfile must be specified too if no control dataset is available.


If you provide a path to chromosome files, Control-FREEC will look for the following fasta files in your directory (in this order): 1, 1.fa, 1.fasta, chr1.fa, chr1.fasta; 2, 2.fa, etc. Please ensure that you don't have other files but sequences having the listed names in this directory.


The parameter breakPointThreshold specifies the maximal slope of the slope of residual sum of squares. This should be a positive value. The closer it is to Zero, the more breakpoints will be called. Its recommended value is between 0.1 and 1.2.


If the parameter gemMappabilityFile is not specified, then the fraction of non-N nucleotides per window is used as Mappability. To see the effect of using a mappability profile, compare two resuted copy number profiles for NA18507 (a male Yoruba from Ibadan): without mappability information and with a mappability profile 35bp up to 2 mismatches, hg19.
At this point, gemMappabilityFile can be used only in the mode without control sample.


[sample] and [control] groups of parameters:

parameterdescriptionpossible values
mateFilefile with mapped reads (can be single end reads, mate-pairs or paired-end reads)Ex: path/sample.bam
mateCopyNumberFileraw copy number profile for a given window-size (higher priority than mateFile) - OPTIONAL
(don't need to provide a mateFile if mateCopyNumberFile is provided unless you want to have BAF profiles)
Ex: path/sample.cnp
miniPileupminiPileup file created from the corresponding BAM file dring a previous run of Control-FREEC - OPTIONAL
providing this file will significantly speed up the whole process
Ex: path/sample_minipileup.pileup
inputFormatformat of reads (in mateFile)SAM, BAM, pileup, bowtie, eland, arachne, psl (BLAT), BED, Eland. All formated except BAM are also accepted GZipped.
mateOrientationformat of reads (in mateFile)0 (for single ends), RF (Illumina mate-pairs),
FR (Illumina paired-ends), FF (SOLiD mate-pairs)


Either mateFile or mateCopyNumberFile must be specified. If you already have read copy number (.cpn) files for mateCopyNumberFile, you can complement it with minipileup file miniPileup or provide mateFile to use it for the [BAF] option. Minipileups are created by Control-FREEC when you run it with the makePileup option of the [BAF] group of parameters.

The parameters inputFormat and mateOrientation are required if mateFile is specified.

If you specify orientation of your reads ("FR","RF", or "FF") then only reads mapping in the corresponding orientation will be used for calculation of copy number profiles.

[BAF] group of parameters:

parameterdescriptionpossible values
makePileuppath to a BED or VCF file with SNP positions to create a mini pileup file from the initial BAM file provided in mateFile;
if provided, a BAM file can be used for the calculation of BAF instead of .pileup.gz file
Optional;
if not used, a pileup file should be provided as mateFile.
Such .bed files can be created from hg19_snp142.SingleDiNucl.1based.txt.gz using a command line
gunzip -c hg19_snp142.SingleDiNucl.1based.txt.gz | awk {'printf ("%s\t%s\t%s\n", $1,$2-1,$2)'} >hg19_snp142.SingleDiNucl.1based.bed
fastaFileone fasta file for the whole genomeOptional;
need to be provided only when option "makePileup" is used to create a minipileup file from .BAM
Ex: path/hg19.fa
minimalCoveragePerPositionminimal read coverage for a position to be considered in BAF analysisDefault: 0
minimalQualityPerPositionminimal sequencing quality for a position to be considered in BAF analysisDefault: 0; using this option can slow down reading of .pileup
shiftInQualitybasis for Phred qualityDefault: 0; usually 33 or 64; see fastq quality
SNPfilefile with known SNPsEx: path/hg19_snp142.SingleDiNucl.1based.txt.gz
.vcf or .vcf.gz files are also accepted in >v9.3


[target] group of parameters:

parameterdescriptionpossible values
captureRegionsfile with capture regions in .BED formatsorted .BED file should contain the following columns:
chr   0-based start   1-based end



See an example of a config file for whole genome sequencing here and for whole exome sequencing here. Other examples are provided in test.zip and testChr19.zip.

Control-FREEC's output


_info.txt: parsable file with information about FREEC run (available from v10.3). Information included in the file:
  • Program_Version
  • Sample_Name
  • Control_Used - True if Control-FREEC used a matching control file for data analysis
  • CGcontent_Used - True if Control-FREEC used GC-content to normalize read counts
  • Mappability_Used - True if Control-FREEC used read mappability data to exclude low mappability regions
  • Looking_For_Subclones - True if Control-FREEC was looking for subclonal CNAs
  • Breakpoint_Threshold
  • Window - 0 if Control-FREEC did not use the window approach but counted reads per exon/amplicon
  • Number_Of_Reads|Pairs_In_Sample
  • Number_Of_Reads|Pairs_In_Control
  • Output_Ploidy
  • Sample_Purity - 1 if Control-FREEC did not find contamination or was nor looking for it
  • Good_Polynomial_Fit - True if all EM converged

_CNVs: file with coordinates of predicted copy number alterations. Information for each region:
  • chromosome
  • start position
  • end position
  • predicted copy number
  • type of alteration
  • genotype (if [BAF] option is used)
  • * percentage of uncertainty of the predicted genotype (max=100). (!) It is **not** the uncertainty of a CNA
  • *&** status (somatic or germline)
  • *&** percentage of germline CNA/LOH

_ratio.txt: file with ratios and predicted copy number alterations for each window. Information for each window:
  • chromosome
  • start position
  • ratio
  • median ratio for the whole fragment resulted from segmentation
  • predicted copy number
  • * median(abs(BAF-0.5)) per window
  • * estimated BAF
  • * genotype
  • * percentage of uncertainty of the predicted genotype, max=100
  • ***** copy number in the subclone
  • ***** subclone prevalence

*_BAF.txt: file B-allele frequencies for each possibly heterozygous SNP position. Information for each window:
  • chromosome
  • start position
  • BAF
  • fitted frequency of A-allele
  • fitted frequency of B-allele
  • inferred frequency of A-allele
  • inferred frequency of B-allele
  • percentage of uncertainty of the predicted genotype, max=100.

_sample.cnp and _control.cnp files: files with raw copy number profiles. Information for each window:
  • chromosome
  • start position
  • number of read starts

GC_profile.cnp: file with GC-content profile. Information for each window:
  • chromosome
  • start position
  • GC-content
  • percentage of ACGT-letter per window (1-poly(N)%)
  • *** percentage of uniquely mappable positions per window

****_ratio.BedGraph: file with ratios in BedGraph format for visualization in the UCSC genome browser. The file contains tracks for normal copy number, gains and losses, and copy neutral LOH (*). Information for each window:
  • chromosome
  • start position
  • end position
  • ratio*ploidy

*   if [BAF] option is used
**  if [control] option is used
*** if gemMappabilityFile is used
**** if [general] BedGraphOutput=TRUE (see config file options)
***** if minimalSubclonePresence is used

Calculate significance of Control-FREEC predictions with an R script

One can add the p-values to predicted CNVs by running assess_significance.R:

cat assess_significance.R | R --slave --args < *_CNVs > < *_ratio.txt >

R and package rtracklayer should be installed!

Running this script will add both Wilcoxon test and Kolmogorov-Smirnov test p-values to each line of _CNVs file.

It will also add headers to the columns of _CNVs.

Visualize Control-FREEC's output

If you work with Exome-seq data and you did not set printNA=FALSE, delete data points that are not in targeted regions from *_ratio.txt:
awk '$3!=-1 {print}' $outdir/$sample.pileup.gz_normal_ratio.txt > $outdir/$sample.pileup.gz_normal_ratio_noNA.txt
awk '$3!=-1 {print}' $outdir/$sample.pileup.gz_ratio.txt > $outdir/$sample.pileup.gz_ratio_noNA.txt
then use *_ratio_noNA.txt files instead of *_ratio.txt in the example below.

You can visalize normalized copy number profile with predicted CNAs as well as BAF profile by running makeGraph.R:

cat makeGraph.R | R --slave --args < ploidy > < *_ratio.txt > [< *_BAF.txt >]

R should be installed!

Example of visualization:

Translate Control-FREEC's output into Bed or Circos formats

To translate an output of Control-FREEC to a .BED file (so to view it in the UCSC Genome Browser) or to the Circos format, use these scripts: freec2bed.pl and freec2circos.pl

Generate a GC-content profile without running Control-FREEC

If for some reason you want to separately generate a GC-content profile (GC_profile.cpn) and run Control-FREEC on your read file, you can use gccount (for Linux, 64bit) to generate "GC_profile.cpn". You can use the same config file as for the main application. No need to delete or comment anything. However, window size (option "window") is mandatory. "step" and "gemMappabilityFile" are optional.

Run FREEC on a something which is not Human

The crucial thing to successfully run FREEC on your data is to

  • Create a file with chromosome lengths (it can be, for example, mm9.fa.fai)
  • Change the "minExpectedGC" and "maxExpectedGC" parameters to fit your data
  • Change the value of the "telocentromeric" parameter
  • Change chromosome names in "makeGraph.R"

Create a file with chromosome lengths:

If you do not know chromosome lengths for your genome of interest, you can calculate them using this script: get_fasta_lengths.pl

To run this script:

  • if you have all your chromosomes in one multifasta file "MyGenome.fa", then type
    perl get_fasta_lengths.pl MyGenome.fa
  • if you have all the chromosomes in separate fasta files in the directory "MyGenome", then type
    cat MyGenome/*.fa* > MyGenome.fa
    perl get_fasta_lengths.pl MyGenome.fa
    rm MyGenome.fa

Change the "minExpectedGC" and "maxExpectedGC" paramters:

For the Drosophila genome set:

        minExpectedGC = 0.3 and maxExpectedGC = 0.45.

Also it seems to be necessary to remove all the "Het" chromosomes from the analysis by removing them from the chromosome lengths file (thanks to Joel McManus).

Change the value of the "telocentromeric" parameter:

telocentromeric=50000 is OK for human/mouse genomes. Use smaller values for yeasts and flies.

Do not set anything for exome-seq data.

Change chromosome names in "makeGraph.R":

In the case your genome is not a human genome and you do not have chromosome names 1:22 and X and Y, open "makeGraph.R" with a text editor and change chromosome names. Chromosome names in the "makeGraph.R" script should correspond to the chromosome names in the .ratio.txt file.