SherlockLung-EvolutionaryTrajectory-Analysis / Ordering_Model / README.txt
######################
# SETUP REQUIREMENTS #
######################

The main script to run this pipeline is BASH/pipeline.sh

pipeline.sh contains a number of 'qsub' commands, written for a Sun Grid Engine (SGE) cluster scheduler or similar. If you are using a different scheduler (e.g. Slurm, PBS), you will probably need to rewrite these commands in your scheduler's syntax. Note that, as well as replacing "qsub" with the appropriate command, you will need to translate all scheduler options accordingly (e.g. changing the SGE option -hold_jid to --dependency=afterok:job_id if using Slurm).
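As a sketch of that translation (job and script names here are illustrative, not taken from pipeline.sh itself):

```shell
# SGE: hold step2 until the job named "step1" has completed.
qsub -N step1 step1.sh
qsub -N step2 -hold_jid step1 step2.sh

# Slurm equivalent: capture the numeric job ID of the first submission
# and express the dependency explicitly on the second.
jid1=$(sbatch --parsable --job-name=step1 step1.sh)
sbatch --job-name=step2 --dependency=afterok:"${jid1}" step2.sh
```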

All other .sh files within the BASH directory begin with two lines that will almost certainly also need to be changed.

1st line:
#!/bin/bash --login
This shebang tells the system which interpreter to run the script with: here, bash as a login shell, so that environment modules can be loaded. You may need to replace or remove this line, as appropriate for your scheduler.

2nd line:
module load apps/gcc/R/4.1.2
This line loads the R module. You will need to replace this line with the appropriate command to load an R module in your own system.
Theoretically, any R version from 4.0 upward should work (and quite possibly many earlier versions, though this has not been tested), as long as all required packages are installed and available to that version.
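The exact incantation is site-specific; a couple of illustrative equivalents (the module and environment names here are assumptions, so check e.g. 'module avail R' on your own system):

```shell
# Environment Modules / Lmod; the module name varies between sites.
module load R/4.1.2

# Or, if R is managed through conda on your system:
# conda activate r-4.1.2
```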

The following R packages must be installed on the system and available to the user running the pipeline:

dplyr
GenomicRanges
ggplot2
data.table
cowplot
copynumber
reshape2
tidyr
PlackettLuce
PLMIX (required only if you wish to perform de novo discovery of subgroups)

Some further packages are required for the generation of plots:

magrittr
gtools
biomaRt

All of the above packages can be installed by running the following command (or equivalent for your cluster):

qsub /path/to/Ordering_Model/BASH/install_R_packages.sh /path/to/Ordering_Model/R/

An additional change which may be necessary is in TM_1_collate_all_subclone_files_CW.R. This contains the following line, at the top of the script:
file_pattern="subclones.txt"
This may need to be changed if your Battenberg outputs use a different suffix, as some Battenberg versions do.
This should become an amendable argument in pipeline.sh in future, but for now it needs to be updated manually in the R script.
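Until then, a sed substitution is one way to make the edit without opening the file. The snippet below demonstrates it on a scratch stand-in rather than the real script, and "_subclones.txt" is only an example replacement suffix:

```shell
# Stand-in for TM_1_collate_all_subclone_files_CW.R; point SCRIPT at the
# real file in your checkout instead.
SCRIPT=$(mktemp)
echo 'file_pattern="subclones.txt"' > "$SCRIPT"

# Swap in whatever suffix your Battenberg version actually uses
# ("_subclones.txt" here is just an example).
sed -i 's/file_pattern="subclones.txt"/file_pattern="_subclones.txt"/' "$SCRIPT"

grep 'file_pattern' "$SCRIPT"    # -> file_pattern="_subclones.txt"
```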

You may encounter errors during the ordering stage of de novo discovery. These are most likely due to known issues in PLMIX, and can be remedied by downloading and installing our patched version, PLMIX.CW, from https://github.com/christopherwirth/PLMIX.CW
You will then need to amend line 86 of Ordering_Model/R/TM_postFDR_5_ordering_events_final_CW.R to:

if (PLmethod == "PLMIX") {
  library(PLMIX.CW)
} else {
  library(PLmethod,character.only = T)
}



########################
# RUNNING THE PIPELINE #
########################

In its current form, the pipeline requires quite a few arguments. An example of how to run the pipeline for a pre-determined sample set is as follows (to be run from the login node):

/path/to/Ordering_Model/BASH/pipeline.sh -n Example_Set -i /path/to/location/of/subclones_files/ -o /path/to/Ordering_Model/Example_Set/ -p PlackettLuce -g 1 -u notIncluded -r /path/to/Ordering_Model/reference_files/Example_Set_driver_mutations.txt -m /path/to/Ordering_Model/reference_files/Example_Set_nonsynonymous_mutations.txt -d /path/to/DPClust_data/ -w /path/to/Ordering_Model/reference_files/Example_Set_WGDStatus.txt -s /path/to/Ordering_Model/sample_sets/Example_Set_barcodes.txt -c /path/to/Winners_curse_corrected_data/ -a hg38

The following arguments are required:

-n : dataset name
-i : directory containing input subclones files (Battenberg outputs, or copy number calls formatted to look like them)
-o : directory to use as output folder
-p : Plackett-Luce R package to use (one of PlackettLuce or PLMIX)
-g : G, i.e. the number of subgroups - can be a single number or a comma-separated list of numbers - note this argument is ignored unless -p is PLMIX

The following arguments must also be set if you wish to include mutation data in the model (otherwise, -M must be set to FALSE - see below):

-r : file containing a list of driver mutation gene symbols to include in the analysis
-m : file containing nonsynonymous mutations
-d : directory including DPClust data

The following arguments are optional:

-M : whether or not to include mutation data in the model - if specified, it should be one of TRUE or FALSE; default TRUE
-u : how to handle unobserved events - one of notIncluded, end, or interspersed; default notIncluded
-w : file stating WGD status of each sample; if not specified, the pipeline will attempt to infer WGD status based on ploidy
-s : file containing a list of samples to include; if not specified, the pipeline will use all available samples with a subclones file in the input directory
-c : directory containing winner's curse corrected output data; if not specified, non-corrected CCFs will be used
-a : genome assembly - if specified, should be one of hg19 or hg38; default hg38
-S : memory saver mode - datasets with large numbers of low-CCF subclonal events can cause memory errors; in this case set -S TRUE; default FALSE
-G : all gene GRanges file - defaults to the hg38 file supplied in the references directory
-b : Battenberg output file suffix - default "subclones.txt"
-e : estimate the aggregate fixation point of the MRCA (includes MRCA as an additional "event" between clonal and subclonal events within each sample) - default FALSE
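For instance, a CNA-only run (no mutation data) would omit -r, -m, and -d and set -M FALSE; the paths below are illustrative, mirroring the earlier example:

```shell
/path/to/Ordering_Model/BASH/pipeline.sh -n Example_Set \
  -i /path/to/location/of/subclones_files/ \
  -o /path/to/Ordering_Model/Example_Set/ \
  -p PlackettLuce -g 1 -u notIncluded \
  -M FALSE \
  -w /path/to/Ordering_Model/reference_files/Example_Set_WGDStatus.txt \
  -s /path/to/Ordering_Model/sample_sets/Example_Set_barcodes.txt \
  -a hg38
```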

For each sample, there should be a file ending in "subclones.txt" (or an alternative suffix that you provide with -b) somewhere within the directory provided in the -i argument (in this example, /path/to/location/of/subclones_files/).

The directory in the -d argument should contain files with the following patterns:
 - _dpclust_input_data.txt
 - _bestConsensusAssignments.bed
 and, IF there is no winner's curse corrected directory provided:
 - _bestClusterInfo.txt

If winner's curse correction has been carried out, for each sample, the directory specified in the -c argument should contain a file with the pattern _WCC_ccf_nMut.txt
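A pre-flight check along these lines can save a failed submission. This is only a sketch: it assumes each file is named as the barcode followed by the pattern, which may not match your layout exactly.

```shell
# For each barcode in a sample list, confirm that the two always-required
# DPClust files exist in the -d directory. Assumed naming: <barcode><pattern>.
check_dpclust_inputs() {
  local dpdir=$1 barcodes=$2 status=0
  while IFS= read -r bc; do
    for pat in _dpclust_input_data.txt _bestConsensusAssignments.bed; do
      if [ ! -e "${dpdir}/${bc}${pat}" ]; then
        echo "missing: ${bc}${pat}"
        status=1
      fi
    done
  done < "$barcodes"
  return $status
}
```

Run as e.g. check_dpclust_inputs /path/to/DPClust_data/ barcodes.txt, and extend the pattern list with _bestClusterInfo.txt if you are not supplying a winner's curse corrected directory via -c.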

The above example should run the whole timing model for a set of samples, working on the assumption that they share a single timing trajectory.

Example files are included in the reference_files directory, to illustrate the format of:
- subclones files
- DPClust input and output data files
- a driver list
- a nonsynonymous mutations file (NOTE: although the example nonsynonymous mutations file includes many columns, only the first 4 are required for this pipeline: Tumor_Barcode, Chromosome, Start_Position, and End_Position - but they must be correctly named)
- a WGD status file
- a sample list (this file can actually be found in the sample_sets directory rather than reference_files)
To reiterate: these files are not intended to constitute a runnable example. They are intended to be illustrative of the required file formats.


#####################
# DE NOVO DISCOVERY #
#####################

If you wish to use the timing model to discover de novo subsets of samples with differing trajectories, this can be done in three stages:

1: Run the pipeline with PLMIX as the R method, and various different values of G provided as a comma-separated list, i.e.

/path/to/Ordering_Model/BASH/pipeline.sh -n Example_Set -i /path/to/location/of/subclones_files/ -o /path/to/Ordering_Model/Example_Set/ -p PLMIX -g 1,2,3,4,5 -u notIncluded -r /path/to/Ordering_Model/reference_files/Example_Set_driver_mutations.txt -m /path/to/Ordering_Model/reference_files/Example_Set_nonsynonymous_mutations.txt -d /path/to/DPClust_data/ -w /path/to/Ordering_Model/reference_files/Example_Set_WGDStatus.txt -s /path/to/Ordering_Model/sample_sets/Example_Set_barcodes.txt -c /path/to/Winners_curse_corrected_data/

2: Perform an additional step to identify the best G and discover the samples in each subset. On an SGE cluster this can be done via

qsub /path/to/Ordering_Model/BASH/run_postFDR_5b.sh /path/to/Ordering_Model/R/TM_postFDR_5b_select_best_G.R Example_Set /path/to/Ordering_Model/Example_Set/

This will generate n .txt files in the output folder named Example_Set_sample_group_1.txt to Example_Set_sample_group_n.txt (where n is the best value of G).

3: Re-run the pipeline for each subset, e.g. if n=3

for i in 1 2 3; do
  /path/to/Ordering_Model/BASH/pipeline.sh -n Example_Set_DN${i} -i /path/to/location/of/subclones_files/ -o /path/to/Ordering_Model/Example_Set_DN${i}/ -p PlackettLuce -g 1 -u notIncluded -r /path/to/Ordering_Model/reference_files/Example_Set_driver_mutations.txt -m /path/to/Ordering_Model/reference_files/Example_Set_nonsynonymous_mutations.txt -d /path/to/DPClust_data/ -w /path/to/Ordering_Model/reference_files/Example_Set_WGDStatus.txt -s /path/to/Ordering_Model/sample_sets/Example_Set_sample_group_${i}.txt -c /path/to/Winners_curse_corrected_data/
done


####################
# GENERATING PLOTS #
####################

The final stage - generating plots - is not yet properly integrated into the pipeline and so must be run separately. This is done by running e.g.

qsub /path/to/Ordering_Model/BASH/generate_timing_plot.sh /path/to/Ordering_Model/R/ Example_Set_DN2 /path/to/Ordering_Model/Example_Set_DN2/ /path/to/Ordering_Model/reference_files/Example_Set_driver_mutations.txt hg38 /path/to/Ordering_Model/reference_files/Example_Set_WGDStatus.txt

NOTE: Depending on how your cluster is configured, it may be IMPORTANT to include FULL PATHS to each file/directory in the arguments
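A simple safeguard is to resolve each argument to an absolute path before submission; this sketch assumes GNU coreutils:

```shell
# Resolve a possibly-relative path to an absolute one before passing it
# to qsub. 'realpath -- "$1"' is an equivalent alternative.
abs() { readlink -f -- "$1"; }

# e.g. (illustrative): prints the absolute path of the current directory
abs .
```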


###############################
# A NOTE ON UNOBSERVED EVENTS #
###############################

Unobserved events are one of the more complex and contentious aspects of the timing model. By "unobserved events" we mean, for a given sample, an event (i.e. a CNA or mutation) that has been identified as enriched across the whole cohort, but has not been observed in that sample.

We can set these to be effectively left out entirely, which is the default, and is what has been done in the examples above. Alternatively, we can make the assumption that all events that are observed across the cohort would occur in all samples eventually, but the "unobserved" ones simply have not been observed in the current sample *yet*. In this case, we can choose to include the unobserved events after all of the observed events. In practice this has the effect of ordering events largely by their prevalence: an event that is observed early in 5% of samples and unobserved in 95% of samples will be timed very late overall, despite our only *observed* evidence being that the event occurred early. However, there is an argument that the underlying assumption (i.e. that all non-mutually-exclusive events would occur eventually and therefore unobserved ones would occur at a later time) may be a reasonable assumption to employ after subgroups have been discovered.

A third option is to include the unobserved events by interspersing them at random times among the course of the observed events.


#################
# ANYTHING ELSE #
#################

Feel free to contact me if you have any questions or issues:
christopher.wirth@manchester.ac.uk