Health

# The co-evolution of the genome and epigenome in colorectal cancer

The co-evolution of the genome and epigenome in colorectal cancer

### Sample collection

Primary tumour tissue and matched blood samples were prospectively collected from patients undergoing curatively intentioned surgery at University College London Hospital (UCLH). All patients gave informed consent for collection of their materials to the UCLH Cancer Biobank (Research Ethics Committee approval 15/YH/0311). Four regions of each primary cancer were sampled by punch biopsy or scalpel dissection, at notionally 12, 3, 6 and 9 o’clock positions around the tumour periphery. Each region was further cut into four pieces and slow-frozen to −80 °C, using a Mr Frosty Freezing Container (Thermo Fisher), in 1 ml of a buffered medium (MEM supplemented with 5% FBS and 0.5% 5 mM HEPES buffer, diluted with 10% dimethylsulfoxide) in a 1.8-ml Nunc Cryotube (Sigma-Aldrich), and immersed in isopropanol to preserve chromatin structure. All investigators were blinded to patient data related to outcome, gender and other clinicopathological information.

### Gland isolation

A clean glass slide was placed into a 10-cm Petri dish and 500 μl PBS supplemented with RNAse and protease inhibitors was pipetted on top of the slide. The Petri dish was then transferred to the stage of a dissecting microscope. Tissue pieces were manually dissociated under the microscope using two 16G needles, with individual glands being pulled away from the tissue mass. For every specimen, a further epithelial minibulk sample that comprised a total of approximately 10–20 crypts or glands was collected. Each gland was transferred to a 1.5-ml Eppendorf tube containing a total volume of 50 µl cell lysis buffer (10 mM Tris-HCl pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% IGEPAL CA-630 supplemented with protease inhibitors (1 tablet in 50 ml dH2O as directed by the manufacturer (cOmplete Protease Inhibitor Cocktail, Sigma-Aldrich) and RNASE inhibitor 1 U µl−1 (Protector RNase Inhibitor, Sigma-Aldrich) and incubated on ice for 10–45 min. Bulk samples were collected in a final volume of 100 µl cell lysis buffer. We found that longer or warmer incubations decreased the RNA quality and yields and negatively affected chromatin structure. In selecting the 30 cases included in our study, we rejected only a single further case owing to being unable to isolate any glands, confirming that retention of glandular structures is pervasive in CRC.

### Chromatin, DNA and RNA separation

Each tube containing an individual gland or bulk was lightly vortexed, transferred to a pre-chilled centrifuge and spun at 500g for 10 min holding the temperature at 4 °C. This produced a pellet of cell nuclei at the bottom of the tube, with the cytosolic fraction present within the supernatant. For RNA extraction, 45 µl (glands) or 90μl (bulks) of the supernatant was transferred into a new tube containing 300 µl of TRIzol (taking care not to disturb the pellet). TRIzol lysates were stored at −20 °C if not processed immediately or at −80 °C for long-term storage. For extraction of nuclear material, the pellet of nuclei was resuspended in residual cell lysis buffer. A 2.5 µl volume of suspension was transferred into another tube for subsequent DNA extraction, which was frozen if required. The remaining suspension was immediately used for preparation of ATAC-seq libraries, as we found subsequent handling or storage compromised library quality.

### Preparation of ATAC-seq libraries

For library preparation, the transposed sample was supplemented with 1 µl of 10 µM Nextera i7 PCR primer, 1 µl of 10 µM Nextera i5 PCR primer (Illumina) and 12.5 µl of NEBNext Q5 High-Fidelity 2× PCR Master Mix (New England Biolabs). PCR amplification was performed, with initial elongation at 72 °C for 5 min, then initial denaturation at 98 °C for 30 s, and then 14 cycles (for glands) or 10 cycles (for bulks) of the following: 10 s of denaturation at 98 °C, annealing step at 63 °C for 30 s followed by 72 °C for 1 min.

Following amplification, samples were purified with 2× SPRI beads and eluted in 20–30 µl of 10 mM Tris buffer, pH 8. Samples were screened using the Agilent Tapestation 4200 and HSD1000 screentapes. Only those that showed a fragment size distribution with peaks at multiples of about 147 base pairs (bp), indicating intact nucleosomal structure within the nuclei, were sent for sequencing.

### Preparation of WGS libraries

DNA fractions were extracted using the Zymo QuickDNA Microprep plus kit according to the manufacturer’s instructions. Only samples with a total DNA yield higher than 10 ng were taken forwards for WGS library preparation. Libraries were prepared using the NEBNext Ultra II FS kit according to the manufacturer’s instructions. A short enzymatic fragmentation step of 5 min was performed, and five PCR cycles were used for library enrichment. After purification, libraries were quantified by Qubit and run on the Agilent Tapestation using HSD1000 screentapes. Samples with sufficient library DNA yield and characteristic fragment size distribution (about 200–500 bp) were further subjected to either low-pass (about 1× coverage) or deep (about 35× coverage) WGS.

### RNA library preparation

The cytoplasmic fractions of each sample in the form of TRIzol lysates were used for RNA extraction using the Directzol kit (Zymo R2052). Modifications to the manufacturer’s protocol were introduced to increase the total RNA yields. First, we passed the initial TRIzol and ethanol mix twice through the spin column. Second, we eluted the RNA using two 25 µl volumes of water instead of just one 50 µl elution. The optional DNase step was used.

Agilent Tapestation quality control showed low RNA integrity number scores (<3) for most samples and so was not used to exclude samples for library preparation. Libraries were prepared using the Illumina TruSeq RNA Exome kit (compatible with low-quality input material) according to the manufacturer’s instructions.

### Methylation arrays

DNA methylation array analyses were carried out on selected bulk samples with sufficient DNA yield. Genomic DNA was bisulfite-converted using the Zymo EZ DNA Methylation kit. A 50-µl reaction containing 2.5–100 ng of DNA was incubated in the dark using a modified conversion protocol: 95 °C for 30 s and then 50 °C for 60 min, for 16 cycles and then holding at 4 °C. The full 8 µl eluate of converted DNA was repaired using the Infinium HD FFPE Restore Kit (Illumina). All 8 µl of the bisulfite-converted DNA for each sample was analysed on the lllumina Human MethylationEPIC BeadChip (Illumina). Processing was carried out by the University College London Genomics Core Facility according to a standard protocol.

### Sequencing

Sequence libraries were multiplexed and sequenced on an Illumina NovaSeq, typically using S2 flow cells. Read length and depth were varied as required by library composition. Sequencing was performed by the Institute of Cancer Research Tumour Profiling Unit.

### Alignment for WGS

Contaminating adapter sequences were removed using Skewer v0.2.2 (ref. 62). Adapter sequences were 5′-AGATCGGAAGAGC-3′ and 5′-ACGCTCTTCCGATCT-3′, with a maximum error rate of 0.1, a minimum mean quality value of 10 and a minimum read length of 35 after trimming using the options -l 35 -r 0.1 -Q 10 -n. The trimmed and filtered reads from each sequencing run and library were separately aligned to the GRCh38 reference assembly of the human genome63 using the BWA-MEM algorithm v0.7.17 (ref. 64). Following the GATK best practices and the associated set of tools v4.1.4.1 (refs. 65,66,67), reads were sorted by coordinates (GATK SortSam), independent sequencing runs or libraries generated from the same tissue sample were merged and duplicate reads were marked using GATK’s MarkDuplicates. The structure of the final bam files was verified using GATK’s ValidateSamFile.

### Alignment for ATAC-seq

Adapter sequences were removed with Skewer v0.2.2 (ref. 62) using the following full-length adapter sequences with the option ‘-m any’: 5′-CTGTCTCTTATACACATCTCCGAGCCCACGAGACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG-3′ 5′-CTGTCTCTTATACACATCTGACGCTGCCGACGANNNNGTGTAGATCTCGGTGGTCGCCGTATCATT-3′.

The reads of each sequencing run and library were aligned to the GRCh38 reference genome using Bowtie2 v2.3.4.3 (ref. 68) with the options ‘–very-sensitive -X 2000’ set. After sorting the reads with SAMtools v1.9 (ref. 69), those mapping to non-canonical chromosomes and mitochondria (chrM) were removed (GATK PrintReads followed by RevertSam and SortSam). After merging independent libraries for each sample, we removed duplicate reads using GATK’s MarkDuplicates and removed all reads mapping to several locations (multi-mappers). The final bam files were validated with GATK’s ValidateSamFile.

### Detection of germline variants

HaplotypeCaller v4.1.4.1 with the GATK package70 was used to identify germline variants from the reference normal samples in each patient (buffy coats or adjacent normal tissue) using known germline variant annotations from build 146 of the dbSNP database71 separately for each chromosome. Resulting VCF files were then merged with GATK MergeVcfs. Variant recalibration was performed with GATK’s VariantRecalibrator with options set according to GATK best practices71,72,73,74 and applied to VCF files using GATK ApplyVQSR with the options ‘-mode SNP -ts-filter-level 99.0’ and ‘-mode INDEL -ts-filter-level 99.0’, respectively. All germline variant calls marked as PASS were retained.

### Verification of sample–patient matches

For all samples, we excluded the possibility of sample mismatch by comparing germline variants identified in normal tissue to neoplasia samples of a given patient. The reads of each read group were extracted with SAMtools view using the options ‘-bh {input_bam} -r {read_group_id}’, and GATK’s CheckFingerprint tool was applied to extract statistics on sample–patient matches75. For virtually all high-purity samples without extensive loss of heterozygosity, we were able to confirm that the samples were obtained from the expected patient. A few samples with high amount of LOH and high purity fingerprinting did not confirm the sample-patient match; for these we instead inspected copy-number profiles (see below) to confirm that these matched the remaining samples of the corresponding patient.

### Copy-number analysis

#### Deep WGS

Coverage values for genomic loci relative to matched normal tissue samples (buffy coats or adjacent normal tissues) were extracted with methods provided in the Sequenza v2.1.2 package for R (ref. 76) and binned in non-overlapping windows of 106 bp. B-allele frequencies of germline mutations determined with the GATK HaplotypeCaller (see above) for each patient were added to these binned files. Joint segmentation on B-allele frequencies and depth ratios across all samples from a given tumour were used to determine a set of breakpoints to use for the subsequent analysis. Specifically, GC content bias correction was applied using the gc.norm method from Sequenza v2.1.2, and positions with non-unique mappability (that is, <1), as determined by the approach of QDNAseq v3.8 (ref. 77), in windows of 50 bp were removed. Piecewise constant curves were fitted for each chromosome arm using the multipcf function (gamma = 80) from the copynumber v1.22.0 package for R (ref. 78). The per-patient set of breakpoints, binned depth ratio and B-allele frequency data were then inputted into the Sequenza algorithm (v2.1.2) to determine allele-specific copy numbers, ploidy Ψ and purity ρ estimates76. The initial parameter space searched was restricted to {ρ | 0.1 ≤ ρ ≤ 1} and {Ψ | 1 ≤ Ψ ≤ 7}. On manual review of the results, we identified several samples with unreasonable fits (cases in which calls suggested extremely variable ploidy values across samples). For these samples, we manually identified alternative solutions consistent with the other samples and somatic variant calls.

#### Low-pass WGS

Low-pass WGS bam files were processed using QDNAseq77 to count reads in 500-kilobase (kb) bins across the autosomes of hg38 and convert read counts into log2-ratios. Data normalization was performed in accordance with the QDNAseq workflow, except for outlier smoothing (smoothOutlierBins function), which was seen to artificially depress the signal from highly amplified bins. Bins for hg38 were also generated according to QDNAseq instructions. log2[ratio] values in each bin were normalized by subtracting the median log2[ratio] from all log2[ratio] values per sample. Samples from a patient were segmented jointly using the multipcf function in the R package copynumber (gamma = 10)78, and the mean segment log2[ratio] was calculated across the bins.

Absolute copy-number status was calculated using the approach taken by ASCAT79. Using the ASCAT equation to describe log2[R ratio] values, we took an integer ploidy value Ψt in the tumour t as determined by paired deep WGS in each case and searched a range of purities from 0.1 to 1 (and assumed gamma was 1 as is the case in sequencing data). For each purity (ρ) value, we calculated the continuous copy-number status of each bin and calculated the sum of squared differences of these values to the nearest positive integer of the modulus. Purity estimates were given by local minima (goodness of fit to integer copy-number values, measured as the sum of squared differences) across the purity range considered. The absolute copy-number state for each bin was taken as the closest integer value calculated using this purity. If no local minimum was found the purity was assumed to be 1. If the best solution produced negative copy-number states at some loci, these were set to have a copy number of zero to avoid impossible copy-number states. In two patients, per sample ploidies were determined by manual adjustment owing to integer ploidy values producing poor fits.

### SNV detection

Somatic mutations were first called for each tumour sample separately against matched blood derived buffycoats or adjacent normal tissue samples with Mutect2 (v4.1.4.1) using the options ‘–af-of-alleles-not-in resource 0.0000025 –germline-resource af-onlygnomad.hg38.vcf.gz’ (refs. 70,80). Variants detected in any tumour sample (marked PASS, coverage AD 10 in both normal and tumour, at least 3 variant reads in the tumour, 0 variant reads in the normal, reference genotype in normal and non-reference genotype in cancer) were merged into a single list of candidate mutations. The multi-sample caller Platypus v0.8.1.1 (ref. 81) was then used to recall variants at each candidate mutation position in all samples of the patient. In practice, this meant that the pipeline leveraged information across samples to improve the sensitivity of variant calling. The Platypus output of joint variant calls was then filtered to keep only high-quality variants with the flags PASS, alleleBias, QD or Q20, in canonical chromosomes (that is, not in decoy), a minimum number of reads NR > 5 in all samples, a genotyping quality GQ > 10 in all samples, a reference genotype (that is, 0/0) in the normal reference and a non-reference genotype (that is, 0/1 or 1/1) in at least one tumour sample.

To alleviate concerns of false-negative calls of mutations in important driver alterations, we generated a second set of variant calls for the identification of known driver mutations and dn/dS analysis (see details below) to which we did not apply the second step of filtering.

### SNV annotation

Somatic variants were annotated and candidate driver genes of CRCs reported by ref. 3 and IntOGen82 as well as pan-cancer driver genes reported by refs. 33,83 were filtered with the Variant Effect Predictor v93.2 (ref. 84).

### MSI status detection

The identification of MSI CRCs was performed with MSIsensor v0.2 (ref. 85). We first determined the position of microsatellite sites by applying the MSIsensor scan method to the GRCh38 reference assembly and subsetting the identified microsatellites to those located on the first chromosome. In a second step, we identified the fraction of mutated microsatellites in each sample using the MSIsensor msi method with default options. Generally, in known MSI cases (for example, those identified by mutation burden and mutational signature), more than 30% of microsatellites were mutated, and we used this as a critical value to classify cases as MSS and MSI. One exception was C562, in which the low purity of the samples led to a low MSIsensor score. However, this case was clinically classified as MSI by pathological reports, and it had a relatively high indel burden leading to the conclusion that it was MSI.

### Extraction of reads supporting variants

Using the VCF files from both somatic and germline variant calling, we extracted the number of reads supporting the reference and alternative alleles as well as the total number of reads covering the sites from WGS, low-pass WGS and ATAC-seq samples using Python and the Pysam library69, Pysam v0.15.2, SAMtools v1.9.

### dn/dS analysis

The dndscv package for R (ref. 33) was used for dn/dS analysis. Per-patient variant calls were obtained from the VCF files86 and lifted over to the hg19 reference genome using the rtracklayer package for R (ref. 87). Variants were divided into clonal mutations (that is, present in all samples) and subclonal mutations (that is, present in a subset of samples) present in the cancer and a set of mutations present in any of the adenoma samples. MSI and MSS cases were treated separately. dndscv was applied separately to each of the four sets (MSI or MSS and clonal or subclonal) (using default parameters apart from deactivated removal of cases because of a highe number of variants). Further, dn/dS values for a set of 167 chromatin modifier genes were extracted.

### ATAC-seq

#### Peak detection in ATAC peak-calling analysis

Peaks were called separately for each tumour region using MACS2 v2.21 (ref. 89) using ‘macs2 callpeak -f BED -g hs –shift –75 –extsize 150 –nomodel –call-summits –keep-dup all -p 0.01’ with the concatenated and sorted bed read files of nucleosome-free cut sites of all samples as input. A set of normal peaks (across patients) was called using the concatenated normal sample bed files (that is, region E samples) as input. Per-adenoma peaks were  called using all adenoma bulk samples as input.

#### Filtering and concatenation of peaks in ATAC peak-calling analysis

Per region peak calls were filtered for those having a q-value < 0.1%, enrichment > 4.0, and a maximum of the top 20,000 peaks. Iterative merging was then applied, using a method equivalent to that used in ref. 11 on per-region peak calls of individual patients (per-tumour peaks set) as well as across all cancer samples and pan-patient normal peak calls (pan-patient peak set). The iterative merging resulted in a total of n= 343,240 peaks, of which n = 67,215 peaks called in >2 tumour regions or the panel of normal were retained. The ChIPseeker v2.14.0 package for R (ref. 90) was used in combination with the TxDb.Hsapiens.UCSC.hg38.knownGene package v3.10.0 for R to annotate peaks on the basis of their genomic location. For peaks that were not proximal to known promoter regions (±3,000 bp), overlaps with known enhancer elements reported in the double-elite annotations of the GeneHancer database were examined91.

#### Extraction of cut sites in peaks in ATAC peak-calling analysis

Read counts for each peak in the final set were collated using bedtools92 using: ‘bedtools coverage -a bed peaks -b bed cut sites -split -counts -sorted’.

#### Purity estimation for ATAC-seq and accounting for CNAs

Clonal variants identified by paired WGS sequencing (clonal variants were those present in all samples from the cancer) were used to estimate sample-specific ATAC-seq purity. First, variants in intervals with identical (clonal) copy-number states (that is, A&B-allele states) and regions of closed chromatin were identified from WGS data. Copy-number values ci and mutation multiplicity mi of each variant site i were obtained from the WGS data. For a mutation at site i covered by ns,i reads in sample s, the number of reads ki containing the alternative allele is expected to follow a binomial distribution with the pdf

$$B\left({k}_{i}| {p}_{s,i},{n}_{s,i}\right)=\left(\begin{array}{c}{n}_{s,i}\\ {k}_{i}\end{array}\right){p}_{s,i}^{{k}_{i}}{\left(1-{p}_{s,i}\right)}^{{n}_{s,i}-{k}_{i}}$$

in which the expected success probability ps,i is a function of the sample purity ρs the number of mutated alleles in the tumour cells ms,i, the total copy number of the mutated site in the tumour cells cs,i and the copy number in contaminating normal cells cn = 2

$${p}_{s,i}=\frac{{\rho }_{s}{m}_{s,i}}{{\rho }_{s}{c}_{s,i}+\left(1-{\rho }_{s}\right){c}_{n}}=\frac{{\rho }_{s}{m}_{s,i}}{{\rho }_{s}{c}_{s,i}+2-2{\rho }_{s}}$$

A maximum-likelihood estimate of the sample purity pwas obtained by minimizing the negative-log-likelihood $$L({\rho }_{s})=\mathop{\sum }\limits_{i=0}^{N}-\log (B({k}_{i}|{p}_{s,i},{n}_{s,i}))$$ across all N mutated sites.

To account for the influence of CNAs on the read counts, the signal observed at a locus should be given by $$S={S}_{N}\frac{2\,\left(1-\rho \right)+\pi \rho }{2\left(1-\rho \right)+\psi \rho }$$, in which SN is the signal of the reference allele, $$\rho$$ is the purity of the sample, $$\pi$$ is the copy number of the locus, and $$\psi$$ is the ploidy of the tumour. For pooled samples, we calculate the average of S weighted by the total number of reads across samples. Indeed, CNAs were affecting the read depth at the locus (see the figures at https://doi.org/10.6084/m9.figshare.19849789 for details).

However, it is important to consider that, in general, CNAs are causing relatively small changes in the ATAC-seq signals compared to those of bona fide SCAAs. This was demonstrated by the strong correlation of the recurrence number in the model with copy-number adjustment versus the one without. This approach was most relevant in the identification of lost chromatin accessibility in regions with a copy-number gain and gained chromatin accessibility in regions with a copy-number loss.

#### Identification of recurrently altered peaks across patients

Analysis was restricted to samples with purity $$\rho$$ > 0.4. Peaks proximal (≤1,000 bp) to a transcription start site (TSS; that is, promoters) and those more distant to a TSS (that is, putative enhancers) were considered separately to account for the possibility of differential dispersion. Whereas we relied on proximity for promoters, we used the GeneHancer database for enhancers91. An overdispersed Poisson model was fitted to each peak using edgeR v3.30.3 (refs. 93,94), per-sample set normalization factors were calculated using the TMMwsp method95, a global dispersion estimate was estimated across sets from all cancers and each set of pure glands (per patient) was compared against a large pool of normal tissue ATAC-seq samples. Recurrently altered peaks were identified as those that were significantly altered at a level of P ≤ 0.01 in at least 4/26 (that is, 20%) of cases.

#### Identification of associated changes in gene expression

The basic processing of matched RNA-seq data is described in the associated manuscript38. A subset of 27,699 peaks that were either adjacent to a known TSS of a gene96 or overlapped a previously characterized enhancer element described in the GeneHancer database91 were identified. Of these 456/27,699 (≈1.65%) were recurrently altered. Changes in expression of genes associated with these sites were tested for using DESeq2 (ref. 97) to compare coefficients of the fitted β-binomial regression model (design: ~Patient, with all normal samples as ‘Normal’) with the contrast argument being a list of vectors containing the significant and non-significant patient sets.

For promoters, a one-tailed hypothesis test was applied by setting the altHypothesis argument to ‘less’ (for closed peaks) or ‘greater’ (for opened peaks). For enhancers, a two-tailed hypothesis test on all associated genes was applied by setting the altHypothesis argument to ‘greaterAbs’. P values from all tests were adjusted for multiple-hypothesis testing using FDR method98 associations at FDR < 0.1% where reported. For the visualization of gene expression values, the average variance stabilised log-transformed gene expression was calculated across samples of all each cancer and across all normal samples.

#### Identification of subclonal changes in recurrently altered peaks

Subclonality was assessed only for a set of recurrent somatic accessibility changes, comprising recurrent events affecting driver genes and the top 25 most recurrent in each of the 4 categories: gained promoter, lost promoter, gained enhancer and lost enhancer (total of 521 sites assessed).

Our previous analyses recognized that sample purity was highly correlated with tumour piece (regions A–D). To distinguish subclonal chromatin accessibility alterations from variability in ploidy, regression to account for purity was performed. Specifically, a log ratio test from DESeq2 was used to compare a ‘full model’ ~purity + region to a reduced model ~purity. Samples from the same region were used as biological replicates. Events were considered putatively subclonal when the adjusted P value was below 0.05 and if the direction of log[fold change] from analysis of matched bulk tissues was correlated with that observed in individual samples. In the case of gained events, subclonal events were filtered out if MACS peak-calling (see above) had not called a peak within 500 bp of the location of the putative gain event (this removed 33 sites). For losses, 5/45 subclonal events were removed as the log[fold change] was in the wrong direction.

For visualization of peaks, coverage per region was calculated 1 kb upstream and 1 kb downstream from the centre of the peak. Coverage was normalized per million reads in peaks and was plotted using functions from GenomicRanges99 and Gviz100.

#### Prediction of TF-binding sites

The motifmatchr package for R (ref. 101), a reimplementation of the C++ library MOODS102,103, was used to identify binding sites for all human TF motifs defined in a curated version of the CIS-BP database104. The list of predicted binding sites was filtered using a minimum significance value of P ≤ 10−6, followed by removal of binding sites in centromeric regions and non-autosomal (that is, sex and non-canonical) chromosomes. After this initial filtering, predicted binding sites were split into six distinct groups on the basis of their distance to the next TSS (proximal: d ≤ 2,000 bp; close: 2,000 bp < d ≤ 10,000 bp; distal: d > 10,000 bp) and whether they overlapped with a peak observed in the ATAC-seq data. For a number of TFs, homotypic clustering of binding sites in specific intervals was observed; to account for this, binding sites that were closer than d ≤ 1,000 bp to the next predicted binding site of the same TF were removed.

#### Extraction of signal values

For each of the TF sets described above, the counts of insertions around the centre of the TF-binding site (±1,000 bp) as well as the insertion size of the read pair (that is, thedistance to the second nick) for each sample99 were tabulated. The insertion sizes (rows) were binned into intervals of 5 bp and divided by the total count of reads with an equivalent size in the entire genome. After this, the background signal was estimated to be theaverage number of insertions 1,000 bp–750 bp from the centre of the TF-binding site per insertion size and subtracted from the counts. The difference between these normalized and background-corrected TF signals in each sample and a pool of normal samples was calculated and integrated across the central region of the TF-binding sites (insertion size [25;120], distances [−100 bp;100 bp]) as a summary statistic. Linear regression analysis was used to identify associations with purity estimates, and in this context, signals were found to correlate with TSS enrichment (TSSe; for both nucleosome-free and all reads). For thisreason, a further term was added to the regression model of each TF to correct for this effect: signal ≈ tsse*tssenf + purity:patient (where ‘:’ indicates an interaction between two or more variables in the model formula and ‘*’ indicates all the main effects and interactions among the variables that it joins), in which tsse and tssenf are the differences in TSSe between the sample and the pooled normal samples, and each observation was weighted by the square root of the number of reads in the sample. A second linear model in which a region-specific effect of the purity (signal ≈ tsse*tssenf + purity:region) was considered was also fitted to the data. For both models, the statistical significance of the purity coefficient was determined. The estimates of the coefficients were also used as a patient-specific summary for subsequent analysis.

#### Cluster analysis

The analysis was focused on the 150 TFs for which a significant association with the tumour cell content (that is, the purity) and TF signal was most frequently observed. With the aim to identify general patterns in these data, a clustering analysis was conducted (hierarchical clustering with Euclidean distance and complete linkage). This method identified three main groups of TFs, each of which was analysed with STRINGdb105 to identify significantly overrepresented pathways.

#### Methylation array analysis

A reference normal methylation array dataset was downloaded from ref. 106 that included normal tissue sampled adjacent to CRCs that was profiled using the HumanMethylation450 BeadChip array (Illumina).

Here, eight bulk samples from four cases (C516, C518, C560 and C561) were profiled using the MethylationEPIC BeadChip (Infinium) microarray according to the manufacturer’s instructions.

The ChAMP R package pipeline107 was used to analyse the methylation bead array data. Probes that had a detection P > 0.01 and probes with <3 beads in at least 5% of samples per probe, probes that were on the X or Y chromosome, all probes associated with single nucleotide polymorphisms and all multi-hit probes were removed. Subset-within-array normalization was used to correct for biases resulting from type 1 and type 2 probes on the array. After quality control and normalization, β-values were calculated for further comparison.

To compare the methylation patterns between our samples and the reference normal dataset, the overlapped probes of all samples located distal to the TSS, close to the TSS and proximal to the TSS, both on the ATAC peak and not on the ATAC peak were compared.

#### Processing of RNA-seq

After initial quality control with FastQC (https://github.com/s-andrews/FastQC) and default adapter trimming with Skewer62, paired-end reads were aligned to the GRCh38 reference genome and v28 of the Gencode GTF annotation using the STAR two-pass method108. Read groups were added with Picard v2.5.0 (http://broadinstitute.github.io/picard). Per-gene read counts were produced with htseq-count, which is incorporated in the STAR pipeline108.

#### Filtering of RNA samples

Raw gene counts were first filtered for reads uniquely assigned to non-ribosomal protein-coding genes located on canonical chromosomes (chr1-22, X and Y). If samples had fewer than 5 million of these ‘usable’ reads, they were resequenced to improve coverage. When possible, the same library preparation pool was sent again for sequencing. These ‘top-ups’ proved to be true technical replicates, as the resulting gene expression of the resequenced samples clustered very closely to their original samples on both a sample–sample heatmap and a principal component analysis. It was therefore determined that the FASTQs of these samples could simply be merged at the start of the pipeline. In cases in which resequencing was required but insufficient library remained, a new library was prepared, and the sequencing run that produced the highest read was used in subsequent analysis. For eight samples, the sequencing of the second library contained too few reads to enable downstream analysis. Six of eight samples showed per-gene read counts that were very similar between libraries 1 and 2 (Spearman’s rank correlation between replicates was significantly higher than the mean; Wilcoxon one-way rank test; FDR < 0.01) and so read counts were combined across libraries; the two remaining samples were discarded. Samples were also discarded if matched DNA sequencing revealed a tumour purity of less than 0.1.

#### Gene expression normalization and filtering

The number of non-ribosomal protein-coding genes on the 23 canonical chromosome pairs used for quality control was 19,671. Raw read counts uniquely assigned to these genes were converted into both transcripts per million and variance-stabilizing transformed (vst) counts using DESeq2 (ref. 97).

A list of expressed genes (n = 11,667) was determined by filtering out genes for which less than 5% of tumour samples had at least 10 transcripts per million. To concentrate on tumour epithelial cell gene expression, genes were further filtered out if they negatively correlated with purity as estimated from matched DNA-sequencing data. Specifically, for the 157 tumour samples that had matched DNA sequencing and therefore accurate purity estimates, a linear mixed-effects model of exp(vst) ≈ Purity + (1|Patient) was compared using a chi-squared test to exp ≈ (1|Patient). Genes that had a negative coefficient for purity in the first model and an FDR-adjusted P value less than 0.05, suggesting that purity significantly affected the expression, were filtered out. This led to a filtered list of 11,401 expressed genes.

#### Mutational signature analysis

Mutational signature analysis was performed with SparseSignatures51. This method uses LASSO regularization109 to reduce noise in the signatures, controlled by a regularization parameter lambda (λ). It implements a procedure based on bi-cross-validation110 to select the best values for both the regularization parameter λ and the number of signatures. Deconvolution using a maximum of 10 signatures was performed and values of λ of 0.000, 0.025, 0.050 and 0.100 were tested. Optimal parameters were selected on the basis of the median bi-cross-validation error estimated over 1,000 iterations, resulting in an optimal estimate with minimum cross-validation median error when 6 signatures were fitted and λ = 0.025. A second analysis with SigProfiler111, with default parameters and a total of 1,000 iterations, confirmed the existence of these signatures.

Signature-based clustering was performed considering the six-signature solution by SparseSignatures; the signatures exposure matrix given as an output by the tool was used to compute the pairwise similarity matrix for each patient as 1 minus the cosine similarity of their exposures. Clustering was then performed on the similarity matrix by k-means with six clusters explaining all of the variance. Although from a statistical perspective clusters C3 and C4 are defined by a small number of samples (and explain 3% and 4% of the variance, respectively), from the biological perspective, we have evidence that in these patients the distribution of mutations resembles very different signatures and mutational processes (Supplementary Fig. 15A).

Mutational signature exposures were also analysed across epigenetic regions. Mutations were first grouped as clonal or subclonal across the whole genome and then in different genomic regions (as described above). Signature activities in each region were estimated by jackknife sampling112. Specifically, data from each patient were partitioned on the basis of their clusters as defined above, and repeated jackknife sampling was performed 100 times independently for each of the 3 clusters (including a random sample of 90% of the tissue samples each time). For each iteration, the mutations in each genomic region were used to compute a data matrix normalized against the trinucleotide content (across the 96 channels) in the whole genome versus region-specific counts, and signature assignments were then performed on the normalized data by LASSO51,109. Finally, relative signature activities estimated over the 100 jackknife samples were normalized on the basis of the total size of each region. Moreover, as clusters C3 and C4 represent rare and very distinct mutational patterns, we excluded these samples from the estimation of mutational processes in the epigenetic regions by jackknife, and instead we focused on MSS (cluster 1) versus MSI (clusters 2 and 5) tumour, as the samples in clusters C3 and C4 would probably have biased the jackknife estimation for these two groups.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

#coevolution #genome #epigenome #colorectal #cancer