Construction of Erysimum cheiranthoides genome v2.0
Project Leader: Georg Jander (gj32@cornell.edu)
Gordon Younkin (gcy7@cornell.edu) - genome assembly update and gene annotation liftover
Jing Zhang and Susan Strickler (srs57@cornell.edu) - functional annotation
16 February 2021

Summary:
v2.0 of the E. cheiranthoides genome is a chromosome-level reassembly of the original PacBio contigs based on a 501-marker genetic map constructed from an 83-member F2 population resulting from a cross between E. cheiranthoides var. Elbtalaue and E. cheiranthoides var. Konstanz. v1.2 of the E. cheiranthoides genome, which is a proximity-guided assembly based on a Hi-C scaffolding (Z

Methods:
To generate markers from RNAseq data, read mapping and SNP calling were performed following the Genome Analysis Toolkit (GATK) best practices for RNAseq short variant discovery (DePristo et al. 2011; Van der Auwera et al. 2013). RNAseq data from 83 F2 plants, five var. Konstanz plants, and five var. Elbtalaue plants were aligned to unpolished PacBio contigs using STAR version 2.7.1a with default parameters and 2-pass mapping (Dobin et al. 2013). The resulting BAM files were cleaned using the GATK tools MarkDuplicates, AddOrReplaceReadGroups, and SplitNCigarReads. Variants were called with HaplotypeCaller, and joint genotyping was performed using GenotypeGVCFs (McKenna et al. 2010). The resulting VCF file was filtered using bcftools filter (Li 2011) to retain only biallelic SNPs with a quality score greater than 30, an alternate allele frequency between 0.3 and 0.7, excess heterozygosity less than two, and a called genotype in at least half of the samples. The filtered VCF was converted to ABH format using TASSEL 5 (Bradbury et al. 2007), the markers were binned using SNPbinner (Gonda et al. 2019), and a genetic map was constructed using MSTmap (Wu et al. 2008). During map construction, one contig was found to be chimeric and was split at the most likely splice point, as determined by visual inspection of aligned PacBio reads.
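The SNP filtering criteria above can be illustrated with a minimal Python sketch. This is not the bcftools command used for the release; it only restates the stated thresholds as code. It assumes that excess heterozygosity refers to the ExcessHet INFO annotation written by GATK, that the allele frequency bounds are inclusive, and that GT is the first FORMAT field. The function names and parameters are hypothetical.

```python
import re

BASES = set("ACGT")


def parse_info(info_field):
    """Turn a VCF INFO string such as 'A=1;B=2' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def passes_filters(vcf_line, min_qual=30.0, af_min=0.3, af_max=0.7,
                   max_excess_het=2.0, min_call_rate=0.5):
    """Return True if one tab-delimited VCF record meets all criteria."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual, info_field = fields[3], fields[4], fields[5], fields[7]
    samples = fields[9:]

    # Biallelic SNP: single-base REF and exactly one single-base ALT.
    if ref not in BASES or alt not in BASES:
        return False

    # Quality score must be strictly greater than the threshold.
    if qual == "." or float(qual) <= min_qual:
        return False

    # Assumption: records without an ExcessHet annotation are rejected.
    excess_het = parse_info(info_field).get("ExcessHet")
    if excess_het is None or float(excess_het) >= max_excess_het:
        return False

    called = 0
    alt_alleles = 0
    total_alleles = 0
    for sample in samples:
        alleles = re.split(r"[/|]", sample.split(":")[0])
        if "." in alleles:
            continue  # missing genotype
        called += 1
        total_alleles += len(alleles)
        alt_alleles += alleles.count("1")

    # A genotype must be called in at least half of the samples.
    if not samples or called / len(samples) < min_call_rate:
        return False

    # Alternate allele frequency among called alleles (inclusive bounds).
    alt_freq = alt_alleles / total_alleles
    return af_min <= alt_freq <= af_max
```

In an F2 population, markers segregating between the two parents are expected at an alternate allele frequency near 0.5, which is why a window around that value removes most spurious calls.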
The resulting genetic map was reconciled with the Hi-C proximity-guided assembly using a custom Python script that prioritized the placement and orientation of contigs in the genetic map. The final FASTA assembly containing pseudomolecules and contigs was constructed using CombineFasta (https://github.com/njdbickhart/CombineFasta). The chloroplast genome was assembled from PacBio reads using Organelle_PBA (Soorni et al. 2017; Zhang et al. 2020). Illumina reads were aligned to the new genome using Burrows-Wheeler Aligner version 0.7.8 (Li and Durbin 2009), and the assembly was polished with three rounds of Pilon version 1.23 (Walker et al. 2014). Gene annotations were transferred from v1.2 to v2.0 using GMAP (Wu and Watanabe 2005). BLAST and InterProScan were used to assign putative functions.
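The idea of letting the genetic map take priority over Hi-C when placing contigs can be sketched as follows. This is not the custom script used for the release, which is not reproduced here. It is a minimal illustration under assumed inputs: a list of (contig, physical position in bp, linkage group, genetic position in cM) marker tuples, and an optional dictionary of Hi-C orientations used only when the map cannot decide. All names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def order_and_orient(markers, hic_orientation=None):
    """Order contigs within each linkage group and choose an orientation.

    markers: iterable of (contig, bp, linkage_group, cM) tuples.
    hic_orientation: optional dict of contig -> '+' or '-', consulted only
        when the genetic map gives no orientation signal.
    Returns a dict of linkage_group -> list of (contig, orientation).
    """
    hic_orientation = hic_orientation or {}

    by_contig = defaultdict(list)
    for contig, bp, lg, cm in markers:
        by_contig[(lg, contig)].append((bp, cm))

    placed = defaultdict(list)
    for (lg, contig), points in by_contig.items():
        mean_bp = sum(bp for bp, _ in points) / len(points)
        mean_cm = sum(cm for _, cm in points) / len(points)
        # Sign of the bp/cM co-variation: positive means cM increases
        # along the contig, so the contig is kept in forward orientation.
        cov = sum((bp - mean_bp) * (cm - mean_cm) for bp, cm in points)
        if cov > 0:
            orient = "+"
        elif cov < 0:
            orient = "-"
        else:
            # Single marker or no recombination within the contig:
            # fall back to Hi-C if available, otherwise leave unknown.
            orient = hic_orientation.get(contig, "?")
        position = median([cm for _, cm in points])
        placed[lg].append((position, contig, orient))

    return {lg: [(contig, orient) for _, contig, orient in sorted(items)]
            for lg, items in placed.items()}
```

A contig whose markers fall in two linkage groups appears under both groups in this sketch, which is one way a chimeric contig such as the one described above could be flagged for inspection.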