Sequencing Samples and DNA Extraction

Published Date: 02 Nov 2017

1.1 Sequencing samples and DNA extraction

The genus Gossypium contains about 45 diploid species with 26 chromosomes (2n=26) and 5 tetraploid species with 52 chromosomes (2n=52). The diploid species are divided into 8 genomic groups, the A-G and K genomes. The A, B, E, and F genomes are classified into the African clade because they originated in Africa and Asia1; similarly, the C, G, and K genomes belong to the Australian clade. The D genome clade is indigenous to the Americas. All tetraploid species arose from interspecific hybridization between an A genome species and a D genome species. The A genome species Gossypium herbaceum (A1) and Gossypium arboreum (A2) and the D genome species Gossypium raimondii (D5) are the closest extant relatives of the original tetraploid progenitors2 (Brubaker et al., 1999). The A genome species produce spinnable fiber, but the D genome species do not3. The material sequenced here was G. arboreum L. cv. Shixiya 1, a variety that was highly homozygous after ten generations of selfing. Young leaves were collected from Shixiya 1, and DNA for sequencing was extracted using a previously described CTAB-buffer protocol4 for isolating genomic DNA.

1.2 Sequencing and Assembly

Using a whole-genome shotgun (WGS) sequencing approach, we generated a total of 371.5 Gb of raw paired-end Illumina reads from libraries with insert sizes ranging from 180 bp to 40 kb (Supplementary table 1). Unusable reads were filtered out before assembly: (1) reads in which "N" made up more than 10% of the sequence; (2) reads in which low-quality bases (quality ≤ 7) made up 65% of the bases for short insert-size libraries or 80% of the bases for long insert-size libraries; (3) reads containing more than 10 bp of adapter sequence; (4) short insert-size read pairs whose two ends overlapped by more than 10 bp; and (5) read pairs whose two ends had identical sequences. After filtering, about 193.6 Gb of high-quality data were used for the de novo assembly (Supplementary table 2). The G. arboreum genome size was estimated to be 1.724 Gb by K-mer analysis (Supplementary Fig. 1 & table 3). Reads from short insert-size libraries (<1000 bp) were first assembled into long contigs using the de Bruijn graph-based assembler SOAPdenovo5. To construct scaffolds, all usable reads were aligned onto the contig sequences, and over 80% of all paired-end reads were aligned. We then counted the paired-end relationships between each pair of contigs, analyzed the rates of consistent and conflicting paired ends, and constructed the scaffolds step by step, from short to long insert-size paired ends. To close the gaps inside the scaffolds, which consisted mainly of repeat sequences, we used the paired-end information to retrieve read pairs with one end mapped to a unique contig and the other end located in a gap region, and then performed a local assembly of the collected reads. The final assembly totaled 1.69 Gb, with a contig N50 of 71,999 bp (longest 790,155 bp) and a scaffold N50 of 665,787 bp (longest 5,948,726 bp) (Supplementary table 4).
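The filtering criteria above can be sketched in a few lines of Python. This is only an illustration, not the pipeline the authors ran: it covers criteria (1), (2) and (5), assumes the low-quality cutoff is a Phred score of 7 or below, treats the 65%/80% limits as "at least", and takes qualities as lists of integers. The function name and its parameters are hypothetical. Criteria (3) and (4) need an adapter sequence and an overlap aligner and are left out.

```python
def keep_read_pair(read1, read2, qual1, qual2, short_insert=True,
                   max_n_fraction=0.10, low_quality=7):
    """Return True if a read pair passes filters (1), (2) and (5).

    qual1/qual2 are lists of integer Phred scores, one per base.
    The thresholds are assumptions taken from the text above.
    """
    # 65% low-quality bases for short inserts, 80% for long inserts
    low_fraction_limit = 0.65 if short_insert else 0.80

    for seq, quals in ((read1, qual1), (read2, qual2)):
        # (1) more than 10% of the bases are 'N'
        if seq.upper().count("N") > max_n_fraction * len(seq):
            return False
        # (2) too many bases at or below the low-quality cutoff
        n_low = sum(1 for q in quals if q <= low_quality)
        if n_low >= low_fraction_limit * len(quals):
            return False

    # (5) both ends of the pair have identical sequences
    if read1 == read2:
        return False
    return True
```

A pair is dropped as soon as either end fails a per-read test, which matches the usual practice of filtering paired reads together so that the two files stay in sync.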

Supplementary table 1 Raw data (paired-end libraries, Solexa)

Insert size (bp)   Total data (Gb)   Read length (bp)   Sequence depth (X)   Physical depth (X)
180                17.2              100bp/100bp        10.0                 9.0
250                36.9              150bp/150bp        21.5                 17.9
250                56.3              150bp/150bp        32.7                 27.3
350                15.1              150bp/150bp        8.8                  10.2
350                13.6              100bp/100bp        7.9                  13.8
500                16.7              150bp/150bp        9.7                  16.2
500                14.7              100bp/100bp        8.5                  21.4
800                33.8              150bp/150bp        19.7                 52.4
2000               38.6              90bp/90bp          22.4                 249.4
5000               21.3              90bp/90bp          12.4                 344.0
10000              22.4              90bp/90bp          13.0                 723.5
20000              43.5              90bp/90bp          25.3                 2,810.1
40000              31.9              90bp/90bp          18.5                 4,121.4
40000              9.5               50bp/50bp          5.5                  2,209.3
Total              371.5             ---                216.0                10,625.9
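The two depth columns appear to follow from the other columns. The sketch below assumes that sequence depth is total data divided by the estimated genome size (1.724 Gb), and that physical depth is the number of read pairs multiplied by the insert size, again divided by the genome size. These definitions are inferred from the table values rather than stated in the text, and the function names are hypothetical.

```python
GENOME_SIZE = 1.724e9  # estimated genome size in bp (from the K-mer analysis)


def sequence_depth(total_bases, genome_size=GENOME_SIZE):
    """Average number of times each base is covered by read sequence."""
    return total_bases / genome_size


def physical_depth(total_bases, read_length, insert_size,
                   genome_size=GENOME_SIZE):
    """Average number of times each base is spanned by a read pair's insert."""
    n_pairs = total_bases / (2 * read_length)  # two reads per pair
    return n_pairs * insert_size / genome_size


# First row of the table: 180 bp inserts, 17.2 Gb, 100 bp reads
seq_x = sequence_depth(17.2e9)
phys_x = physical_depth(17.2e9, read_length=100, insert_size=180)
```

Under these assumptions the 180 bp row gives about 10.0X sequence depth and 9.0X physical depth, matching the table; other rows agree to within the rounding of the Gb column. The calculation also shows why the long-insert libraries dominate physical depth: each pair spans far more of the genome than it sequences.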

Supplementary table 2 Clean data (paired-end libraries, Solexa)

Insert size (bp)   Total data (Gb)   Read length (bp)   Sequence depth (X)   Physical depth (X)
180                16.1              100bp/100bp        9.4                  8.4
250                22.1              150bp/150bp        12.8                 10.7
250                44.8              150bp/150bp        26.0                 21.7
350                9.3               150bp/150bp        5.4                  6.3
350                11.2              100bp/100bp        6.5                  11.4
500                9.9               150bp/150bp        5.8                  9.6
500                11.7              100bp/100bp        6.8                  17.0
800                21.1              150bp/150bp        12.3                 32.7
2000               19.7              90bp/90bp          11.5                 127.3
5000               8.3               90bp/90bp          4.8                  134.0
10000              5.9               90bp/90bp          3.4                  190.6
20000              7.4               90bp/90bp          4.3                  478.0
40000              4.1               90bp/90bp          2.4                  529.7
40000              2.0               41bp/41bp          1.2                  567.2
Total              193.6             ---                112.6                2,144.7

Supplementary Fig. 1 K-mer analysis for estimating the G. arboreum genome size

The x-axis is depth (X); the y-axis is the proportion, that is, the frequency at a given depth divided by the total frequency over all depths. Ignoring sequencing errors, heterozygosity and repeats, the 17-mer depth distribution should follow a Poisson distribution. In real data, sequencing errors cause low-depth K-mers to take up a large proportion. In addition, for some genomes, a certain level of heterozygosity produces a sub-peak at half the depth of the main peak, while a certain repeat content produces repeat peaks at integer multiples of the main peak depth.

Supplementary table 3 Estimation of G. arboreum genome size based on K-mer statistics

K    K-mer_Num        Peak_Depth   Genome Size     Used Bases       Used Reads    X
17   44,077,776,963   25           1,723,966,501   53,095,157,475   563,586,282   30.11
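The relationships among the columns of this table can be checked with a short calculation. The sketch below uses the standard K-mer bookkeeping: a read of length L contains L - K + 1 K-mers, so base depth is the K-mer peak depth scaled by L / (L - K + 1). This is an assumption about how the table was derived; the source does not give the formulas, and the function names are hypothetical.

```python
K = 17
USED_BASES = 53_095_157_475
USED_READS = 563_586_282
PEAK_DEPTH = 25


def kmer_count(used_bases, used_reads, k):
    """Total K-mers in the reads: each read loses k - 1 positions."""
    return used_bases - used_reads * (k - 1)


def base_depth(peak_depth, used_bases, used_reads, k):
    """Convert K-mer peak depth to base depth using the mean read length."""
    mean_len = used_bases / used_reads
    return peak_depth * mean_len / (mean_len - k + 1)


kmer_num = kmer_count(USED_BASES, USED_READS, K)
depth_x = base_depth(PEAK_DEPTH, USED_BASES, USED_READS, K)

# Uncorrected estimate: total K-mers divided by the peak depth
naive_genome_size = kmer_num / PEAK_DEPTH
```

With the table's inputs, `kmer_num` equals the reported K-mer_Num and `depth_x` rounds to the reported 30.11. The uncorrected ratio `naive_genome_size` comes out near 1.76 Gb, somewhat above the reported 1.724 Gb, so the published figure presumably includes a correction (for example, for error K-mers) that is not described here.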

Supplementary table 4 Statistics of the final assembly. The G. arboreum genome size was estimated to be 1.724 Gb; the assembled scaffolds covered about 98.3% of the genome.

                Contig                       Scaffold
                Size (bp)        Number      Size (bp)                Number
N90             16,462           22,731      148,103                  2,724
N80             31,563           16,078      273,645                  1,891
N70             44,355           11,935      389,133                  1,373
N60             57,513           8,847       516,381                  997
N50             71,999           6,417       665,787                  708
Longest         790,155          ---         5,948,726                ---
Total size      1,561,222,151    ---         1,694,263,539 (98.3%)    ---
Anchored size   ---              ---         1,532,305,448 (90.4%)    ---
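For readers unfamiliar with the N-statistics in this table, the following sketch computes them under the usual definition: sort sequences from longest to shortest and report the length (and rank) of the sequence at which the running total first reaches the given fraction of the assembly. The function name is hypothetical, and the assembler's own reporting may differ in edge cases such as ties.

```python
def n_statistic(lengths, fraction=0.5):
    """Return (length, count) for Nxx, e.g. fraction=0.5 gives N50.

    length: size of the sequence at which the cumulative sum of sorted
            lengths first reaches `fraction` of the total.
    count:  number of sequences needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example with eight contigs
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
n50_length, n50_count = n_statistic(toy_contigs, 0.5)
n90_length, n90_count = n_statistic(toy_contigs, 0.9)
```

The "Size" and "Number" columns of the table correspond to the two values returned here: for example, 708 scaffolds of at least 665,787 bp make up half of the scaffold assembly.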

1.3 GC content distribution

We compared GC content along the G. arboreum and G. raimondii genomes using 500-bp sliding windows with a 250-bp overlap and found that the two genomes had similar GC content distributions (Supplementary Fig. 5).

Supplementary Fig. 5 GC content of G. arboreum and G. raimondii

The x-axis is GC content and the y-axis is the proportion of windows, that is, the number of windows at a given GC content divided by the total number of windows. Overall, G. arboreum and G. raimondii have similar distribution curves.
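A minimal sketch of the sliding-window calculation follows, assuming 500-bp windows advanced in 250-bp steps and GC content computed over called bases only (N and other ambiguity codes ignored). The function names and the two-decimal binning are illustrative choices, not taken from the source.

```python
from collections import Counter


def gc_windows(seq, window=500, step=250):
    """GC fraction of each full-length window along one sequence."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue  # window is entirely N or other ambiguity codes
        gc = chunk.count("G") + chunk.count("C")
        values.append(gc / called)
    return values


def gc_distribution(values, digits=2):
    """Proportion of windows at each (rounded) GC value."""
    counts = Counter(round(v, digits) for v in values)
    total = len(values)
    return {gc: n / total for gc, n in sorted(counts.items())}


# Toy sequence: 500 bp of pure GC followed by 500 bp of pure AT
toy = "GC" * 250 + "AT" * 250
toy_values = gc_windows(toy)
toy_dist = gc_distribution(toy_values)
```

Running `gc_windows` on each scaffold, pooling the values per genome, and plotting `gc_distribution` for the two species would give curves of the kind described in the figure legend.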

2 Genetic map construction

A total of 154 F2 seedlings from the cross between XX and XX were used to construct the linkage map. Linkage analysis was performed using JoinMap6, version 4.0. RAD-based SNP markers were first tested against the expected segregation ratios with a chi-square test: 1:2:1 in the F2 for ABH-style markers and 1:3 in the F2 for AC-style markers. Distorted markers (P<0.01) were filtered out before the genetic map was constructed. Reads containing SNP markers were then aligned to the scaffolds. First, ABH-style markers were used to build the main framework of 13 linkage groups, matching the number of cotton A-genome chromosomes, with a LOD score of 7.0 initially set as the threshold for linkage group identification. Second, AC-style markers were added to extend the linkage groups by lowering the LOD threshold stepwise to 3. All SNP markers were then used to construct the consensus map with the F2 population model in JoinMap, version 4.0. To reduce the complexity of scaffolds anchored by hundreds of SNP markers, one or two tag SNPs were selected from each scaffold with multiple SNPs: we calculated the recombination fractions between all pairs of SNPs in a scaffold and chose the SNP with the smallest summed recombination fraction. Tag SNPs were used to determine the order of the scaffolds, and SNPs were then used to orient the scaffolds. Finally, 13 high-density linkage groups were generated, and about 90% of the 1.69 Gb of genome sequence could be anchored to the 13 linkage groups (Supplementary Fig. 2).
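The segregation test can be illustrated with a small chi-square goodness-of-fit function. This is a sketch rather than JoinMap's implementation: it assumes ABH-style markers are scored in three genotype classes (expected 1:2:1) and AC-style markers in two classes (expected 1:3), and it uses closed-form p-values that hold only for 1 or 2 degrees of freedom. The function name and the example counts are made up for illustration.

```python
import math


def segregation_test(observed, ratio):
    """Chi-square goodness-of-fit test against an expected ratio.

    observed: genotype class counts, e.g. (A, H, B) or (A, C)
    ratio:    expected ratio, e.g. (1, 2, 1) or (1, 3)
    Returns (chi2, p_value).
    """
    if len(observed) != len(ratio):
        raise ValueError("observed and ratio must have the same length")
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(chi2 / 2))  # exact for df = 1
    elif df == 2:
        p_value = math.exp(-chi2 / 2)             # exact for df = 2
    else:
        raise ValueError("only 2 or 3 genotype classes are supported")
    return chi2, p_value


# Hypothetical counts for 154 F2 individuals
chi2_ok, p_ok = segregation_test((38, 77, 39), (1, 2, 1))    # fits 1:2:1
chi2_bad, p_bad = segregation_test((70, 60, 24), (1, 2, 1))  # distorted
chi2_ac, p_ac = segregation_test((40, 114), (1, 3))          # fits 1:3

# Markers with p < 0.01 would be dropped before map construction
keep = [p >= 0.01 for p in (p_ok, p_bad, p_ac)]
```

With 154 individuals the expected ABH counts are 38.5, 77 and 38.5, so only strongly skewed markers fall below the P<0.01 cutoff.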

Supplementary Fig. 2 The 13 linkage groups of the cotton A genome.

3 Genome annotation

3.1 Repeat annotation

We searched the genome for tandem repeats using Tandem Repeats Finder (TRF404, with parameters: 2 7 7 80 10 50 2000 -d -h)7. Transposable elements (TEs) were identified with two approaches, homology-based and de novo. The homology-based approach relies on commonly used databases of known repeats: we used Repbase8, a collection of many known TEs, together with RepeatProteinMask and RepeatMasker9.

TEs were identified at both the DNA and protein levels. RepeatMasker was applied for DNA-level identification using a custom library (a combination of Repbase, a plant repeat database and our de novo TE library for this genome). At the protein level, RepeatProteinMask, a program in the RepeatMasker package, was used to perform WU-BLASTX searches against the TE protein database. For the de novo approach, we first used four de novo programs, LTR_FINDER10, PILER11, RepeatModeler12 and RepeatScout13, to build a de novo repeat library from the genome. These programs predict repeats in different ways: 1) full-length LTR (long terminal repeat) retrotransposons have a typical structure and contain a ~18-bp sequence complementary to the 3' tail of some tRNA, and LTR_FINDER searches the whole genome for this typical structure; 2) PILER finds repeats by aligning the genome against itself; 3) RepeatScout builds consensus sequences from l-mers using a fit-preferred alignment score; 4) RepeatModeler employs two ab initio repeat prediction programs (RECON and RepeatScout) to identify repeat element boundaries and family relationships from sequence data. Contamination and multi-copy genes were then filtered out of the library. Finally, RepeatMasker was run again with this library to find homologous repeats in the genome and to classify them (Supplementary table 5 & 6).

Supplementary Table 5 Statistics of the repeat content of the cotton-A genome

Type Repeat Size(bp) % of genome

TRF 64,530,987 3.41

RepeatMasker 269,003,599 14.20

RepeatProteinMask 362,823,465 19.16

De novo 1,304,799,791 68.89

Total 1,352,593,545 71.42
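The Total row is smaller than the sum of the individual rows, which is what one would expect if regions annotated by more than one method are counted only once. The source does not describe how the total was computed; the sketch below shows one common way to do it, by merging overlapping intervals on a single sequence before summing their lengths. The function name and the half-open coordinate convention are assumptions.

```python
def merged_length(intervals):
    """Total length covered by a set of possibly overlapping intervals.

    intervals: iterable of (start, end) pairs on ONE sequence,
               half-open coordinates (end is exclusive).
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical annotations from three methods on one scaffold
trf_hits = [(0, 100)]
repeatmasker_hits = [(50, 150)]
de_novo_hits = [(200, 250), (120, 140)]
covered = merged_length(trf_hits + repeatmasker_hits + de_novo_hits)
```

In this toy case the four annotations sum to 270 bp but cover only 200 bp of sequence; applying the same merge per scaffold and summing over the genome gives a non-redundant repeat total.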

Supplementary Table 6 The ratio of TEs in the cotton-A genome.

          RepBase TEs                   TE Proteins                   De novo                         Combined TEs
Type      Length (bp)    % in genome    Length (bp)    % in genome    Length (bp)      % in genome    Length (bp)      % in genome
DNA       7,183,905      0.38           9,970,349      0.53           20,870,967       1.10           30,504,936       1.61
LINE      4,153,912      0.22           13,560,225     0.72           7,587,900        0.40           21,086,591       1.11
LTR       19,460         0.001          0              0              406,982          0.02           426,118          0.02
SINE      258,260,283    13.64          339,264,441    17.91          1,268,466,453    66.97          1,289,922,523    68.11
Other     19,439         0.001          0              0              0                0              19,439           0.001
Unknown   0              0              56,624         0.003          11,253,946       0.59           11,310,562       0.60
Total     269,003,599    14.20          362,823,465    19.16          1,300,259,389    68.65          1,336,438,702    70.56

Note: RepBase TEs: the result of RepeatMasker based on Repbase; TE Proteins: the result of RepeatProteinMask based on Repbase; De novo: the result of RepeatMasker using the library predicted de novo; Combined TEs: the combined results of RepBase TEs, TE Proteins and De novo.

3.2 Gene prediction

The gene prediction strategy was to perform de novo predictions on the repeat-masked genome. We used annotation tools such as AUGUSTUS14 and SNAP15 with appropriate parameters to predict the positions of coding genes. For homology-based prediction, homologous protein sequences were mapped to the genome using TBLASTN; the homologous genomic sequences were extracted together with 2,000 bp of flanking sequence and aligned against the matching proteins again using GeneWise16 to obtain accurate spliced alignments. Transcript-derived sequences such as ESTs, full-length cDNAs and unigenes were aligned against the genome using BLAT17 to generate spliced alignments, and overlapping alignments were filtered and linked using PASA (Supplementary table 7 & Supplementary Fig. 3 & 4).

Supplementary Table 7 Statistics of predicted protein-coding genes

Gene set                    Number   Average transcript length (bp)   Average CDS length (bp)   Average exons per gene   Average exon length (bp)   Average intron length (bp)
De novo    AUGUSTUS         40,614   2,425.77                         1,079.90                  4.63                     233.04                     370.37
De novo    SNAP             59,442   1,282.19                         710.29                    3.37                     210.96                     241.62
Homolog    A. thaliana      32,477   2,847.49                         1,121.47                  4.94                     226.95                     437.90
Homolog    C. papaya        38,851   2,300.92                         953.34                    4.10                     232.38                     434.36
Homolog    P. trichocarpa   38,619   2,647.13                         1,046.35                  4.50                     232.42                     457.11
Homolog    T. cacao         45,765   3,305.25                         1,025.32                  4.40                     232.78                     669.66
Homolog    V. vinifera      33,841   3,299.67                         1,077.07                  5.11                     210.93                     541.27
Unigene                     67,888   2,960.74                         660.43                    2.68                     246.22                     1,367.42
GLEAN                       39,512   2,341.04                         1,090.28                  4.57                     238.38                     349.99
RNA-Seq                     39,729   2,475.35                         1,091.35                  4.54                     240.50                     350.38
Final set                   41,330   2,533.22                         1,082.54                  4.58                     236.47                     367.79

Supplementary Fig. 3 Statistics of gene coverage by RNA data (41,330 genes)

Note: >0.9 means that more than 90% of a gene's sequence was supported by RNA sequences; the other categories are defined in the same way.

Supplementary Fig. 4 comparisons of gene features between four species

3.3 Gene functional annotation

Gene functions were assigned according to the best BLASTP match against the SwissProt and TrEMBL18 databases. InterProScan19 was used to identify the motifs and domains of genes against protein databases including Pfam, PRINTS, PROSITE, ProDom, and SMART. Gene Ontology20 IDs for each gene were obtained from the corresponding InterPro entry (Supplementary table 8).

Supplementary table 8 Functional annotation of gene models

                          Number   Percent (%)
Total                     41,330   100
Annotated     InterPro    28,398   68.71
Annotated     GO          21,726   52.57
Annotated     KEGG        21,496   52.01
Annotated     Swissprot   27,130   65.64
Annotated     TrEMBL      35,395   85.64
Unannotated               5,768    13.96

3.4 ncRNA annotation

tRNAscan21 was used to predict tRNA genes with eukaryote parameters. rRNA sequences were identified with BLAST, based on known plant rRNA sequences from the Rfam database. INFERNAL22 was used to identify miRNAs and snRNAs, also based on Rfam23 (Supplementary table 9).

Supplementary table 9 Statistics of non-coding RNA

Type                Copy     Average length (bp)   Total length (bp)   % of genome
miRNA               431      114.3712              49,294              0.002603
tRNA                2,289    75.06116              171,815             0.009072
rRNA (all)          10,464   117.6673              1,231,271           0.065011
rRNA    18S         3,732    173.4611              647,357             0.03418
rRNA    28S         1,317    100.4085              132,238             0.006982
rRNA    5.8S        479      100.9165              48,339              0.002552
rRNA    5S          4,936    81.71333              403,337             0.021296
snRNA (all)         7,619    107.7172              820,697             0.043333
snRNA   CD-box      7,460    106.85                797,101             0.042087
snRNA   HACA-box    29       128.3448              3,722               0.000197
snRNA   splicing    130      152.8769              19,874              0.001049

4 LTR analysis

4.1 Distribution of LTR and its relationship with genes

LTR retrotransposons exhibit family-specific, non-uniform distributions along chromosomes in the maize and Gossypium genomes: copia-like elements are overrepresented in gene-rich euchromatic regions, whereas gypsy-like elements are overrepresented in gene-poor heterochromatic regions24 25. Taking into account the larger content of gypsy-like than copia-like elements, our data confirmed the previous conclusion24 that copia-like elements are more likely than gypsy-like elements to insert into gene-rich regions. A statistical analysis of the distance from each gene to the nearest TE also supports this insertion bias between the two types of TEs (Supplementary Fig. 5).

Supplementary Fig. 5 Statistical analysis of the distance from each gene to the nearest TE for Ty1/Copia and Ty3/Gypsy

Recent retroelement activity is widely distributed across the G. arboreum genome: young LTR retrotransposon insertions from the last 5 million years appear randomly distributed along the genome. G. arboreum has a higher ratio of genes with an LTR nearby than G. raimondii, and the distance is skewed toward larger values in G. arboreum. Together, these observations are consistent with a model26 27 under which selection purges transposable elements with deleterious effects on adjacent genes, such that transposable elements more distant from genes preferentially survive, and with transposable element elimination having been more efficient in the G. raimondii genome.

4.2 LTR insert time analysis

Intact LTRs of the A genome and the D genome were predicted separately using the Windows-based software LTR_STRUC28. We constructed ancestral subfamilies according to the following rules: members of the same family should share the PPT (polypurine tract) and PBS (primer-binding site) sequences, and the e-value among transposable gene sequences should be less than 1e-10 29. For each family, we aligned all intact LTRs with MUSCLE30, manually corrected the alignment in MEGA5.531, and then constructed an ancestral sequence for the family using the cons program in the EMBOSS package32. To find more solo-LTRs, we queried the ancestral sequences of the different families against the LTR predictions from the standard prediction pipeline and followed the 80-80-80 rule (identity above 0.8, aligned fraction above 80%, alignment length above 80 bp) to decide which family each solo-LTR belonged to33. All intact LTRs and solo-LTRs were used to calculate insertion times with the formula Time = K/r, where K is the distance between aligned pairs and r is the nucleotide substitution rate. The value of r was set to 7e-9, and K was calculated with the distmat program in the EMBOSS package under the Kimura two-parameter model.
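The dating step can be sketched as follows. The distance uses the standard Kimura two-parameter formula, K = -0.5 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions; the time uses Time = K/r exactly as written above, with r = 7e-9. This is an illustration, not EMBOSS distmat itself, and the function names are hypothetical. Many LTR dating studies instead divide by 2r because both LTRs accumulate substitutions; the code follows the formula as stated in this text.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Columns with gaps or ambiguous bases in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time(k, rate=7e-9):
    """Insertion time in years, using Time = K / r as given in the text."""
    return k / rate


# Toy alignment: 100 sites with 2 transitions and 1 transversion
ltr_a = "A" * 100
ltr_b = "G" * 2 + "C" * 1 + "A" * 97
toy_k = k2p_distance(ltr_a, ltr_b)
toy_time = insertion_time(toy_k)
```

For the toy pair the distance is about 0.031, which under these assumptions corresponds to an insertion a little over 4 million years ago.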

5 Evolutionary analysis

5.1 Comparative gene inventories of Gossypium and angiosperm

Orthologous genes reveal important relationships among species. We used OrthoMCL34 to identify orthologous genes among the eight species (C. papaya, A. thaliana, G. arboreum, G. raimondii, T. cacao, P. trichocarpa, R. communis and O. sativa) (Supplementary table 9). First, BLASTP was used to compare all protein sequences against a database containing the protein datasets of all the species, with an E-value cutoff of 1e-5. Genes were then clustered with OrthoMCL (inflation parameter: 1.5) (Supplementary Fig. 6).

Supplementary table 9

Species        Genes number   Genes in families   Unclustered genes   Family number   Unique families   Average genes per family
A. thaliana    26,637         22,837              3,800               13,633          724               1.68
T. cacao       28,624         23,927              4,697               16,226          472               1.47
G. raimondii   40,976         33,943              7,033               20,507          644               1.66
G. arboreum    41,330         32,608              8,722               20,801          534               1.57
C. papaya      25,599         18,327              7,272               13,736          486               1.33
R. communis    30,984         20,356              10,628              15,483          765               1.31
O. sativa      33,127         22,133              10,994              12,747          1,646             1.74
V. vinifera    25,329         18,817              6,512               13,549          637               1.39

Supplementary Fig. 6 a

Supplementary Fig. 6 b

5.2 Phylogenetic Analysis

To gain insight into the phylogenetic relationships among the genomes, we used single-copy gene families to reconstruct the phylogenetic tree of the following species: C. papaya, A. thaliana, G. arboreum, G. raimondii, T. cacao, P. trichocarpa, R. communis, G. max and O. sativa. First, we performed multiple alignments of the protein sequences of each single-copy gene family with MUSCLE and converted the protein alignments to CDS alignments using a Perl script. Phase-1 sites were extracted from each family and concatenated into one supergene per species, and MrBayes 3.1.235 was used to construct the phylogenetic tree. Branch-specific dN and dS were estimated with codeml in PAML36 under the branch model.
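The conversion from protein alignment to CDS alignment, and the extraction of one codon position, can be sketched as below. The authors used a Perl script that is not shown; this Python version is only an illustration with hypothetical function names. It assumes each CDS is in frame and encodes the ungapped protein codon for codon. Which codon position "phase-1" denotes is not defined in the text, so the position is left as a 0-based parameter.

```python
def protein_to_codon_alignment(aligned_protein, cds):
    """Thread a CDS onto its aligned protein: 3 gaps per protein gap."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues > len(codons):
        raise ValueError("CDS is too short for the aligned protein")
    pieces = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == "-":
            pieces.append("---")
        else:
            pieces.append(codons[next_codon])
            next_codon += 1
    # Any trailing codon (e.g. a stop codon) is simply not used
    return "".join(pieces)


def codon_position_sites(codon_alignment, position):
    """Take every third column, starting at `position` (0, 1 or 2)."""
    if position not in (0, 1, 2):
        raise ValueError("position must be 0, 1 or 2")
    return codon_alignment[position::3]


def build_supergene(family_alignments, position):
    """Concatenate the chosen codon position across gene families."""
    return "".join(codon_position_sites(aln, position)
                   for aln in family_alignments)


# Toy example for one species in two families
fam1 = protein_to_codon_alignment("M-K", "ATGAAATAA")
fam2 = protein_to_codon_alignment("MV", "ATGGTT")
supergene = build_supergene([fam1, fam2], position=0)
```

Because gaps in the protein alignment become three-base gaps, columns stay homologous across species, and the per-species supergenes built from the same families have equal length and can be passed to a tree-building program.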

5.3 Estimation of Divergence Time

The Bayesian relaxed molecular clock (BRMC) approach was used to estimate species divergence times with the program MCMCTREE, which is part of the PAML package. The "independent rates molecular clock" and the "HKY85" model in MCMCTREE were used in our calculation. The MCMC process was run to sample 200,000 times, with the sample frequency set to 2, after a burn-in of 20,000 iterations. The "fine-tune" parameters were set so that the acceptance proportions fell in the interval (0.15, 0.7). Other parameters were left at their defaults. Two independent runs were performed to check convergence. The calibration times, 148 Mya for the Arabidopsis-rice divergence and 109 Mya for the soybean-Arabidopsis divergence, were obtained from the TimeTree database.

5.4 The expansion of gene families

Numbers on each branch report the number of genes gained or lost. Pie charts near branches show the proportion of families that expanded (green), contracted (red), and did not change (blue). The largest pie chart shows the proportion of all families that changed (orange) or remained constant (blue) across all lineages. Changes along long branches and on the in-group branch may be underestimates because of multiple gains and losses within individual families and/or lack of phylogenetic resolution.

To gain greater insight into the evolutionary dynamics of the genes, we inferred the expansion and contraction of ortholog clusters among T. cacao, G. raimondii, G. arboreum, A. thaliana, C. papaya, R. communis and P. trichocarpa. We used CAFE37, a maximum likelihood method, to estimate the ortholog cluster sizes in their common ancestor, and then defined expansion and contraction by comparing the cluster sizes of the ancestor and each of the current species. The gene number of G. arboreum is similar to that of G. raimondii, and there are no major differences in their Gene Ontology distributions. Gene copy numbers differed in signal transduction/transducer activity, receptor activity and cell communication, but were nearly the same in most other functional groups (Supplementary Fig. 7).

Supplementary Fig. 7

5.5 Gene synteny analysis and whole genome alignment

Syntenic blocks between two genomes were identified in several steps. First, a BLASTP comparison (e-value 1e-5) was performed to find pairwise gene matches between the two gene sets. The BLAST output normally contains multiple matches for each gene pair, so we selected the match with the minimum e-value. Syntenic blocks (≥5 genes per block) were then constructed with the program MCscan38 from the aligned gene pairs identified before. In the alignment results, each aligned block represents an orthologous pair derived from the shared ancestor, and the sequences containing the genes were all selected to show the inter-genome relationships together with their length information. Three genomes (G. raimondii, G. arboreum and T. cacao) were compared with each other. The method described above was also used to identify paralogous segment pairs that arose from genome duplication in the lotus genome, by aligning the genome to itself. The 4DTv value of each block was calculated and corrected using the HKY model. Whole-genome alignment between G. raimondii and G. arboreum was done with LASTZ39 after first masking the repeat regions.
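The best-match selection described above can be sketched as follows, assuming the BLASTP results are in the 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The input format and the function name are assumptions; the source does not say how the selection was implemented.

```python
def best_match_per_pair(blast_lines, max_evalue=1e-5):
    """Keep, for each (query, subject) gene pair, the lowest e-value.

    blast_lines: iterable of tab-separated BLAST tabular lines.
    Returns a dict mapping (query, subject) -> e-value.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue  # does not pass the 1e-5 cutoff
        pair = (query, subject)
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return best


# Hypothetical BLASTP output with two matches for the same gene pair
example_lines = [
    "g1\tt1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "g1\tt1\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t80",
    "g2\tt2\t40.0\t30\t18\t0\t1\t30\t1\t30\t0.5\t20",
]
pairs = best_match_per_pair(example_lines)
```

The resulting one-row-per-pair table is the kind of input that synteny tools expect; pairs failing the e-value cutoff, such as g2/t2 in the example, are dropped entirely.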

6 Other analysis

6.1 Disease resistance-related genes

The largest class of characterized R genes encodes intracellular proteins that contain a nucleotide-binding site (NBS) and carboxy-terminal leucine-rich repeats (LRR)40 and that play an important role in resistance to pathogens and in the cell cycle41. This gene family is abundant in plant genomes, accounting for 0.6% to approximately 2% of the total gene number42, and for 0.8% in G. arboreum. The NBS-encoding R gene family in G. arboreum has 331 members, similar to cacao, poplar and grape, but about twice as many as Arabidopsis and six times as many as papaya. The NBS family can be divided into multiple subfamilies with distinct domain organizations, including 271 NBS-LRR genes and 60 NBS genes that lack an LRR (Table S13 and Fig. S7).
