Data from polishCLR: Example input genome assemblies

[ NOTE - Data files added 2022-11-01:
- Test long reads - test.1.filtered.bam_.gz
- Test short reads R1 - testpolish_R1.fastq
- Test short reads R2 - testpolish_R2.fastq
- Chromosome 30 of H. zea - GCF_022581195.2_ilHelZeax1.1_chr30.fasta ]

To produce the best possible de novo, chromosome-scale genome assembly from error-prone Pacific Biosciences continuous long reads (CLR), we developed polishCLR, a publicly available, flexible, and reproducible workflow that is containerized so it can be run on any conventional HPC system. This dataset provides example input primary contig assemblies for testing the workflow and reproducing its demonstrated utility.

The polishCLR workflow can be started from any of three input cases:
Case 1: An unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta
Case 2: A haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta
Case 3: A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta.
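As a rough illustration of the three cases above, the sketch below checks a directory for the file pairs named in each case and reports which case applies. It is not part of polishCLR: the function name `detect_input_case` and the rule of preferring the most processed stage when several pairs are present are assumptions made for this example.

```python
from pathlib import Path

# File pairs named in the three input cases above.
CASE_FILES = {
    1: ("p_ctg.fasta", "a_ctg.fasta"),          # FALCON 2-asm
    2: ("all_p_ctg.fasta", "all_h_ctg.fasta"),  # FALCON-Unzip 3-unzip
    3: ("cns_p_ctg.fasta", "cns_h_ctg.fasta"),  # FALCON-Unzip 4-polish
}


def detect_input_case(directory):
    """Return the case number whose two files are both present, else None.

    Assumption for this sketch: if several pairs are present, the most
    processed stage (highest case number) wins.
    """
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    for case in (3, 2, 1):
        if all(name in present for name in CASE_FILES[case]):
            return case
    return None
```

Calling `detect_input_case(".")` in a directory that holds only `p_ctg.fasta` and `a_ctg.fasta` returns `1`; a directory with none of the pairs returns `None`.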

These example data are the input contig assemblies for the pest Helicoverpa zea. The contigs were built from 49.89 Gb of raw Pacific Biosciences (PacBio) CLR data generated from a single male of the H. zea strain HzStark_Cry1AcR.

Adult H. zea were collected near the USDA-ARS Genetics and Sustainability Agricultural Research Unit, Starkville, MS, USA in 2011, then transported to the USDA-ARS Southern Insect Management Research Unit (SIMRU), Stoneville, MS, USA, where they were maintained in a colony as described previously. Larvae were selected on a diagnostic dose of 2.0 μg ml⁻¹ purified Cry1Ac, and survivors were used to create the strain HzStark_Cry1AcR. HzStark_Cry1AcR was back-crossed every 5 generations to a susceptible line maintained at USDA-ARS SIMRU.

A single male pupa (homogametic, ZZ sex chromosomes) from HzStark_Cry1AcR was dissected laterally into eight ~20 μg sections, and high molecular weight DNA was extracted. PacBio libraries were generated from unsheared DNA using a SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), and 20-hour movies were generated on a single SMRT Cell 1M v3 using the Sequel I system (Pacific Biosciences).

The raw continuous long read (CLR) subread BAM files were converted to FASTQ format using bamtools v. 2.5.1 (Barnett et al. 2011), then used as input for the FALCON assembler (Chin et al. 2016) in the pb-assembly conda environment v. 0.0.8.1 (Pacific Biosciences; default parameters). FALCON-Unzip then created primary and alternate contigs, with one round of haplotype-aware polishing by Arrow (Pacific Biosciences).
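The BAM-to-FASTQ conversion step can be sketched as a small command builder. This is an illustration under assumptions, not the authors' exact invocation: it uses the `bamtools convert` subcommand with its `-format`, `-in`, and `-out` options, and the file names `subreads.bam` and `subreads.fastq` are hypothetical placeholders.

```python
import shlex


def bamtools_fastq_command(bam_path, fastq_path):
    """Build an argument list for converting a BAM file to FASTQ with bamtools."""
    return [
        "bamtools", "convert",
        "-format", "fastq",
        "-in", str(bam_path),
        "-out", str(fastq_path),
    ]


# Hypothetical file names, for illustration only.
cmd = bamtools_fastq_command("subreads.bam", "subreads.fastq")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a system where bamtools is installed; building it as a list rather than a single string avoids shell-quoting problems with unusual file names.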

Field: Value
Tags:
Modified: 2022-11-01
Release Date: 2022-02-09
Frequency: Not Planned
Identifier: 091870fb-e4be-4112-bb30-4cebc7228d0c
Spatial / Geographical Coverage Area: POLYGON ((-88.857421875 33.408798646313, -88.857421875 33.486144342565, -88.737258911133 33.486144342565, -88.737258911133 33.408798646313))
Publisher: Ag Data Commons
Spatial / Geographical Coverage Location: Starkville, MS, USA
Temporal Coverage: May 1, 2011
License:
Contact Name: Stahlke, Amanda
Contact Email:
Public Access Level: Public
Program Code: 005:040 - Department of Agriculture - National Research
Bureau Code: 005:18 - Agricultural Research Service