Bacterial Genome Sequencing Results

Summary

Sample_3

2024-11-14

Overview of assembly and annotation
  • This is a circos plot generated by Bakta (v1.6.1).
  • It is a circular representation of all of the contigs, and contains information such as GC skew and feature location. See Bakta’s explanation for more detail.
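
For readers unfamiliar with GC skew, it is conventionally computed per window as (G - C) / (G + C). The Python sketch below illustrates that formula only. The function name and window size are illustrative assumptions, and this is not Bakta's own code.

```python
def gc_skew(seq: str, window: int = 1000) -> list[float]:
    """Return (G - C) / (G + C) for each non-overlapping window of seq."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C get a skew of 0.0 to avoid dividing by zero.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Toy sequence (not from this sample): a G-rich window followed by a C-rich one.
print(gc_skew("GGGCATAT" + "CCCGATAT", window=8))  # [0.5, -0.5]
```

A sign change in the skew along a circular chromosome is often used as a hint for the location of the replication origin or terminus.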

Statistics

Total bp sequenced: 419,562,910 bp
Total number of reads: 196,194
Longest read: 75,291 bp
Raw coverage: 33x
Assembled coverage: 32x
Genome size: 12.7 Mb
Number of contigs: 3
Number of genes annotated: 10,505
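
As a sanity check, the raw coverage can be reproduced from the other numbers in this report. The sketch below assumes that raw coverage is simply total sequenced bases divided by assembled genome size (the report does not state the exact formula), and it takes the genome size as the sum of the three contig lengths listed in the Mash table.

```python
def coverage(total_bases: int, genome_size_bp: int) -> float:
    """Average depth, assuming coverage = sequenced bases / genome size."""
    return total_bases / genome_size_bp


# Contig lengths as reported in the Mash table of this report.
contig_lengths_bp = [10_331_010, 2_309_886, 23_887]
genome_size_bp = sum(contig_lengths_bp)  # 12,664,783 bp, i.e. about 12.7 Mb

raw = coverage(419_562_910, genome_size_bp)
print(f"Genome size: {genome_size_bp:,} bp, raw coverage: about {round(raw)}x")
```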


All data is contained within the accompanying folders.




Results


Contig analysis

Assembly graph
  • This is an assembly graph generated by Bandage (v0.8.1).
  • It shows the contigs and their connections in the assembly.
  • Generally, the best graph you can have for a bacterial genome is a single circle (and possibly some other circles if you have plasmids).
  • If a contig is black, it has a relative depth of x1.
  • If a contig is red it has a higher relative depth, suggesting it has higher abundance in the sample.
  • Note: This is a graph of the pre-polished assembly. During Medaka polishing, some contigs may be deleted if they are not well-supported, meaning that this plot may show extra contigs.


Species analysis


Mash

Taxonomic classification
  • This shows the highest scoring Mash (v2.3) hit against a prepared RefSeq database for each contig in your assembly (database download in section 3.1.4).
  • The percent identity is estimated from the number of kmers that matched the reference.
contig     length (bp)   species                                                            est. %ID
contig_1   10,331,010    NZ_LK022848.1 Streptomyces iranensis genome assembly Siranensis    95.3
contig_2   2,309,886     Streptomyces autolyticus strain CGMCC0516 plasmid unnamed7         92.8
contig_4   23,887
Note: Mash may incorrectly identify bacterial contigs as plasmids or phages.

See the FAQ for more information.
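
To make the k-mer based estimate concrete, the sketch below applies the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index between two sketches, and reports identity as 100 x (1 - D). The k-mer size of 21 and the sketch size of 1,000 are Mash's defaults and are assumptions here, since the report does not say which settings were used. The function name is illustrative.

```python
import math


def mash_identity(shared: int, sketch_size: int, k: int = 21) -> float:
    """Estimate percent identity from the number of shared sketch hashes."""
    if shared <= 0:
        return 0.0  # no shared k-mers: treat as no detectable identity
    j = shared / sketch_size  # Jaccard index estimate
    distance = -math.log(2 * j / (1 + j)) / k
    distance = min(1.0, max(0.0, distance))  # keep the distance within [0, 1]
    # Round to one decimal place so reports do not show floating-point tails.
    return round(100 * (1 - distance), 1)


# Toy example: half of the sketch hashes are shared.
print(mash_identity(500, 1000))  # 98.1
```

Note how quickly identity drops with the shared fraction: sharing only half of the sketch hashes still corresponds to roughly 98% identity.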

Sourmash

An alternative method
  • Sourmash (v4.6.1) offers much the same functionality as Mash in the way we use it; however, we query Sourmash against a prepared GenBank database.
  • Its algorithm and database are calibrated differently from those used in the Mash analysis step, so it is often less sensitive at finding exact hits.
contig     superkingdom   phylum           class            order              family              genus          species                      strain
contig_1   Bacteria       Actinobacteria   Actinobacteria   Streptomycetales   Streptomycetaceae   Streptomyces   Streptomyces rapamycinicus
contig_2   Bacteria       Actinobacteria   Actinobacteria   Streptomycetales   Streptomycetaceae   Streptomyces   Streptomyces rapamycinicus
contig_4




Overview of sequencing reads

Plot of all reads
  • Each dot on the graph represents a single read, plotted by its length and its mean Phred quality score.
  • Red dots are reads not used in the assembly.
  • Blue dots are reads used in the assembly.
  • The histograms along the top and right show the distribution of read lengths and Phred scores, respectively.
  • The median Phred line shows the median Phred score of all of the reads used in the assembly.
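
A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q10 means roughly 1 error in 10 bases and Q20 roughly 1 in 100. The short sketch below shows the conversion in both directions. It illustrates the standard definition only and does not reproduce how the per-read mean in the plot was calculated.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score implied by per-base error probability p (p must be > 0)."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q)):,} bases")
```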

rN50: 6,848 bp
rNG50: 22,434 bp
What do rN50 and rNG50 mean?
  • These are assembly statistics that help assess read/assembly quality.
  • rN50 is the “reads N50”: the length of the shortest read in the smallest set of longest reads that together contain 50% of the total bases in the reads used for assembly. This is a good reference for better understanding N50, though note that we are reporting rN50, which uses reads instead of contigs.
  • rNG50 is the “genome reads N50”. We made this statistic up! However, it is analogous to NG50, which takes into account the known or estimated genome size. After assembly, we use the assembled genome length to find the read length at which reads of that length or longer together contain enough bases to cover half of the genome.
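
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration of the definitions as described here, not the code used to produce the numbers in this report, and the function names are made up for the example.

```python
from typing import Iterable, Optional


def length_at_half(read_lengths: Iterable[int], total: float) -> Optional[int]:
    """Length L such that reads of length >= L contain at least total / 2 bases."""
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return None  # the reads never reach half of `total`


def rn50(read_lengths: Iterable[int]) -> Optional[int]:
    """Reads N50: the reference total is the sum of all read lengths."""
    lengths = list(read_lengths)
    return length_at_half(lengths, sum(lengths))


def rng50(read_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Genome reads N50: the reference total is the genome size."""
    return length_at_half(read_lengths, genome_size)


toy_reads = [10, 8, 6, 4, 2]   # toy read lengths, 30 bases in total
print(rn50(toy_reads))         # 8: reads 10 + 8 reach 15 of 30 bases
print(rng50(toy_reads, 20))    # 10: the longest read alone covers half of 20
```

Because sequencing depth is usually many times the genome size, far fewer reads are needed to reach half of the genome than half of all sequenced bases, which is why rNG50 is typically much larger than rN50.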



Assembly completeness


CheckM

Your genome assembly passed the reference gene check.

Is the assembly complete?
  • CheckM (v1.2.2) assesses the quality of genome assemblies. It uses a set of reference gene profiles to assess the completeness, contamination, and strain heterogeneity of a given assembly.
  • This displays the contigs on which the specific CheckM reference genes were found. The reference genes are almost always chromosomal, so if a contig plot shows that it contains reference genes, it is likely chromosomal.
  • Why do my completeness and contamination numbers look slightly off?

Marker lineage Completeness Contamination
o__Actinomycetales (UID1696) 100 3.95




FAQ

Where are my FASTA files? Where are my annotated DNA sequences?

  • If your sequencing run succeeded, these files can be found within the accompanying annotation folder, which was generated by Bakta. Bioinformatics file extensions can be confusing and somewhat inconsistent. Here is a quick guide to some of the file extensions you will find in your annotation folder:

    • FASTA files:
      • .fna (contig nucleotide sequences)
      • .faa (protein amino acid sequences)
      • .ffn (gene nucleotide sequences)
    • GenBank files:
      • .gbff (annotated contig sequences)
  • You should be able to open any of these files in your favorite sequence editor (e.g.: Geneious, SnapGene, Benchling, MacVector, CLC, UGENE, etc.).
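
If you prefer scripting to a sequence editor, FASTA files such as the .fna file are plain text and easy to read directly. The sketch below is a minimal, dependency-free parser that reports the length of each record. The function name and the toy records are illustrative assumptions.

```python
import io
from typing import Dict, Iterable


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name to its sequence length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after ">".
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy FASTA text standing in for a real file.
example = io.StringIO(">contig_A demo\nACGT\nAC\n>contig_B\nGGG\n")
print(contig_lengths(example))  # {'contig_A': 6, 'contig_B': 3}
```

To use it on a real file, pass an open file handle instead of the toy text, for example the handle returned by opening the .fna file in your annotation folder.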

What is a “contig”?

  • This is a contiguous segment of DNA assembled from the sequencing reads. It may or may not represent an entire genome or entire plasmid. Generally, if your assembly graph is composed solely of circles, each contig represents a single genome or plasmid.

My assembly graph is not a circle – why does it look like a complete mess?

  • If your plot is not a circle and has a more complicated path, this could be due to several reasons, including:
    • Your DNA sample was degraded and there were not enough longer reads to bridge repetitive regions (investigate your reads to assess). Check out our guide on preparing bacterial DNA samples for sequencing. This is the most likely cause.
    • Your genome is incredibly repetitive due to IS elements or other genomic features.
    • Your genome was very large and there were not enough reads to satisfactorily cover the genome.
    • You had a contaminant organism which reduced the overall genome coverage.

How do you determine which reads are used in the assembly?

  • We remove the worst 5% of reads, ranked by a quality score that takes both Phred quality and read length into account.
  • We further remove excess reads if the total coverage of the reads used in the assembly is greater than 100x.
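
The coverage cap in the second bullet is simple arithmetic, sketched below. The function name is illustrative, and the genome size used in the example is the sum of the contig lengths reported for this sample.

```python
from typing import Optional


def downsample_target(total_bases: int, genome_size_bp: int,
                      max_coverage: float = 100.0) -> Optional[int]:
    """Return the number of bases to keep, or None if no downsampling is needed."""
    cap = int(max_coverage * genome_size_bp)
    return cap if total_bases > cap else None


# This sample: about 33x raw coverage, so the 100x cap is never reached.
print(downsample_target(419_562_910, 12_664_783))  # None

# Hypothetical deeper run: 2 Gb of reads for a 5 Mb genome is capped at 500 Mb.
print(downsample_target(2_000_000_000, 5_000_000))  # 500000000
```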

Why does my Mash analysis say that my contig is a plasmid/phage when that is clearly incorrect?

  • Because Mash searches a RefSeq database that includes bacterial genomes, plasmids, and phages, a chromosomal contig can be labeled as a plasmid or phage if that plasmid or phage sequence scores higher than the genome it is contained within.
  • This is why we use Sourmash with a different reference database and search strategy – if Mash gives strange results this can be a good second opinion.
  • CheckM will also search bacterial and archaeal profiles, giving an independent third opinion.

My genome looks good, but why is CheckM reporting only 98% complete and 3% contamination?

  • While CheckM is a fantastic tool, it rarely scores genomes as 100% complete with 0% contamination, even when run on “perfect” genomes. Biology is weird, and the supposedly single-copy genes that CheckM looks for may be legitimately missing or duplicated, leading to slightly erroneous results.
  • Some of the discrepancies may be explained by small errors in the assembly process (see below), but it is more likely that the above is causing the discrepancy.
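
To see why a single missing or duplicated marker gene shifts these numbers, consider the deliberately simplified sketch below. It treats every marker gene independently, whereas CheckM groups markers into collocated sets and weights them, so this is an illustration of the idea rather than CheckM's actual calculation. The marker names and copy counts are hypothetical.

```python
from typing import Dict, Tuple


def completeness_contamination(copy_counts: Dict[str, int]) -> Tuple[float, float]:
    """Simplified estimate from per-marker copy numbers (not CheckM's real method)."""
    n = len(copy_counts)
    if n == 0:
        raise ValueError("at least one marker gene is required")
    found = sum(1 for c in copy_counts.values() if c >= 1)    # markers present
    extra = sum(c - 1 for c in copy_counts.values() if c > 1)  # surplus copies
    return 100 * found / n, 100 * extra / n


# Hypothetical markers: one is missing and one is duplicated.
toy = {"marker_1": 1, "marker_2": 1, "marker_3": 2, "marker_4": 0}
print(completeness_contamination(toy))  # (75.0, 25.0)
```

With a realistic set of several hundred markers, one legitimately duplicated gene would move the contamination estimate by only a fraction of a percent, which is the scale of deviation described above.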

My contigs are circles and everything looks good – is my genome assembly perfect?

  • It’s possible! However, it’s more likely that your genome assembly still contains SNPs (single nucleotide polymorphisms), especially in regions where ONT reads have higher error rates, such as long homopolymers. It’s also possible, though less likely than SNPs, that your genome assembly contains medium-sized structural errors (e.g.: deleting 50 bp from the end of a contig). For the very best possible assembly, we recommend trying your hand at using Trycycler to assemble your genomes and polishing with Illumina reads. Trycycler is a manual pipeline and can be quite labor intensive. While very high quality, the genome assemblies we return should be treated as draft assemblies.

How is the assembly generated?

  1. Remove the worst 5% of FASTQ reads via Filtlong v0.2.1 (default parameters)

  2. Downsample the reads to 250 Mb via Filtlong, then create a rough sketch of the assembly with Miniasm v0.3

  3. Using information acquired from the Miniasm assembly, re-downsample the reads to ~100x coverage (do nothing if there isn’t at least 100x coverage) with heavy weight applied to removing low quality reads (filtlong --mean_q_weight 10)

  4. Run a Flye v2.9.1 assembly with parameters selected for high quality ONT reads

  5. Polish Flye assembly via Medaka v1.8.0 using the reads generated in step 3

  6. Run several analyses:

    • annotation
    • contig analysis
    • genome completeness and contamination
    • species / plasmid identification
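
For orientation, the sketch below builds plausible command lines for steps 1, 3, 4, and 5 without running anything. Only the tool names, versions, and --mean_q_weight 10 come from the steps above. The other flags (--keep_percent, --target_bases, --nano-hq, and the medaka_consensus options), the file names, and the output layout are assumptions based on each tool's public command-line interface, not the exact commands used to produce this report. The Miniasm step is omitted.

```python
from typing import List


def build_commands(reads: str, genome_size_bp: int,
                   outdir: str = "assembly_out") -> List[List[str]]:
    """Return illustrative command lines (as argument lists) for the main steps."""
    filtered = f"{outdir}/filtered.fastq"        # hypothetical file names
    downsampled = f"{outdir}/downsampled.fastq"
    return [
        # Step 1 (assumed flag): discard the worst 5% of read bases.
        # Filtlong writes to stdout, which would be redirected to `filtered`.
        ["filtlong", "--keep_percent", "95", reads],
        # Step 3: cap at about 100x coverage, weighting mean quality heavily.
        # Stdout would be redirected to `downsampled`.
        ["filtlong", "--target_bases", str(100 * genome_size_bp),
         "--mean_q_weight", "10", filtered],
        # Step 4 (assumed flag): Flye mode for high-quality ONT reads.
        ["flye", "--nano-hq", downsampled, "--out-dir", f"{outdir}/flye"],
        # Step 5: polish the Flye assembly with the same reads.
        ["medaka_consensus", "-i", downsampled,
         "-d", f"{outdir}/flye/assembly.fasta", "-o", f"{outdir}/medaka"],
    ]


for command in build_commands("reads.fastq", 5_000_000):
    print(" ".join(command))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to a process runner later, since no shell quoting is involved.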