Variants

Intro

For M. tuberculosis, typically one to four thousand variants are called. They are provided in VCF format and as a table that also lists amino acid changes and any known association with genotype or resistance.

Handling

Most variants will be SNPs. When a variant is located within a CDS, the corresponding gene appears in column "Region", together with an "AA exchange" and a "PAM1" value. The Point Accepted Mutation 1 (PAM1) value gives the probability (multiplied by 10,000 for clarity) of the particular aa exchange occurring, given that 1% of the aa have changed (99% similarity, i.e. for very similar proteins). In practice, exchanges between aa of equivalent charge and size are more likely, whereas an exchange to a very dissimilar aa yields a small score or even zero (e.g. Arg→Asp).
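The scaling described above can be sketched as follows. This is only an illustration of the factor of 10,000; the function names are hypothetical and the probability used in the usage note is a made-up placeholder, not a value taken from the actual PAM1 matrix.

```python
PAM1_SCALE = 10_000  # scaling factor stated in the text above


def probability_to_pam1(probability):
    """Convert an exchange probability (0..1) into the scaled table score."""
    return round(probability * PAM1_SCALE)


def pam1_to_probability(score):
    """Convert a scaled table score back into a probability (0..1)."""
    return score / PAM1_SCALE
```

For example, a placeholder probability of 0.0013 would be listed as 13, and a score of 0 corresponds to an exchange that is not expected at this level of similarity.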

The table lists all detected variants, each genome position being represented by one line. If multiple alleles have been called (type=MUL), the sample column lists all nucleotides observed, delimited by commas. For the time being, amino acid exchanges as well as PAM1 probabilities reflect only the main allele which is listed first.

However, if a variant position lies within overlapping CDSs, more than one region, aa exchange, and PAM1 probability are provided, separated by semicolons.
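The delimiter conventions described above (commas between alleles, semicolons between overlapping regions) could be unpacked roughly as follows. This is a minimal sketch: the dictionary keys "Sample", "Region", "AA exchange", and "PAM1" are assumed column names, and the example row contains invented values.

```python
def split_variant_row(row):
    """Split the multi-valued cells of one variant table row.

    `row` is a dict keyed by column name; the key names used here
    are assumptions for this sketch, not a documented schema.
    """
    # Alleles are comma-delimited; the main allele is listed first.
    alleles = [a.strip() for a in row["Sample"].split(",")]
    # Overlapping CDSs yield semicolon-separated, parallel entries.
    regions = [r.strip() for r in row["Region"].split(";")]
    exchanges = [x.strip() for x in row["AA exchange"].split(";")]
    pam1 = [p.strip() for p in row["PAM1"].split(";")]
    return {
        "main_allele": alleles[0],
        "other_alleles": alleles[1:],
        "annotations": list(zip(regions, exchanges, pam1)),
    }


# Invented example row (gene names and values are placeholders).
example = {
    "Sample": "T,C",
    "Region": "geneA;geneB",
    "AA exchange": "Ala10Val;Gly5Gly",
    "PAM1": "13;9935",
}
result = split_variant_row(example)
```

Note that the aa exchange and PAM1 entries in `annotations` refer to the main allele only, as stated above.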

A control in the "Region" column header adds information about start:stop(s), product(s), and type(s) (e.g. tRNA or CDS).

In column "AA exchange", potential start codons are marked with an appended s in round brackets. Mutations that create or abolish a potential start codon are not considered silent, even if the amino acid does not change.
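When processing the table programmatically, the start codon marker may need to be separated from the exchange itself. The sketch below assumes that the marker appears literally as "(s)" at the end of the cell; the function name and the exchange notation in the usage note are hypothetical.

```python
def parse_aa_exchange(cell):
    """Return (exchange, is_potential_start) for one "AA exchange" entry.

    Assumes the start codon marker is a trailing "(s)"; adjust this
    if the table uses a different placement.
    """
    text = cell.strip()
    is_start = text.endswith("(s)")
    if is_start:
        # Drop the three marker characters and any space before them.
        text = text[:-3].strip()
    return text, is_start
```

With this assumption, an entry such as "Val1Val (s)" yields ("Val1Val", True) and should not be treated as silent, whereas "Ala10Val" yields ("Ala10Val", False).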

Export

  • WYSIWYG
    Under Windows (regardless of which browser you use), you can always copy/paste or drag/drop (like this: http://www.mrkent.com/tools/converter/) into an Excel spreadsheet. Whole pages can also be exported to Excel via File/Save as (Filter: all files, manually change the file extension from e.g. .bam to .xls). When you open this file, Windows will complain that the format differs from the one specified by its extension (say "yes") and that it cannot find the .css style sheet (say "Ok"), but afterwards it will transform the file into a slightly contorted what-you-see-is-what-you-get version of the web page. It retains formatting and hyperlinks but looks horrible. Save it as CSV and reopen it in order to extract the "pure data" (unformatted table content without further adornment).
    Unfortunately, directly saving the pages as CSV or trying the above procedures under Linux will yield the text of a wait page instead of table data, which is why we provide an alternative way to extract the table contents:
  • Pure Data
    (table contents, only)
    The export "buttons" (hyperlinks actually) on the Variant, Genotype, and Resistance pages provide a shortcut to export the unformatted table contents into any spreadsheet program. At the moment, this works in neither IE nor Konqueror, but it works nicely (regardless of the OS) in e.g.
    • Safari (save appearing page as .csv)
    • Chrome (click on the downloaded file in the lower left corner)
    • Opera (directly open in OpenOffice or Excel)
    • Firefox (directly open in OpenOffice or Excel)
    In order not to encounter any strange-looking special characters, the spreadsheet should be imported as UTF-8. Columns are separated by commas, text is flanked by inverted commas.
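
The exported file can also be read without a spreadsheet program. The following is a minimal sketch using Python's standard csv module; it assumes that the "inverted commas" are double quotation marks and that the first line holds the column names. The file name in the usage note and the sample columns are placeholders.

```python
import csv
import io


def read_rows(stream):
    """Parse comma-separated, quoted table content into a list of dicts."""
    return list(csv.DictReader(stream, delimiter=",", quotechar='"'))


def read_export(path):
    """Read an exported table from disk, decoding it as UTF-8."""
    with open(path, newline="", encoding="utf-8") as handle:
        return read_rows(handle)


# Small in-memory sample with invented columns and values.
sample = io.StringIO('"Position","Region"\n"1234","geneA"\n')
rows = read_rows(sample)
```

A saved export would then be loaded with something like read_export("variants.csv"), where the file name is whatever you chose when saving.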

Computational steps

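# Pile up the aligned reads against the reference (-f); -B disables BAQ (base alignment quality) computation.
/usr/local/bin/samtools mpileup -B -f ${h37}sami.fasta ${dir}processed/${bamfile} > ${dir}processed/${bamfile}.mpileup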

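# Call SNPs and indels (-glm BOTH) with the GATK UnifiedGenotyper; -mbq 13 sets the minimum base quality, -nt and -nct set the data and CPU threads.
/usr/bin/java -Xmx16g -jar /usr/local/share/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ${h37}sami.fasta -I ${dir}processed/${bamfile} -o ${dir}processed/${bamfile}.flt.vcf -glm BOTH -mbq 13 -nct 6 -nt 4 -A BaseCounts -A VariantType -rf BadCigar > /dev/null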
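
To get a quick overview of a resulting VCF file, the records can be tallied with a few lines of Python. This is a rough sketch and not part of the pipeline: it classifies records only by the lengths of the REF and ALT alleles, and the sample lines (including the sequence name "ref") are invented.

```python
def count_variant_types(vcf_lines):
    """Count single-nucleotide records versus all other records."""
    counts = {"SNP": 0, "OTHER": 0}
    for line in vcf_lines:
        # Skip meta-information, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]              # REF column
        alts = fields[4].split(",")  # ALT column, one entry per allele
        if len(ref) == 1 and all(len(alt) == 1 for alt in alts):
            counts["SNP"] += 1
        else:
            counts["OTHER"] += 1     # indels and anything else
    return counts


# Invented example records.
sample_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "ref\t100\t.\tA\tG\t50\t.\t.\n",
    "ref\t200\t.\tAT\tA\t50\t.\t.\n",
    "ref\t300\t.\tC\tT,G\t50\t.\t.\n",
]
counts = count_variant_types(sample_vcf)
```

For a real file, pass an open file handle instead of the list, e.g. the .flt.vcf file written by the second command.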

Versions
SAMtools v1.9
Genome Analysis TK 3.3