About STRipy and the STRs database
STRipy application

STRipy is an application for detecting short tandem repeats (STRs) in Illumina short-read sequencing data. Whole-genome, whole-exome and targeted sequencing files are all acceptable, as long as the targeted STR locus is covered. STRipy is divided into two applications: 1) the Client, which is the main graphical application a user interacts with, and 2) the Server, where genotyping takes place. We have set up a public server running in the Cloud that is available to all users, so you do not need to set up your own server to start genotyping.

STRipy's Client can be downloaded as a stand-alone application from this webpage, or you can run the app from the source code (gitlab.com/andreassh/stripy-client). To run your own server, download STRipy's Server and follow the directions in its repository (gitlab.com/andreassh/stripy-server). STRipy runs on macOS, Linux and Windows 10/11 (through WSL). Please see the documentation page for installation and set-up instructions.

STRipy uses ExpansionHunter (Dolzhenko et al., 2017) to determine genotypes and REViewer (Dolzhenko et al., 2021) to visualize read alignments. From a sequencing file, STRipy extracts a small fraction of reads that are relevant to the targeted locus and coordinates of mis-mapped reads (if they exist). The extracted data are then sent to the Server (either the Cloud or a local computer/internal network). For privacy purposes, the file name of your sample is anonymized before it is sent to the Cloud, and all files analysed on the Cloud are deleted immediately after genotyping. No information about users or samples is stored. The file is genotyped with the latest version of ExpansionHunter, after which read alignments are created with REViewer.

The results are then extracted from these files and colourised according to the repeat ranges reported in the literature. Please see the documentation page for a description of the colour scheme. Finally, a PDF report is created, and all of these outputs are returned to the user. You can configure STRipy by clicking the button in the bottom-right corner of the app.

STRs database and population-wide data

The STRs database has been compiled from data obtained from the literature. All reference coordinates have been manually confirmed and are presented in a 0-based coordinate system. We use repeat ranges derived from the literature for colouring alleles, but it is important to note that STRs are highly complex and these ranges are only indicative, not definitive (e.g. the pathogenic range might vary between populations, interruptions can affect pathogenicity, and so on). Population data were obtained by genotyping over two and a half thousand whole-genome samples from the DRAGEN reanalysis of the 1000 Genomes Project dataset (1kGP-DRAGEN). We used ExpansionHunter with our own catalogue to genotype all samples. Replaced and nested types of repeats were genotyped separately, using ExpansionHunter's algorithm only for spanning reads and specifically counting the pathogenic repeat unit. Alleles supported by fewer than 5 spanning reads or fewer than 50 flanking and in-repeat reads combined were filtered out. The individuals in this dataset were phenotypically healthy at the time of genome sequencing; however, some of them could carry disease-causing genetic mutations that may manifest at a later age.
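The read-support filter described above can be illustrated with a short Python sketch. This is not the code used to build the database: the `AlleleSupport` container and the function name are hypothetical, and the sketch takes a literal reading of the rule (an allele is dropped if it misses either threshold). The actual pipeline may apply the thresholds differently, for example depending on the allele type.

```python
from dataclasses import dataclass


@dataclass
class AlleleSupport:
    """Read counts supporting one genotyped allele (hypothetical container)."""
    spanning: int   # reads that span the whole repeat
    flanking: int   # reads that cover one flank and part of the repeat
    inrepeat: int   # reads that lie entirely inside the repeat


# Thresholds quoted in the text.
MIN_SPANNING = 5
MIN_FLANKING_PLUS_INREPEAT = 50


def is_filtered_out(support: AlleleSupport) -> bool:
    """Literal reading of the rule: drop the allele if either threshold is missed."""
    too_few_spanning = support.spanning < MIN_SPANNING
    too_few_other = (support.flanking + support.inrepeat) < MIN_FLANKING_PLUS_INREPEAT
    return too_few_spanning or too_few_other


# Made-up read counts, for illustration only.
alleles = [
    AlleleSupport(spanning=12, flanking=40, inrepeat=15),  # passes both thresholds
    AlleleSupport(spanning=3, flanking=60, inrepeat=0),    # too few spanning reads
    AlleleSupport(spanning=8, flanking=20, inrepeat=5),    # too few flanking + in-repeat reads
]
kept = [a for a in alleles if not is_filtered_out(a)]
```

With these made-up counts only the first allele is kept.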

In the database we have defined repeat types as follows:

  • Standard: repeats where one specific motif (for example, CAG or CCG) is present in healthy individuals and becomes pathogenic when expanded to a larger number of repeats.
  • Imperfect GCN: repeats which encode the amino acid alanine, but whose sequence can be composed of different, synonymous repeat units: GCA, GCG, GCC or GCT.
  • Replaced/Nested: repeats where the pathogenic motif is not present in the reference genome and is mostly not found in healthy individuals. CANVAS is the only example of a replaced-type repeat, where a stretch of AAAAG repeats is replaced with a sequence composed of another motif (AAGGG). All the other diseases in this group are caused by nested-type repeats, where a stretch of the pathogenic motif (such as TTTCA) is inserted between or next to the non-pathogenic endogenous repeats (such as TTTTA).
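To make the difference between a standard and an imperfect GCN repeat concrete, here is a minimal Python sketch. The function names are hypothetical and are not part of STRipy. It only shows that a GCN tract can mix the four alanine codons and so is not a perfect repetition of any single motif.

```python
import re

# One alanine codon: "GC" followed by any base (GCA, GCG, GCC or GCT).
GCN_TRACT = re.compile(r"(?:GC[ACGT])+")


def count_gcn_units(sequence: str) -> int:
    """Return the number of GCN codons if the whole sequence is a GCN tract, else 0."""
    seq = sequence.upper()
    if GCN_TRACT.fullmatch(seq) is None:
        return 0
    return len(seq) // 3


def is_perfect_repeat(sequence: str, motif: str) -> bool:
    """True if the sequence is a single motif repeated end to end (the standard case)."""
    seq, unit = sequence.upper(), motif.upper()
    if not seq or len(seq) % len(unit) != 0:
        return False
    return seq == unit * (len(seq) // len(unit))


mixed_tract = "GCAGCGGCCGCT"                       # four different alanine codons
gcn_units = count_gcn_units(mixed_tract)           # 4
perfect = is_perfect_repeat(mixed_tract, "GCG")    # False: the codons differ
standard = is_perfect_repeat("CAGCAGCAG", "CAG")   # True: one motif throughout
```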

Validation of loci

We validated a locus in each gene by simulating heterozygous as well as homozygous samples with alleles ranging from 60 bp to over 2 kbp, using over 60 thousand samples. We simulated 150 bp reads and 450 bp fragments (50 bp SD) and calculated the Root Mean Square Error (RMSE) as a measure of accuracy for samples whose long allele fell in one of three regions: (A) up to the read length, (B) from the read length to the fragment length and (C) over the fragment length. Additionally, we used STRipy on true-positive whole-genome samples to confirm its use on real biological samples. STRipy determined an allele to be in the pathogenic range for eight out of nine affected individuals.
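The per-region accuracy measure can be sketched in Python as follows. This is an illustration, not the validation code itself. The function names are hypothetical, the region boundaries are assumed to be inclusive at the upper end, and allele sizes are assumed to be compared in base pairs (the text does not say whether RMSE was computed in base pairs or repeat units).

```python
import math

READ_LENGTH = 150      # bp, simulated read length
FRAGMENT_LENGTH = 450  # bp, simulated mean fragment length


def region_of(allele_bp: int) -> str:
    """Assign the long allele to region A, B or C by its true length."""
    if allele_bp <= READ_LENGTH:
        return "A"
    if allele_bp <= FRAGMENT_LENGTH:
        return "B"
    return "C"


def rmse(true_values, estimates) -> float:
    """Root mean square error between true and estimated allele sizes."""
    squared_errors = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_by_region(samples) -> dict:
    """samples: iterable of (true_long_allele_bp, estimated_long_allele_bp)."""
    grouped = {}
    for true_bp, estimated_bp in samples:
        grouped.setdefault(region_of(true_bp), []).append((true_bp, estimated_bp))
    return {
        region: rmse([t for t, _ in pairs], [e for _, e in pairs])
        for region, pairs in grouped.items()
    }


# Made-up (true, estimated) pairs, for illustration only.
simulated = [(90, 90), (120, 126), (300, 294), (300, 306), (900, 870)]
errors = rmse_by_region(simulated)
```

Splitting by region reflects the expectation that accuracy differs once an allele exceeds the read length and again once it exceeds the fragment length, so a single overall RMSE would hide those differences.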

ExpansionHunter's catalogue creator and results analyser

We have created a catalogue creator for ExpansionHunter which uses our defined loci and makes it easy to create a variant catalogue for various genomes, just by selecting the loci you want to target. For each locus, we simulated long alleles and determined the most likely mis-mapped locations of fully repeated reads in the genome. These coordinates were defined as off-target regions to enable genotyping of alleles longer than the fragment length. However, it is important to note that one should be cautious when using ExpansionHunter with off-target regions on real-life samples: long alleles can be overestimated when the off-target locus contains fully repeated reads that do not originate from the targeted STR locus. You might want to double-check such a sample with STRipy, which uses stricter rules for fully repeated reads to avoid overestimation.
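For readers who have not seen a variant catalogue, the Python sketch below builds one entry with off-target regions. It is not the catalogue creator itself. The field names (`LocusId`, `LocusStructure`, `ReferenceRegion`, `VariantType`, `OfftargetRegions`) follow ExpansionHunter's catalogue format as we understand it and should be checked against the ExpansionHunter version in use. The locus name and all coordinates are hypothetical placeholders, not real loci.

```python
import json


def make_catalogue_entry(locus_id, motif, reference_region, offtarget_regions=None):
    """Build one catalogue entry for a simple single-motif repeat (illustrative helper)."""
    entry = {
        "LocusId": locus_id,
        "LocusStructure": f"({motif})*",
        "ReferenceRegion": reference_region,
        "VariantType": "Repeat",
    }
    if offtarget_regions:
        # Entries that list off-target regions use the "RareRepeat" variant type.
        entry["VariantType"] = "RareRepeat"
        entry["OfftargetRegions"] = list(offtarget_regions)
    return entry


# Placeholder locus and coordinates, for illustration only.
entry = make_catalogue_entry(
    locus_id="EXAMPLE_LOCUS",
    motif="CAG",
    reference_region="chr1:1000-1060",
    offtarget_regions=["chr2:5000-5500", "chr7:9000-9400"],
)
catalogue_json = json.dumps([entry], indent=2)
```

Leaving out `offtarget_regions` produces a plain `Repeat` entry, which sidesteps the overestimation risk described above at the cost of not sizing alleles longer than the fragment length.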

We also made a results analyser for ExpansionHunter which takes ExpansionHunter's JSON results file as input and graphically outputs the genotypes for each locus, coloured by whether an allele is in the normal, intermediate or pathogenic range. This makes it easy to assess your results, and it works best with our catalogue creator because the locus names match. Repeat ranges are derived from the literature and are therefore indicative, not definitive (e.g. a locus in your sample could contain interruptions that are not evident from the genotype, so an allele might not cause the phenotype even when it falls within the pathogenic range).
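The core idea of the analyser, reading genotypes from the JSON results and assigning each allele to a range, can be sketched in Python as below. This is not the analyser's code. The JSON layout (`LocusResults`, `Variants`, `Genotype`) reflects our understanding of ExpansionHunter's output and should be verified against your version, and the locus name and thresholds are hypothetical placeholders rather than values from the database.

```python
import json

# Hypothetical thresholds in repeat units; real ranges come from the literature.
RANGES = {"EXAMPLE_LOCUS": {"normal_max": 26, "intermediate_max": 35}}


def classify(locus_id: str, repeats: int) -> str:
    """Place one allele in the normal, intermediate or pathogenic range."""
    limits = RANGES[locus_id]
    if repeats <= limits["normal_max"]:
        return "normal"
    if repeats <= limits["intermediate_max"]:
        return "intermediate"
    return "pathogenic"


def classify_results(results: dict) -> dict:
    """Map each variant ID to a list of (repeat count, range label) per allele."""
    labelled = {}
    for locus_id, locus in results.get("LocusResults", {}).items():
        if locus_id not in RANGES:
            continue  # no range information for this locus
        for variant_id, variant in locus.get("Variants", {}).items():
            genotype = variant.get("Genotype")
            if not genotype:
                continue  # no call was made for this variant
            alleles = [int(a) for a in genotype.split("/")]
            labelled[variant_id] = [(a, classify(locus_id, a)) for a in alleles]
    return labelled


# Minimal made-up results document, for illustration only.
example = json.loads(
    '{"LocusResults": {"EXAMPLE_LOCUS": {"Variants": '
    '{"EXAMPLE_LOCUS": {"Genotype": "19/42"}}}}}'
)
labels = classify_results(example)
```

Matching on the locus ID is why the analyser works best with catalogues whose locus names agree with the range table.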


Contact

Please use the contact form to report any issues, give feedback or make any other inquiries. Your message will be forwarded to the author. Alternatively, you can also send an email to . Thank you!
