PanelPlex™ Best Practices Webinar - Lake Harding Association

By Micah Moen, February 12, 2020

Welcome, everyone, to today’s webinar. My
name is John SantaLucia, I’m the founder of DNA Software, and it’s a pleasure for me
today to be able to give you a webinar on PanelPlex Best Practices. Before I begin, I want to thank all of the
users who’ve sent us their comments, critiques, and questions about PanelPlex and PanelPlex
Consensus. I have designed this webinar today to sort of amalgamate the bulk of the questions
that we’ve got, and requests, and with the [inaudible 000031] to directly sort of deal
with some of the features and questions that people have. The talk today is going to be divided into
two main parts. The first half of the webinar will be based on best practices about PanelPlex,
and then the second half of the talk will be about PanelPlex Consensus. So, there’s
two halves to it, each of them will be about half an hour each, and we’ll deal with these
topics that are shown here: setting up junctions and regions, how that should be done optimally;
advanced parameters in PanelPlex, how to set up a high-level multiplex, batch-mode singleplexes,
and two case studies, one on a cancer gene panel and another one on mRNA Splice Junction
design. Then we’ll do a section on PanelPlex Consensus,
where we will focus a lot on making high-quality, optimal inclusivity, exclusivity, and background
panels. We’ll talk about some of the challenges for doing designs for bacteria, and talk about
dealing with cases where viruses in particular have a high degree of diversity and require
splitting the inclusivity into multiple groups. I will talk about the challenge of choosing
the best keystone for your panel, and then also give a demo of the software and show
some of the advanced parameters. First major question is, “Which version of PanelPlex should
I use?” There are two different software packages. The first one is PanelPlex Consensus. PanelPlex Consensus
is software that is for detection of infectious diseases, usually viruses, bacteria, parasites;
we’ve done all of those in the past. This software is meant to do singleplex or
multiplex detection of a single-pathogen panel. So, if you want to detect all the variants,
for example, of Influenza A, that’s what you would use, PanelPlex Consensus, to be able
to detect all of those variants of the virus or bacteria without detecting any of the things
that are in the exclusivity or background. PanelPlex, regular PanelPlex, is meant for
doing high-level multiplexing. There’s a variety of applications for that. One application
is for genome enrichment, or next-gen sequencing, where you want to target many different genes
within the same one genome; that’s one application of PanelPlex. It also allows you to do multiplex
tiling assays where you want to cover a full exon, let’s say, with amplicons tiled throughout
the exon. It allows you to do batch-mode singleplex. That is when, if you have 100
different targets, you want to get singleplex designs for each one. This allows you to do
it… instead of running the program 100 times with a single target, you can run 100 targets
in one run. Detection of splice junctions is in this
section of PanelPlex, and also, PanelPlex can be used to do multiplex detection of multiple
pathogens, but in this case, it’s not a consensus design if you’re doing a multiplex assay in
this case. So, those are the two major flavors with PanelPlex, and we’ll be talking about
both of them. We’re going to start right now talking about regular PanelPlex. To talk about that, we’re going to use a case
study that we did for two cancer genes, the HER2 and HER3 genes, which are the epidermal growth
factor receptor, type two and type three. These are located in chromosomes 17 and 12
of the human genome, and within chromosome 17, there’s 19 SNP sites that are of interest,
and for the HER3, in this case, there’s 10 different SNP sites that we’re interested
in. We’d like to develop a multiplex that will
detect all of them. We want to make sure that whatever design we do does not hit any of
the other low-frequency SNPs that are found outside of these regions, outside of these
cancer-causing SNPs, the other minor variations in the human genome. We don’t want to attack
those. So, there’s a bit of a challenge here: we use dbSNP to avoid the nearby SNPs and
yet have our primers be able to detect all 29 of these SNPs. As we’ll see, the
program will whittle this down from a 29-plex down to a 16-plex by combining the different
[inaudible 000429]. Now, in this case, we’re dealing with assays
for circulating tumor DNA, so those are relatively short pieces of DNA, around 140 nucleotides
or shorter, so we’d like to have amplicons that are shorter than that, and that is a
part… that consideration goes into our design. The most challenging part of setting up the runs
for regular PanelPlex is just getting the list of all the SNPs, and these are SNPs and
indels, by the way. We allow for specifying… we call the location of the SNP or the indel
the junction location, and we specify a jmin for the junction minimum and a jmax for the junction maximum in order
to allow us to specify a SNP site or indels. So, for example, the first row of data here
is data we got from the COSMIC database for chromosome 17. There’s a SNP at position 952.
So, since it’s a SNP, we put the same value in for both jmin and jmax. All right? And
here’s a different SNP for the same location. Here’s a third SNP three nucleotides away
at 955 here. So, that SNP is close, it’s too close; we couldn’t design two separate PCRs
that, in a reaction at the same time, would be able to separately amplify something at 952 and
something at 955. They’re only three nucleotides away. So, all three of these targets that are highlighted
in yellow are targets that we would combine. We’d make a new jmin of 952 and a new jmax
of 955, include all three of these. The program does that process of combining the assays
automatically. It recognizes things that are close, so, all the ones in green are also
a group; they form a group that are too close to have separate assays for each one. You
can see in this case, they go from 3,966 to 4,008. So, they have a range there of 42 nucleotides.
That’s not big enough for a separate amplicon. All those clusters that… I’m going to review for you a little bit about
how the software does this sort of clustering, and where the calculations come from for the
design regions, for the forward primer and the reverse primer. Shown here is an example
for the TP53 gene, a tumor suppressor gene, just as an example. This is a particular SNP
site that we would like to be able to detect, and the software allows you to specify a number
of things. One is it allows you to specify a minimum nucleotide, number of nucleotides,
on either side of the junction. That’s abbreviated as Min Nucs in this particular slide. That’s just spacing to sort of allow you to
have regions to do sequencing. So, whatever result you get, you’re going to have enough
sequence there to be sure that that wasn’t a false amplicon or false read. So, you can
differentiate that. Now, the other issue is, how do we determine
where to start with the forward primer design region, the start and end of that,
and where do we make the reverse primer design region? The start and end of that. Those are
really determined based on the length of the amplicons that you want, which I’ll show you
in the next slide. But while we’re on this slide, I do want to
mention that outside of this critical region where this SNP site is, whatever these design
regions are for the forward primer, there just happens to be a SNP site, a high-occurring
SNP site, within that site. So, we want to avoid having our primers land there because
if they did, then there would be a certain segment of the population for which the assay
might not work. That SNP variation might make the primer fail. So, basically, the program automatically is
interfaced with dbSNP to detect those sites. We have a default setting at one percent,
so any SNP that occurs at more than one percent frequency in the human genome, we avoid having primers
go in those locations. You can adjust that to a higher value or lower value if you desire.
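As a rough sketch of that frequency cutoff, assuming SNPs are given as a simple mapping from position to population frequency: the function names, the data layout, and the example positions below are all hypothetical, and only the one percent default comes from the talk.

```python
from typing import Dict, List


def blocked_positions(snp_freqs: Dict[int, float], min_freq: float = 0.01) -> List[int]:
    """Return the SNP positions whose frequency exceeds min_freq."""
    return sorted(pos for pos, freq in snp_freqs.items() if freq > min_freq)


def primer_is_allowed(start: int, end: int, blocked: List[int]) -> bool:
    """Reject a primer spanning [start, end] if it covers a blocked SNP."""
    return not any(start <= pos <= end for pos in blocked)


# Hypothetical SNPs: a common one at 1010 and a rare one at 1050.
snps = {1010: 0.12, 1050: 0.002}
blocked = blocked_positions(snps)               # only the common SNP is blocked
print(primer_is_allowed(1000, 1022, blocked))   # False: covers position 1010
print(primer_is_allowed(1030, 1052, blocked))   # True: 1050 is below the cutoff
```

In this sketch, raising `min_freq` makes the filter more permissive, and a value of 1.0 blocks nothing because no frequency can exceed it.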
It’s in the advanced parameters, which I’ll show later on. When you submit your job, any sites, any SNPs
that were detected in the locality of your design, the email with your results shows
you where those SNPs were that it avoided. Now, this slide is meant to show you a more
detailed presentation about how the regions are calculated. So, jmin and jmax are what
the user initially specified, and then beyond that, we have the maximum length of the gap
and minimum nucleotides and the maximum [inaudible 000855]. Those all get used to compute where
the forward primer start position is and where the reverse primer end position is; all
of them need to be specified. But these details are given in the PanelPlex
manual, but if there are questions specifically you have about these, because you can manually
set these… if you see how PanelPlex sets them up and you want to change one of them,
you can do that. You have to re-run the program, but let’s say there’s a SNP right near the
end here and you want to avoid it, or some other reason why you want to move the FP start
position, you can do that in that junctions file. So, the junctions file is this file I showed
you here. This is an actual picture of a junctions file. So, you have to give a name for each
junction, the accession; that is the chromosome that it’s coming from, and then the position
within that accession, where the junction is. So, you can either specify the junction, or
you can directly specify the forward primer start and end, probe start and end, reverse
primer start and end, and reverse transcription start and end if it’s an RNA. That’s what that
file’s all about. Now, in the event that you have clustered
junctions like we have in this case with HER2 and HER3, where we have multiple junctions that
are all right on top of each other, it would be impossible to have individual PCRs for
each one. That’s why we specify the jmin is the sort of first one in this group, and jmax
is the last SNP in the group, and the program will automatically take the raw values that
you gave it and figure out this clustering, and set up the forward primer region and the
reverse primer region to accommodate. There are certain cases where junctions, or
the SNP sites, are far enough away that you could make separate amplicons. But they’re
close enough that you can’t maybe… Well, if they’re very far away, then you’d just
make two separate amplicons, but if they’re in that middle region where they’re close
enough where you could make a choice of either amplification reactions or having [inaudible
001050] amplification reactions, combine them. So, in this case here, you can see that the
forward primer design region for junction one overlaps with the reverse primer design region
for junction two, and we can’t have that. If we kept it like that, then it would be possible
for the forward primer of one of them to be very close to the reverse primer of the other,
like a mini amplicon. So, the program is aware of that problem, and it re-sets up the design
one of two ways. It gives the user a choice. Do you want to combine, or do you want to
split? So, this is one of the common questions we
have: what’s going on when we do split versus combine? What’s happening when you do split, we’ll
take that choice first, is it’s trying to make two separate amplicons: one for junction one,
and a separate amplicon for junction two. All right? And it does allow for a little
bit of overlap between these design regions. In principle, if you had the five prime ends
of your primers overlapped a little bit, that would not cause a mini amplicon; it’s only
if you get the three prime ends of the primers overlap that you create problems. So, we allow
up to a 13-nucleotide overlap, and then we get two separate amplicons. Now, by choosing split, that will result in
a larger plex size for your PCR, but it also limits the design space because everything
is crunched up here. In order to accommodate these two sets of primers, instead of having
a wide forward primer range and a wide reverse primer range, they had to be narrowed to prevent
that sort of mini amplicon problem. The other alternative is to use the combine
option. Combine option will make it so that one amplicon will amplify both junction one
and junction two into one bigger amplicon. But you can see here, this also restricts
the design range, but you do get a smaller plex. And generally, we recommend using the
combine option, but either one can work in different circumstances. Another use case of the PanelPlex software
is to detect mRNA splice junctions. So, for messenger RNA assays, it’s desirable to amplify
the RNA in preference to the genomic DNA. And one of the tricks to that
is to choose the splice junction sites. The reason to choose a splice junction site is
because in the genomic DNA, if you choose a splice junction between two exons… so,
the forward primer’s in one exon, the reverse primer’s in the other exon, then for the genomic
DNA, that large intron will separate the primers and make the amplification inefficient of
the genomic DNA. And then, second of all, the amplicon made
by the mRNA is smaller and thus more efficient, but also we recommend generally putting the
probe in a location to straddle that junction. So, that can make it so that the probe only
binds to amplicons made from the messenger RNA. An amplicon made from the genomic DNA
wouldn’t work because the genomic DNA does not contain the splice junction sequence.
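A toy illustration of why a junction-straddling probe is specific to the spliced message, using made-up sequences. Everything here is an assumption for the example: the exon and intron strings are invented, `junction_probe` is a hypothetical helper, and exact string matching stands in for real hybridization, which depends on thermodynamics rather than perfect identity.

```python
# Invented sequences for illustration only.
exon1 = "ATGGCTAGCTAGGCTA"
intron = "GTAAGTCCTTTTCCTTTCAG"
exon2 = "GGATCCGATTACGCTA"

mrna = exon1 + exon2              # spliced message: the exons are adjacent
genomic = exon1 + intron + exon2  # genomic DNA: the intron separates them


def junction_probe(left_exon: str, right_exon: str, flank: int) -> str:
    """Build a probe from `flank` bases on each side of the splice junction."""
    return left_exon[-flank:] + right_exon[:flank]


probe = junction_probe(exon1, exon2, flank=8)
print(probe in mrna)     # True: the junction sequence exists in the mRNA
print(probe in genomic)  # False: the intron interrupts it in genomic DNA
```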
So, that makes it unique. All right, so, that’s sort of a general strategy
for setting up a splice junction, or… detecting mRNA, specifically, in the presence of genomic
DNA. Now, one of the other challenges that comes up with this topic is that you need
to find conserved splice junctions. That is, sometimes there are splice variants and you
don’t want to choose a junction location that’s not conserved. There’s three different splice
variants that are all biologically important for your particular purposes, you want to
make sure you choose a splice junction that is actually found in all of them, and not
left out of some of them. So, I’ve made a little [inaudible 001437]
is one particular one. This is a ribosomal RNA protein, [inaudible 001443] protein 13,
that is encoded here. There’s three different copies of this messenger, or three different
variants of this messenger RNA; it gets spliced differently, so all three of these different
messenger RNAs have a splice junction at these locations that are conserved. But the splice junction at this location,
in the middle here that’s shown in red, that one is not conserved. One of the splice junctions
would be different because it does not contain one of the exons. One of the exons has been
left out. So, two out of the three are the same, but one is different, so that one, it’s
not conserved. That would not be a good site to target for an assay. Some other ones are marginal, like you might
have a site here that’s conserved; you’ll notice that the exon on the left-hand side
is the same in two of them, and one of them’s a lot shorter. So, we would want to check
that to see if that is a conserved splice site or not in more detail. But there’s some
other tricks you can do. You can use these pictures here from… This is a picture from GenBank looking at
the RefSeq annotations here. But you can also go and actually get the actual positions
of the exons within each messenger RNA, and then line them up and see, are the exons the
same length? If they are, those are likely places where you could target your site. Another challenge for these is dealing with
pseudogenes. Pseudogenes can really throw a wrench in this because pseudogenes are often
the result of taking that spliced messenger RNA and making a copy DNA, and then inserting
that copy DNA into the human genome. But those look quite similar if not identical sometimes
to the messenger RNAs, and that can be a concern and something to look for. Make sure that
your assay is not picking up pseudogenes. Doing a multiplex of those messenger RNAs
is highly doable. What we recommend doing is collecting the conserved… different targets;
in this case, I have four different targets. These ones that are highlighted in green are
the junctions that are conserved. So, now we want [inaudible 001654] choose one splice
junction from each of these different four targets and put them into a junctions file.
So, here are these four different targets; for each one, I chose one of the splice junctions
for each one. And you just run that as a multiplex [inaudible
001711]. And if you get a good result, you’re set to go. On the other hand, you might find
that your choice is not a good choice. Maybe [inaudible 001720] happens, you get a low
[inaudible 001721] and that is an indication that a particular [inaudible 001727] make
a mini amplicon that you want for these assays. And we found two of those in this case that
I’ve highlighted in red. These were regions where the RNA was just so folded that we couldn’t
get the primers to bind tight enough within the region that we gave it. So, we have other sites. I’d just choose a
different splice junction until you find one where all of them are compatible. And I want
to show you a little bit about a demo on PanelPlex and use that as a means to answer some other
common questions that we get. This is right on the one that all of you guys use. Choose the human genome as the target, and
hit Next here. All right. So, typically you would give this a name like HER2, HER3; we’re
going to do a little assay on that. Now, we have a number of choices here. We can choose
to do a multiplex PCR or we could do… If we choose multiplex, then it’s going to try
to design all the primers so they’re compatible with each other and can work in a single reaction. If we choose batch mode singleplex, then what
that does is allow you to get a single PCR assay for every single target. In this case,
it would make 29 different PCR reactions for this HER2 and HER3. If I chose automated tiling,
this is for when you have a single target and you want to have PCRs that are lined up
one right after the other and design all of them to cover an exon, for example. I’m not
going to cover that example any further today. I want to make sure we get through this multiplexing. Then in the detection type, you want to choose…
Sequencing means there’s going to be no probe design. If you want to design a TaqMan probe,
that doesn’t make it any harder to do the design, but it will do both the primers and
the probe if you do that. Right? So, sequencing is this sort of typical mode that people will
use. Minimum amplicon gap means that the three
prime ends of the reverse primer and the forward primer must be at least 25 nucleotides away.
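The gap settings in this part of the demo, a 25-nucleotide minimum and a 95-nucleotide maximum, can be sketched as a small calculation. The function names and the 23-nucleotide primer lengths are assumptions for illustration; the total amplicon length is taken as the gap plus both primer lengths.

```python
def gap_is_allowed(gap: int, min_gap: int = 25, max_gap: int = 95) -> bool:
    """Check that the distance between the primers' 3' ends is within range."""
    return min_gap <= gap <= max_gap


def amplicon_length(gap: int, fwd_len: int, rev_len: int) -> int:
    """Total amplicon length: the gap plus the two primer lengths."""
    return gap + fwd_len + rev_len


print(gap_is_allowed(20))           # False: closer than the 25-nt minimum
print(gap_is_allowed(95))           # True: exactly at the maximum
print(amplicon_length(95, 23, 23))  # 141, i.e. around 140 nucleotides
```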
So, all solutions that don’t satisfy that are not going to be shown to the user. Maximum
amplicon gap 95 nucleotides. So, if the gap is 95 nucleotides, then the total amplicon
would be 95 plus the forward primer length plus the reverse primer length. So, you get
amplicons in this case around 140 nucleotides, if you choose that. This is a setting that is a useful one to
set to a bigger value. If you’re not dealing with circulating tumor DNA, then this is something
that, in order to give the software a little bit more freedom to do the design, I would
set that to 120 nucleotides. All right, so, this next set, primer length
range, this is also a sort of advanced user setting here. If you have a very AT-rich sequence,
you know that it’s AT-rich, then this setting of 28 is usually big enough. But if you had
something really crazy, you could set that up a little higher, maybe up to 32 nucleotides.
If you know that your target is very GC-rich, then you might want to maybe set this region
a little bit smaller than 23, down to 18 or so. But this range would allow you to sort of
capture every possibility. The default settings work, like, 99% of the time, but sometimes
you have an extreme target where it’s useful to set those ranges a little bit. This, I
just opened up this file that had the 29 targets from HER2 and HER3, and it automatically processed
it and asks us this question: “Do you want to combine, or do you want to split?” So,
there’s just a few where it could have this choice, and I wanted to choose combine, in
this case. So, when it did that, it took the 29 different
HER2 and HER3 variants and it combined them, and it automatically set up the forward primer
range, the junction range, the reverse primer range, so, these design settings are all there.
And you can see which ones it combined. It combined this 488 mutation with the 385 mutation,
for example. So, two of them are combined here. Here, there’s a cluster of eight of
them. Here’s a cluster of six of them. Insertions and deletions as well as SNPs got combined
here. But it was smart to combine them all in this sort of intelligent way. The sort of default settings we have here
for the primer concentration, temperatures, and for the buffer, these are generally applicable
but if you have something in particular, your [inaudible 002149] temperature would be a
particular one that folks would design… Maybe they do 60 degrees, some other number;
whatever your assay temperature is, put that in there. That one is important. The primers are designed to work at that 60-degree
temperature. The probes we designed to work at the extension temperature, if you had probes
in your design. In this case, 72 degrees. And then, if you have tails that you want
to add to each of your primers, you can do that here. If you have a universal primer
sequence and a spacer sequence, you can… generally, the spacer, we can’t put in a random
nucleotide for a spacer. We just recommend putting a bunch of A’s. That’s a way to run
that. So, you can put universal tails on your primers if you would like. Then going to the advanced settings here,
there’s lots of features here that are described in great detail in the user manual. I’m not
going to go through every single one of them, but I’m going to give you some highlights
here of some things that are settings you might want to consider changing. Down here, the minimum SNP, the [inaudible
002247] frequency, I told you we have it set to one percent. So, rare SNPs in the human
genome, anything more than one percent will be detected and the primers will avoid that.
If you want to set that higher, make it more permissive, you’re welcome to do that. Or
you can turn it off altogether by setting that value to one. Set it to one, then everything
would be… it would just ignore this SNP. Some other settings here are a little more
advanced. This one, number of solutions to output; some users don’t want to get deluged
with hundreds of different multiplex solutions, but we allow up to 254 solutions that you
can output, and there’s no extra cost and time. It’s not any harder, it’s just amount
of output of alternative multiplexes that you can bear to see. So, in this case, you
can do 254, or one. One is the default setting. Most people just want to see that that’s multiplex
and try it, see what happens. Some other things here, this panel limit size…
What happens is the space of multiplex, the number of possible multiplexes you can do
can be [inaudible 002354] exploding. So, if you have 100 targets, if I saved 10 different
primers for each of those 100, then I would have 10 to the 100th power possible combinations.
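The arithmetic behind that explosion is easy to check. In this sketch the function name `panel_search_space` is made up for the example; it simply raises the number of saved candidates per target to the power of the number of targets, and compares ten candidates per target with a smaller value such as four.

```python
def panel_search_space(candidates_per_target: int, n_targets: int) -> int:
    """Number of ways to pick one candidate design for every target."""
    return candidates_per_target ** n_targets


ten = panel_search_space(10, 100)
four = panel_search_space(4, 100)
print(f"{ten:.2e}")   # 1.00e+100
print(f"{four:.2e}")  # 1.61e+60
```

Even four candidates per target leaves far too many combinations to enumerate one by one, which is why the number saved per target has such a direct effect on runtime.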
So, we generally recommend using four. This would mean for every single singleplex in
your reaction, we’re going to save the four top designs and those are candidates for multiplexing. If you find that your multiplex fails, you
try it and it just cannot find a solution to it, then you could set that a little higher.
You could bump it up to eight solutions per panel. But that will increase the runtime,
approx… If you double the number of panel [inaudible 002436] you’re going to double,
more or less, the runtime of the program. Those are sort of the main features that I
wanted to show you from the regular PanelPlex. So, we’ve come up on exactly 30 minutes of the
talk. I want to now take you back to the presentation and now begin talking a little bit about PanelPlex
Consensus, all right? So, let’s spend another half hour just on that topic. All right, so, the first topic with PanelPlex
Consensus… Let me make this full-screen again for you. With PanelPlex Consensus, we
have the challenge of gathering together all of the variant genomes that we want it to
detect. For example, for the Zika virus, we were able to collect 168 complete genomes
of the Zika virus, and we want to make sure that whatever diagnostic we develop is able
to detect all of the variants of the Zika virus. So, that’s our inclusivity set. The inclusivity are variants of what you want
to detect. The exclusivity are near-neighbors, things you don’t want to detect; those are
things that might cause a false positive in your assay. And generally, the things that
we put into the exclusivity are organisms that are either phylogenetically related to
the Zika virus, or are examples that might actually show the similar symptoms as the
Zika virus, and we want to make sure that any test we have would not give a false positive
to those. So, in this case, we chose a series of fever-causing
viruses like Dengue fever causing virus, the [inaudible 002611] virus [inaudible 002612]
et cetera. So, we put those into an exclusivity plan. But it’s important that these are somehow
near-neighbors. You don’t want to put the whole kitchen sink into the exclusivity, because
that’s not how the program works. And on the other hand, in the background is
anything in your sample that is a contaminant that you do not want to detect. For example, if
I’m trying to make a Zika virus diagnostic, I would not want a false positive [inaudible
002638] human genome, and since Zika virus is an RNA virus, we wouldn’t want a false positive
to one of the human messenger RNAs. So, I would put the human RNA RefSeq in this background,
as well. Gut microbiome, things like that, the human microbiome, those are things that
also could be a contaminant in your sample that could cause a problem. So, unrelated fever-causing virus, like Influenza
A; it’s not related to Zika virus at all, but it could present in the clinic similarly
in a way that you might want to say, “Well, let’s just make sure an assay doesn’t give
a false positive for that.” So, that’s sort of the way we set it up. We have this inclusivity,
exclusivity and background, and now I want to give you some more detail about best practices
for each of those so that you have some good guidelines on how to do that. Before I do that, I want to talk about the
two key ideas. The quality of the inclusivity, exclusivity and background databases is absolutely
essential to getting high-quality design. If you have a garbage database that you input,
then no matter how good the software is, you are not going to get good designs out. So,
it is really important for the user to spend some time to validate their databases to make
sure that they’re good. One bad sequence can spoil the whole design,
so, one bad apple can spoil the whole bunch. Those are very important philosophies to keep
in mind. We have become more and more aware, based on projects we’ve done for many customers,
that this issue of database quality is often the most time-consuming step of the
whole process, and we are actually, as we speak, developing methods to check your
database quality. But nonetheless, let’s get into talking about some more best practices
about this. First of all, just the definition of a inclusivity
playlist; this is from the user manual. The inclusivity playlist is the collection of
target sequences that you want to detect. For example, variants of Ebola. Now, I
have an extensive slide here that gives all of our recommendations for having an optimal
inclusivity list. We’re going to be providing these slides to anyone who wants them. We
can go through those in detail, but I want to hit some of the highlights here of some
of the recommendations. The first recommendation is that the inclusivity
list is very important to have high-quality sequences. You don’t want to put all sequences
that you could find from all the databases in the world that are partial sequences. You
want full-length, high-quality genomes here because, if you have poor-quality sequences
in there, what will happen is the design will not be able to find a solution even though…
you can have a SNP in a particular site that’s really not a real SNP; it’s actually a sequencing
error. All right? Or some of the sequences have sequence missing
in them, and so, the program will try to avoid those regions because if some of the members
of the inclusivity are missing a sequence, it’ll avoid that. And those might have been
great locations to do a design. So, generally, you want to collect as many full-length genomes
as you can possibly get, and here are some of the databases we commonly use. There’s
the NCBI viral genomes databases, Virulogical is a wonderful organization that has curated
databases for high-value viruses; the [inaudible 003012] Virus Database and ViPR. Those are
the ones that we use very commonly for viruses. For bacteria, there aren’t as many resources,
and we have to sort of live with GenBank and trying to wade through those, generally. Buyer
beware, caveat emptor: the databases are notoriously unreliable, and you should really be fully
aware of that. For example, we recently did a project for Bacillus anthracis where some
of the sequences that were called Bacillus anthracis were in fact misnamed, they were
misclassified, they should have been called Bacillus cereus. And also, some of the sequences for
Bacillus cereus, some of the exclusivity sequences, were in fact Bacillus anthracis. So, they
were misnamed, and as a result of that, that sort of wreaked havoc with trying to find…
So, what we saw was we ran these assays initially and found low coverage. We got 70% coverage,
kind of thing. Then when we looked more carefully, we realized, “Oh, it’s because it’s got Bacillus
cereus in there,” and once we weeded the Bacillus cereus out of that inclusivity, and
then, since the exclusivity also had some anthracis sequences in there, we had to get
those into the inclusivity. So, basically, we sort of had to do a little
bit of work to curate those databases and make them more accurate than what was just
off the shelf. So, it’s definitely really important to do that. Also, a lot of the annotations
in GenBank are incorrect. For example, in the Brucella suis, Brucella abortus, those
genomes have two chromosomes, chromosome one and chromosome two, and different groups,
they did their sequences and labeled the chromosomes differently. That created problems, because we had chromosome
ones in our inclusivity, chromosome two in our exclusivity, and we don’t want to be mixing.
We don’t want to put things that should be in the inclusivity in the exclusivity because
the program is going to try to avoid hits to the exclusivity. And you don’t want things
that should be in the exclusivity in your inclusivity. That’s just a very, very important
general principle. One of the things we do to check our playlist
is we make a preliminary plan. So, we just take [inaudible 003225] our sort of data dump
from as many databases as we can get, and we use ThermoBLAST to create that playlist,
that list of sequences. If you use ThermoBLAST then download the playlist, and then that
download, what’s nice about it is it’s a very short summary of everything in your playlist.
And one of the nice features is it gives the length of each accession. So, you can use that length, and I recommend
sorting by the length. What you can find immediately is some of the sequences are very short, and
if they’re very short, that’s something you can weed out of your list. For example, you
could also do this from the graphical user interface. You can go to the playlist management
part of our software, and it shows you, for each one, you can see the name of the accession,
sort of the annotation, and you can see the length of those. You can see some of these
HIV genomes are full-length, 9,500 nucleotides or so, and some are just very short fragments
here. And we don’t want to be mixing and matching
short fragments with long fragments in the inclusivity, unless you’re targeting a specific
gene. Generally, you want to have full-length genomes. In the case of HIV, there’s no shortage
of genomes. There’s more than 10,000 full-length genomes that have been sequenced to date.
So, you don’t want to have all this stuff. The reason you don’t want the short sequences
in there is that it will bias the design to try to cover as many of those short sequences
as possible, which means that it will discount the whole rest of the genome. And if some
of your sequences are [inaudible 003359] to one gene, and some are [inaudible 003401]
to a different gene, you’ll get different results there. Or, poor results for the coverage.
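As a rough illustration of the sort-by-length check described above, here is a minimal Python sketch. It assumes you have already pulled accession names and lengths out of the downloaded playlist summary into a list of pairs. The accession names, the `split_by_length` helper, and the 90%-of-median cutoff are all hypothetical choices for this example, not PanelPlex or ThermoBLAST features.

```python
from statistics import median

def split_by_length(records, min_fraction=0.9):
    """Split (accession, length) pairs into full-length and short entries.

    The cutoff is a fraction of the median length; that rule is an
    assumption made for this sketch, not something PanelPlex prescribes.
    """
    ordered = sorted(records, key=lambda rec: rec[1])  # shortest first
    cutoff = min_fraction * median(length for _, length in ordered)
    full = [rec for rec in ordered if rec[1] >= cutoff]
    short = [rec for rec in ordered if rec[1] < cutoff]
    return full, short

# Hypothetical playlist entries: (accession, length in nucleotides)
records = [
    ("seq_A", 9500),
    ("seq_B", 1200),
    ("seq_C", 9480),
    ("seq_D", 800),
    ("seq_E", 9510),
]

full, short = split_by_length(records)
print("keep:", [name for name, _ in full])
print("review or remove:", [name for name, _ in short])
```

Sorting first means the short fragments land at the top of the list, which mirrors what you would see after sorting the downloaded summary by its length column.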
So, generally, in the inclusivity list, we want full-length is the best [inaudible 003412]. Generally we recommend that our users try
to use the full genome search; let the program tell you what the best sequence is to go to,
but sometimes people have strong feelings about a particular gene they want to go to,
and you can use PanelPlex to specify a particular gene site. I will tell you that generally,
as a general rule, 16S ribosomal RNA is a very poor choice for a gene for making assays
because it’s conserved. It’s widely conserved throughout the entire bacterial domain. So,
it’s really not a good choice to try to get specific assays to a specific pathogen when
so many things are going to look very similar. A general rule of thumb is that members of
the inclusivity should be about 90% sequence-identical to each other to form a group. If you find
that your sequences have less sequence identity than that, if you have members of your inclusivity
that are more divergent than that, then that is an indication that you might want to break
your panel up into two or more panels, two or more separate designs, because at that
point, they’re really not the same virus. If they’re more than 10% different, most biologists
consider that to be a different species of the virus. So, you’re really dealing with
two separate things. For example, we did a test once for respiratory
syncytial virus, RSV, and we at first just dumped all the RSVs we had into one list,
and then we found [inaudible 003542] coverage is around 50% and realized, “Oh, this is really
RSV-A and RSV-B.” So, sometimes you don’t realize that until you run the test. Another thing for quality is to look for these
sequences… you know, if you actually look at your sequences, you can see they have a
lot of ambiguity codes in them. Those are poorly sequenced, they’re usually not very
reliable; the software will allow you to use them, but it will avoid those ends in the
design. And if you have several of these with sort of a shotgun, parts of it that you can’t
design to, and you have multiple of them like that, it will result in not getting a successful
design. Making a successful exclusivity list here,
with the exclusivity… the exclusivity is much more tolerant. You can put as many poorly
sequenced genomes or partial genomes in there as you want. The exclusivity cannot contain
genomes that are greater than 90% similar to the inclusivity, all right? So, as I said,
with the [inaudible 003637] there, if you have a sequence that looks like the inclusivity
but it’s in your exclusivity, that will really mess up the design as it’s trying to make
things that are specific for the inclusivity. So, as I mentioned earlier, this exclusivity,
you want to put near-neighbors in this, but you don’t want them to be too near, okay?
They have to be less than 90% similar. All right, for the background, the background
is even more permissive. This is a place where you put in… you can think about putting
in the kitchen sink here. Human genome, human RefSeq, the human microbiome, soil microbes,
things of that nature. You can put in a lot of things into this background that are things
that might be contaminants in your sample that might cause a false positive. You don’t want to put all of NR in here, the
non-redundant database. Why? Because it’s very unlikely that the chimpanzee genome is
going to be a contaminant in your actual sample. So, you don’t want the chimpanzee genome,
or the rat genome, or the mouse genome, or all the other genomes that are in GenBank,
you don’t want those here. You just want things that could possibly be a contaminant in
your design, or in your sample. Now, another topic, I alluded to it a little
bit, is the idea of separating that inclusivity into groups. I had that idea with the respiratory
syncytial virus. It really was two different viruses and needed to be split into two different
ones. How do you know which ones to put in which? Well, one way is just run PanelPlex and you
see 50% coverage, okay, take those 50%; that’s one group. All the ones that were not covered,
that’s a separate group, and we do now have a place where it tells you the accessions
that were not covered. So, you could just do that. Another is to do this up-front work with a
multiple sequence alignment algorithm. Now, this is something, again, we’re working at
DNA Software to make a version of multiple sequence alignment that would allow you to
do this automatically. But for now, you would have to do this yourself. So, we recommend
the tool at the European [inaudible 003838] Institute called [inaudible 003841]. This
works great for up to 1,000 sequences. If you have more than 1,000, it’s not going to
work. If you have long genomes, that also doesn’t work. So, this works mainly for viruses,
not so well for bacteria. So, this problem’s with limited number sequences,
limited lengths, are one of the reasons we want to make a native multiple sequence alignment
algorithm for PanelPlex. That will be coming out in the future. But one of the things you
can do is if you have a [inaudible 003911] and it has not more than 1,000 of them, then
you can do these sequence alignments, and these produce a phylogram. The phylogram
shows you the groups that are most closely related; in this case, showing an example
with the Lassa virus. We found that there were seven different groups
that Lassa… and [inaudible 003930] seven different groups, within the group, they had
93% similarity to each other, average similarity. That, I call Group A. Group B had 93% also.
Group C, 97% similarity, et cetera. So, much better to start off trying to get these separate
assays for each of these different groups rather than trying to do one assay for Lassa, which
is impossible. The virus was in fact seven different viruses. They’re all called Lassa,
but they’re really so different from each other that they really are different viruses. So, doing the multiple sequence alignment
is helpful for that. Another challenge is this idea of the keystone. How do you choose
the best keystone sequence? First of all, what is a keystone? A keystone is the sequence
that is going to be used to do design. When we do design of the primers, it actually is
only designing to one sequence. That is the keystone sequence, and it’s going to make
primers that are a perfect match to that keystone sequence. Now, the design is only to one, but then it’s
checking to see, we actually score each one of those primers, how many of the inclusivity
does it cover? How many of the exclusivity does it falsely hit? Right? So, that design
is actually taking into account the inclusivity/exclusivity in an indirect way. But the actual thermodynamics
of designing primers is done on the keystone. What you want to do is, the idea here is you
want to find the sequence that is the most like all the other sequences in the list,
the inclusivity list. And I have an analogy here to just distance. Consider these five
dots here that are spaced away from one another. If I just asked you which of those five points
is closest to all the other points, maybe by eye you would recognize that point three,
point number three here, is the closest to all others. But suppose you didn’t know that. How would
you go about rigorously determining the point that’s closest to all other points? Well,
I could measure the distance from one to two, from one to three, one to four, one to five,
and get the average distance from one to all others. All right? If I do that, or some [inaudible
004145] distance; in this case, I get 10.5 if I add up these distances. All right? If
I then look at point number two and get all of its distances to everything else, well,
it’s close to one, it’s a little further from three, four and five. If I added those distances, point number two
has a total sum distance of 7.5. Point three has a sum distance [inaudible 004205] six.
Point four to all others, 6.5; point five has 9.5. So, you can see the one with the
shortest distance to all others, adding them all together, is point number three. Now, the equivalent of that idea, this point
number three here is the one that we would call our keystone. That’s the one most similar
to all others. We want to find something similar to that with our sequence databases, and we
can also use a sequence alignment algorithm to do that. But one of the outputs of the
multiple sequence alignment algorithm is this percent identity matrix; it’s a key output.
This percent identity matrix compares, in this case, all 45 sequences to each other
to figure out their percent identity for each [inaudible 004251] percent identity. If I take the first sequence and I compare
it to all the other sequences in the list, those are their percent identities, and I
average those, I got 91%. So, this sequence is 91% similar, on average, to all the others.
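The averaging step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the percent identity matrix is taken to be a square list of lists exported from whatever alignment tool you used, and the sequence names, the numbers, and the `pick_keystone` helper are made up for the example rather than taken from PanelPlex.

```python
def pick_keystone(names, identity):
    """Return (name, mean identity) of the sequence most similar to all others.

    identity[i][j] is the percent identity between sequence i and sequence j.
    The diagonal (a sequence compared with itself) is left out of the average.
    Needs at least two sequences.
    """
    best_name, best_mean = None, -1.0
    for i, name in enumerate(names):
        others = [identity[i][j] for j in range(len(names)) if j != i]
        mean = sum(others) / len(others)
        if mean > best_mean:
            best_name, best_mean = name, mean
    return best_name, best_mean

# Hypothetical 4 x 4 percent identity matrix (symmetric, 100 on the diagonal)
names = ["s1", "s2", "s3", "s4"]
identity = [
    [100, 92, 95, 90],
    [92, 100, 96, 91],
    [95, 96, 100, 97],
    [90, 91, 97, 100],
]

keystone, mean_identity = pick_keystone(names, identity)
print(keystone, mean_identity)  # s3 96.0
```

This is the same idea as the five-dot analogy, except that a higher average identity plays the role of a shorter total distance.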
If I do that for each one, I find out over here, this sequence, I look at its percent
identity to all others, it averages 97% identity to all other sequences in the list. So, this
one is the center of the phylogeny. In other words, this sequence, like that point number
three, it’s the most similar to all others, right? You can see this is great. 97% identity,
97.8, this would be… Think about the challenge that we’re giving
our target analysis or a fast compare algorithm. You want to find the conserved regions, right?
So, finding conserved regions where all of them are within 97% of each other, is a lot
better than finding conserved regions where they’re 90% identical to each other. This
brings us to then talking about some of the challenges of bacteria versus viruses. Viruses,
well, they’re small, so that makes them easier, between 1,000 and 40,000 [inaudible 004358]
typically, unless you’re dealing with a [inaudible 004400] virus. But most viruses are below
40,000. The genomes are linear; they’re usually complete,
high-quality genomes. But usually, there are a lot more virus genome variants, more variation
in viruses than there is in bacteria. So, that idea of more sequence variation requires a
very thorough algorithm in order to find regions that are conserved, and our target analysis
algorithm [inaudible 004423] is appropriate for dealing with those cases. But it has a
limit. It only goes up to 40,000 nucleotides. So, either if you were doing a virus because
it’s less than 40,000… if you’re doing a bacteria, you’re going to need to specify
your [inaudible 004436] what gene you want to do. Now, bacteria have their own challenges.
For one thing, bacteria are much larger than viruses, about 1,000 times larger. Also, their
genomes are usually circular genomes. That creates its own problems. So, there’s a circular
permutation of the genes; that is, there’s not an agreed-upon place, if it’s a circle,
where to start the genome sequence. So, different groups will submit their genome
sequences with different starting locations, and that is very bad for doing sequence alignments.
In addition, there’s no agreement on which strand is the sense strand, and which strand is the
antisense strand. Well, it’s a circular genome, and you can’t really tell. So, unless everyone’s
agreeing in a particular bacterial field which one’s sense or antisense, then you get a mixture
of submitted bacterial genomes, where some are sense and some are antisense. This can wreak havoc with doing design, and
we need to deal with all that. So, we’ve made this new algorithm called fast compare, which
takes all of these issues into account and allows us to run whole genome search even
for bacteria, and takes into account that they’re circular genomes. All these things
are taken into account in a transparent way. All you need to do is there’ll be a new feature
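To make the circular-permutation and strand problem concrete, here is a small Python sketch. It is not how fast compare works internally, since the talk does not describe that; it only shows why two submissions of the same circular genome can look different as linear strings. The toy sequences and the `same_circular_genome` helper are assumptions for illustration, and the check only handles exact matches over uppercase A, C, G, T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def same_circular_genome(a, b):
    """True if b is a with a different start point and/or the other strand.

    Every rotation of a circular sequence appears as a substring of the
    sequence written twice in a row, so one substring test covers all
    possible start positions.
    """
    if len(a) != len(b):
        return False
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled

# Toy circular "genome" and two alternative submissions of it
genome = "AAACCGT"
rotated = genome[4:] + genome[:4]            # same strand, different start
other_strand = reverse_complement(rotated)   # different start and strand

print(same_circular_genome(genome, rotated))
print(same_circular_genome(genome, other_strand))
print(same_circular_genome(genome, "AAACCGG"))
```

A plain linear alignment of `genome` against `rotated` or `other_strand` would report poor identity even though all three describe the same molecule, which is the situation described above.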
called [inaudible 004553] usage that, you just select Use Fast Compare. If your sequence is longer than 40,000, it’ll
force you to use fast compare. If your sequence is shorter than 40,000 you can choose either
of the two methods. Here’s PanelPlex Consensus, when you just sort of choose a… Here’s PanelPlex Consensus. Here we need to
choose an inclusivity and an exclusivity and a background, so I would have had to pre-make
these panels. I already made a Zika virus inclusivity panel, a Zika virus exclusivity,
and you could choose the human genome as your background, for example, or whatever playlist
you make up. All right? Next… Here we have a choice of choosing different
keystones. Sometimes there’s a good reason, you know a keystone… like, this Brazil isolate
is one of the clinically most important isolates, and I might choose that as my keystone. Another
idea would be for you to just suggest a keystone, because we did that multiple sequence alignment
ahead of time and you know which one you really want, but that’s a general principle there
for choosing the keystone. Beyond that, I’ll just continue showing you
a few more of the details here. We can give a name for some bacteria. Oh, this is the
Zika virus; I’ll give it a Zika name here. You can choose whether you want to have TaqMan
detection, or whether you want to have the sequencing primers; you can do either. If you
choose whole target, then it will try to use the entire… Zika virus is about 10,000 nucleotides.
You don’t need to find the gene; it will find the most conserved sequences among the Zika
virus inclusivity that you gave: conserved in the inclusivity, but not found in the exclusivity,
and also not hitting the background. You can also specify a design range; if you
know a particular gene and you know that it occurs between nucleotides 1,000 and
3,000, then you can limit it to a particular gene if you want. But generally, if your sequence
is less than 40,000 nucleotides, we recommend that you use whole target. Now, this version
of the software doesn’t have the fast compare in there yet. Once you have fast compare there,
you’ll be able to do whole target even for bacteria, but you’ll have to use
fast compare for it. Moving forward, yes, again, we have some basic
hybridization conditions and strand concentrations. I don’t think we need to belabor those points.
In the advanced configuration, I do want to go into here and give you a few words about
this particular advanced parameter. Some of the sort of highlights. Number of candidates to output; here, the
default is 100 candidates, so it’s giving me 100 primer pairs. Usually that is more
than enough for you to find some really great primer solution. This setting right here is
a very important one: number of primer pairs per solution. So, this is
if… let’s say in the Zika virus case, if it’s able to find a set of primers that will
cover 100%, all 168 variants of the Zika virus, then one solution is enough. A singleplex
is enough to cover that assay. If, on the other hand, let’s say 80% of the
Zika viruses are covered with one primer set, but 20% are not. Then the program will cycle
through and it will, as shown here, it will find a set of final designs, let’s say coverage
is 80%, so the 20% that are not covered will get redesigned automatically. [inaudible 004917]
come through and make another set of primers, so that’s what will happen on round two, a
separate set of primers that are compatible with the first set of primers, so, a multiplex
will be made here. And this will allow up to three rounds of
design. If it finds a good solution after one round, it’ll stop, but it’ll do up to
three rounds. So, if it’s going three rounds, it’s literally doing three full rounds of
design, so the program will take longer to run if you give it… A minimum amplicon [inaudible 004947]. This
says that a primer must bind at least 30% to be considered a valid primer. So, if you
have a very folded target, like an RNA target, and it is impossible to bind that tightly,
then you’ll get a failure. The program will say, “Low amplicon percent [inaudible 005006].”
We’ve been working very hard, by the way, to make error messages more informative for
the user, as we’ve heard your pain on those issues where, “Hey, it failed but I don’t
know why.” We’re really trying to figure out the reasons why failures happen, and give
informative answers to the user that allow them to adjust their panels appropriately. So, this is one, if you have an RNA target,
you could consider lowering that. It’s okay. If it’s 25%, it might be okay. It still works.
But typically, it’s only in that first round of PCR that you get low binding; after
it starts making the amplicon, then it is okay, but getting that initial round
of PCR to work is the most challenging part, and that filter is very important. Changing the length ranges can also help.
If you can tolerate longer amplicons, then by all means, give it a little freedom to
have a larger amplicon gap here. A maximum amplicon gap of even 200 is just fine for pathogen
diagnostics. I think that’s pretty much the advanced parameters
that I wanted to cover for you today for regular PanelPlex Consensus…
