Hacker News

Nanopores have unacceptably high error rates, around 10%.


Is this an accuracy or precision issue? I am imagining that if you actually have access to the device, you could do as many runs as you want, getting to arbitrarily low error rates.


This is a common misconception - "averaging out" errors only works if the errors are rare at any given site. That's true for some types of errors and sequencing technologies, but not universally. Some types of DNA sequences (most notably homopolymers and other simple repeats) are very difficult to sequence correctly, and X% of the reads there will be incorrect. If X > 20% or so, it may look like real germline variation no matter how many reads are sequenced.
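To make that concrete, here's a toy Python simulation (all names and numbers are made up for illustration). With random, independent errors, a majority vote recovers the true base; with a systematic error that recurs at the same site, more depth just confirms the wrong signal.

```python
import random

random.seed(0)

def consensus(base_calls):
    # Majority vote across reads at a single site.
    return max(set(base_calls), key=base_calls.count)

true_base = "A"
depth = 1000

# Random errors: each read is independently wrong 10% of the time,
# with the wrong base chosen uniformly -- averaging works here.
random_calls = [true_base if random.random() > 0.10
                else random.choice("CGT") for _ in range(depth)]
print(consensus(random_calls))  # "A" -- errors wash out

# Systematic error: 25% of reads make the *same* mistake (A -> G),
# e.g. at a homopolymer. No amount of depth removes the G signal,
# and ~25% is in the range you'd expect for real germline variation.
systematic_calls = [true_base if random.random() > 0.25 else "G"
                    for _ in range(depth)]
frac_g = systematic_calls.count("G") / depth
print(round(frac_g, 2))  # stays near 0.25 at any depth
```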


The errors are non-random. That's why they use machine learning to figure out those errors. You could, of course, also just do traditional statistics on sequences that you sequence all the time. I've done that with plasmids before, and it works pretty well. I think there are a few papers on it too.


> The errors are non-random.

Could you elaborate / give an example? Are the errors deterministic? Is it like ISI (Inter-Symbol Interference[1]) in signal processing, where some symbols interfere with the reception of the next symbol(s)? Are there short range errors (one letter) or long continuous errors?

[1] https://en.wikipedia.org/wiki/Intersymbol_interference


It's a complicated issue; I tend to think of the error component of any one MinION observation as being a function of the k-mer in the pore at the time (i.e. the subject of the observation) and, with some decaying dependence, the sequences (i.e. in both directions) that extend out from either side of the target k-mer. You might say that MinION error is a function of the target k-mer and its immediate environment.

It gets even messier when you try to imagine the form of that function; for one, it's not _completely_ good enough to remain in sequence space alone: among other things, the "shape" (i.e. the conformation) of that (DNA or RNA) molecule around the target k-mer will influence how the shape of the pore will change in response to the target k-mer, which, in turn, will influence the observed current signal (i.e. manifest as a deviation from the "expected" or "ideal" current signal for that k-mer!).

As I understand it, Nanopore don't spend too much time actually modelling k-mer-in-pore dwell-mechanics; instead their best base callers use machine learning to generalise across the swathes of available sequencing data for known targets (and give really quite impressive results, all things considered).


https://gist.github.com/Koeng101/abc674e1acd575646748afcbcc7...

There is a real example I ran a few months ago. How to read it is here https://en.m.wikipedia.org/wiki/Pileup_format

Positions like 172 have errors more often than not because the basecaller is sometimes wrong (note: this is from a sequence-verified sample).
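For anyone who doesn't want to wade through the Wikipedia page, here's a rough sketch of how you'd tally the per-position mismatch rate from the fifth column of a pileup line (a simplified parser handling only the common escapes, not a full implementation):

```python
def mismatch_rate(read_bases):
    # Parse the read-bases column of one pileup line and return the
    # fraction of calls disagreeing with the reference. Handles the
    # common escapes: '^X' (read start + mapping quality), '$' (read
    # end), '+N...'/'-N...' indel runs, and '*' (deleted base).
    matches = mismatches = 0
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":            # read start: also skip the mapq char
            i += 2
        elif c == "$":          # read end marker
            i += 1
        elif c in "+-":         # indel: skip length digits, then bases
            i += 1
            num = ""
            while i < len(read_bases) and read_bases[i].isdigit():
                num += read_bases[i]
                i += 1
            i += int(num)
            mismatches += 1     # count the indel as one disagreement
        elif c in ".,":         # match on forward/reverse strand
            matches += 1
            i += 1
        elif c.upper() in "ACGTN" or c == "*":
            mismatches += 1     # substitution or deleted base
            i += 1
        else:
            i += 1
    total = matches + mismatches
    return mismatches / total if total else 0.0

# A made-up column in the style of position 172: mostly matches,
# plus a recurrent miscalled G and a deletion placeholder.
print(round(mismatch_rate("..,,GG,.*,g.."), 2))  # → 0.31
```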

The errors come up more often in some sequences than they do in others. I’m not really sure about symbol processing, but if you have any beginner resources for that I’d appreciate them!


don't know why this was downvoted. If I'm not mistaken, there is generally a high error rate per pore fundamentally because it's a single-molecule experiment. These errors get averaged out, but the reads may be difficult to align, so it isn't necessarily a straightforward averaging. There are also segments that are fundamentally difficult to sequence correctly (single-nucleotide runs, not even at a super high n) that will probably never get satisfactorily resolved no matter how many times you sequence.


Are you sure about that? My last consensus run gave complete coverage of a ~410 bp region. Here is a gist of the raw pileup without consensus - https://gist.github.com/Koeng101/abc674e1acd575646748afcbcc7...

Visually, I think, you can see that it isn't THAT bad (low coverage at the ends is because of how I barcoded the sequences).

I hate to be that guy, but have you actually used the technology? And if so, approximately what year? Unacceptable for what procedure? Do you have any raw reads that have been troubling you?


They mean at genome-wide scales. If you are just doing a 410 bp region, the sequence is short enough that the signal is going to crush the noise you get from strands slipping in the pores.

The errors nanopores get are gaps, not base pair substitutions. So with things like viral or bacterial sequencing you don't really have huge issues.

When you are doing large eukaryotic sequences with lower coverage on average, you start picking up a lot of deletion artifacts, which isn't a huge deal if you have a very well-annotated genome like human, but if you are doing pioneer genomics it can create some difficulties. If the genome isn't well annotated, it's often best to pair nanopore with short reads.


The gaps are usually homopolymers and such, which should get helped by R10 pores. But true, at low coverage, things can get tougher!


That all depends on what you want to do with the data. For assembling new genomes, they produce very long reads that are essential for "scaffolding". They're also great for structural variant detection (large rearrangements of DNA). DNA sequencing is not a monolith, and there's room for lots of different complementary technologies.


It should be noted that the "errors" in this case are gaps in sequence. Sometimes the DNA strand slips through the pore and some bases aren't called.

The actual base calling is on par with HiSeq in my experience. In software terms, you're missing chunks of code, but you aren't flipping bits.
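A toy illustration of that "missing chunks" behaviour (made-up sequences, just to show the shape of the error): homopolymer runs tend to come out short, and an aligner sees that as a gap rather than a substitution.

```python
from difflib import SequenceMatcher

ref  = "ACGTAAAAAACGT"  # reference with a six-A homopolymer
read = "ACGTAAAACGT"    # nanopore-style call: the run came out two A's short

# The only difference the aligner finds is a deletion in the read --
# a gap, not a flipped base.
diffs = [op for op in SequenceMatcher(None, ref, read).get_opcodes()
         if op[0] != "equal"]
print(diffs)  # a single 'delete' of two bases inside the A run
```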

This is important because in certain experiments, you care less about those gaps (scaffolding for example). So you can get a lot of cheap utility out of nanopore sequencing.


This is a common, and often justified, though not always fair, criticism. MinIONs have an error rate of around 10% for _any given base_. Moreover, these errors aren't entirely independent of one another, so if you struggle to sequence a given base the first time, you're likely also to struggle if you try again. That said, if your experiment is such that you're only sequencing a guaranteed single target (e.g. one, isolated coronavirus genome), then in that one sequencing run (on that one flow cell) you'll "re-sequence" any given region many times and, unless you're looking at "problematic" (i.e. low-complexity) regions, you _will_ be able to "average out" the errors to reveal the true target sequence. On the other hand, if you're trying to co-sequence a mixture of closely-related targets, that's when the headache starts...



