Quick math, that's 19 nucleotides with 4 possible bases at each location. 4^19 possible combinations. That's 1 in, wait for it...
274,877,906,944
EDIT: If that number looks big, it is. That's almost 275 BILLION.
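For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. The variable names are mine, and the uniform 1-in-4 chance per base is the same simplifying assumption as in the post.

```python
# Count the distinct 19-base sequences over the 4-letter alphabet {A, C, G, T}.
SEQ_LEN = 19        # length of the sequence discussed above
ALPHABET_SIZE = 4   # assumption: each base is equally likely (1 in 4)

combinations = ALPHABET_SIZE ** SEQ_LEN
p_single = 1 / combinations  # chance that one specific 19-base window matches

print(f"1 in {combinations:,}")  # 1 in 274,877,906,944
```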
But there's not just one way that sequence can occur. It could be at any point in the genome. A Google search tells me the genome is 29,903 bases long, and if we assume it could occur at any point, that's 274877906944/29903 = 9192318 (a bit less than ten million).
But of course this isn't the only sequence that would be notable. We have to account at least for every patented sequence, because each of them would be newsworthy, right? This source claims that over 40K human DNA sequences have been patented. Now, the COVID genome is neither DNA nor human, but we have to make some simplifying assumptions here since I couldn't find data on patented RNA sequences.
So we have 9192318/40000 = 229.8, or roughly 1 in 230. We could bring that number down even further if we consider that some sequences are more useful than others and thus more likely to appear naturally again and again (and the patented sequences are more likely to be notable, since they wouldn't be patented otherwise), but that's probably beyond my capabilities, and you get the point.
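The napkin math above can be reproduced in a few lines. This is only a sketch: the genome length and the 40K patent count are the figures quoted in this comment, every base is assumed independent and uniform, and the "slightly more careful" part at the end (counting the 19-base windows and treating them as independent) is my own addition, not something from the thread.

```python
import math

SEQ_LEN = 19
GENOME_LEN = 29_903   # genome length quoted above
PATENTED = 40_000     # patented-sequence count quoted above (human DNA, used as a stand-in)

combinations = 4 ** SEQ_LEN

# Step 1: the sequence could start almost anywhere, so divide by the genome length.
one_in_anywhere = combinations / GENOME_LEN

# Step 2: any of the patented sequences would have been notable, so divide again.
one_in_any_patented = one_in_anywhere / PATENTED

print(f"anywhere in the genome: about 1 in {one_in_anywhere:,.0f}")
print(f"any patented sequence:  about 1 in {one_in_any_patented:,.1f}")

# Slightly more careful (assumption: windows treated as independent):
# a 19-base sequence has GENOME_LEN - SEQ_LEN + 1 possible start positions,
# and P(at least one match) = 1 - (1 - p)^windows.
windows = GENOME_LEN - SEQ_LEN + 1
p_single = 1 / combinations
p_at_least_one = -math.expm1(windows * math.log1p(-p_single))
print(f"at least one match:     about 1 in {1 / p_at_least_one:,.0f}")
```

Because the per-window probability is so tiny, the careful version lands very close to the simple division, which is why the shortcut is fine for a rough estimate.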
There is only one way for this sequence to occur: for those bases to occur in that exact order. Where you make the cuts is irrelevant. Those are the odds of that specific sequence occurring at any point. There's a 1 in 4 chance of having cytosine at any point in the genome. There's a 1 in 4 chance that it will be followed by thymine. There's a 1 in 4 chance that that will be followed by cytosine again, etc. What I gave was an absolute value: the odds of this specific sequence occurring anywhere.
Your calculation presupposes that the genome of the virus is exactly as long as the sequence; that's the mistake I'm pointing out. Think about it this way. Let's say I tell you that I've written down a random five-letter word (each letter being completely random, that is). Now I ask you to calculate the odds of it being "horse". You would correctly say that that's one in 26*26*26*26*26. However, let's say I give you a book composed of random letters instead (the whole genome of the virus). Wouldn't you agree that the chance of it containing the word "horse" anywhere in it is much higher?
Now let's say I give you a whole dictionary of meaningful words and phrases (in our analogy that's all the patented genome sequences). I'm sure you understand that the probability of at least one word being contained in the book is even higher in that case. And now consider that in our problem the letters aren't random at all, because some genome sequences are more useful than others and the useful sequences are much more likely to be patented.
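To put rough numbers on the "horse" analogy, here is a small sketch. The function name `p_contains`, the 500,000-letter book, and the 1,000-word dictionary are all made up for illustration. It also treats every window of the text as independent and every dictionary word as a distinct five-letter word, so it is an approximation rather than an exact result.

```python
import math

def p_contains(alphabet_size, word_len, text_len, n_words=1):
    """Approximate chance that random text contains at least one of
    n_words distinct target words of length word_len.

    Assumption: each window of the text is treated as independent.
    """
    windows = text_len - word_len + 1
    if windows <= 0:
        return 0.0
    # Chance that a single window matches any one of the target words.
    p_window = min(1.0, n_words / alphabet_size ** word_len)
    if p_window >= 1.0:
        return 1.0
    # 1 - (1 - p_window)^windows, computed stably for tiny p_window.
    return -math.expm1(windows * math.log1p(-p_window))

single_word = p_contains(26, 5, 5)                      # one random five-letter word
book = p_contains(26, 5, 500_000)                       # hypothetical 500,000-letter book
dictionary = p_contains(26, 5, 500_000, n_words=1_000)  # hypothetical 1,000-word dictionary

print(f"one word:        {single_word:.3g}")
print(f"whole book:      {book:.3g}")
print(f"book + dictionary: {dictionary:.3g}")
```

The three values go up in that order, which is the whole point: more places to look and more things that count as a hit both raise the chance of finding something.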
That's what makes calculating probabilities of such real-world events so tricky. I'm not an expert either, and I wouldn't be surprised if my back-of-the-napkin calculation is quite off. But those are the sorts of things you need to take into account to arrive at a reasonable conclusion.
I make no presupposition regarding the length of the genome. Read what I said again. Go to any point within the entire genome and you have a 25% chance of finding any particular nucleotide base. Pluck any string of 19 out of the genome; regardless of how long the entire genome is, this is the chance that it will match this sequence. That is the calculation I did. True, if there are multiple sets to choose from, that increases the odds of finding this sequence, but that isn't the calculation I did. I gave the chance that any single set of 19 matches. I never tried to represent it as anything else.