
Nice to see somebody pointing out the obvious bullshit regarding the Utah facility's real purpose (content storage, not metadata).

What kind of iron do you need to run NLP on a day's worth of content (let's limit "content" to voice transcriptions from phone calls)? Suppose I want to pick out "bomb" in near real-time? How many $billions for total information awareness?

Even the best algo is going to produce false positives. At nation scale, that is going to be a massive QA/QC effort.



I'm curious about the storage media they might have... is it just plain old rotational hard drives? Who made them -- Seagate? Western Digital? It's curious that no word has come out from that front -- "wow, this customer is requesting a _LOT_ of storage devices... like, more storage space than what is available in the world".

And really, because of this I think the idea that they might have some in-house innovation is not so far-fetched. Almost 10 years ago we started hearing a lot about InPhase Technologies and the like, with talk of DVD-sized discs that could store upwards of 6 terabytes, but it amounted mostly to vapourware. We do now have confirmed information from Seagate that they'll start shipping laptop-sized 2.5" rotational drives able to store around 60 TB of data within the next 2-3 years. Perhaps the NSA has been secretly working with them, if not just producing the devices for themselves?


Uncompressed, 1 hour of phone audio is only ~29MB (8KB/sec*3600s/h). Compressing it for storage can send that way down. Let's assume to 6MB. If all 300M Americans talked for an hour a day, that's only 2TB a day for call audio.

Edit: That's only one-way, so double. But compression can eliminate most of that, as there's usually only audio on one side of a call at a time. Anyways, even if I'm right within a factor of 10 or so, it doesn't really seem like a suspiciously high volume of disks.

(Disclaimer: I'm very drowsy, having just been woken up by a datacenter coolant failure, so maybe I miscalculated.)


Wouldn't the information be more useful in text form? Transcripts would take far less space and would probably be necessary for any sort of useful search. They might even call the transcript "meta-data".


Exactly, just speech-to-text the whole thing and you're done. Bonus: it's a lot speedier to search through.

Since you don't have to justify yourself in court, you're never going to have to submit the actual audio to any judge as evidence, so why would you keep it?


That only works if the NSA also has speech-to-text that is decades ahead of everyone else AND works perfectly well in the ~50 languages of interest. Otherwise transcription quality gets in the way.


Not necessarily - the speech to text just has to be good enough to flag likely use of interesting terms. Those calls could then be fully stored for later analysis by an investigation if it becomes necessary. The vast majority of calls would end up being stored as a text file, but this technique with today's technology would certainly be good enough to flag a reasonably high percentage of calls of interest for audio storage. That phone call from Jimmy to Sally Mae telling her he's going to be late home from work? It doesn't matter if NSA-Siri garbles the translation...
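
A minimal sketch of the triage idea described above, assuming a transcript is already available as text. The watchlist, the function name `triage_call`, and the two outcome labels are all hypothetical, chosen only for illustration; nothing here reflects how any real system works.

```python
import re

# Hypothetical watchlist; "bomb" is the example term used earlier in the thread.
WATCHLIST = {"bomb"}

def triage_call(transcript, watchlist=WATCHLIST):
    """Return 'keep_audio' if any watchlist word appears, else 'text_only'.

    The transcript may be garbled; only a rough word match is needed
    to decide whether the audio is worth keeping.
    """
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return "keep_audio" if words & watchlist else "text_only"

print(triage_call("I'm going to be late home from work"))
print(triage_call("he said BOMB twice on the call"))
```

A garbled transcript of an uninteresting call costs nothing in this scheme, since it is stored as text either way.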


You're off by three orders of magnitude.

6 MB * 300 million = 1716 TB
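
For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted in this thread (8 KB/s uncompressed, 6 MB per compressed hour, 300 million people, one hour each per day). The 1716 figure appears to come from dividing by 1024² rather than 10⁶; that reading is my assumption.

```python
# Figures taken from the thread; everything else is plain arithmetic.
BYTES_PER_SEC = 8_000          # uncompressed phone audio
PEOPLE = 300_000_000
COMPRESSED_MB_PER_HOUR = 6     # assumed compressed size of one hour

def uncompressed_mb_per_hour(bytes_per_sec=BYTES_PER_SEC):
    """Size of one hour of uncompressed audio in (decimal) MB."""
    return bytes_per_sec * 3600 / 1e6

def daily_total_mb(mb_per_person=COMPRESSED_MB_PER_HOUR, people=PEOPLE):
    """Total MB per day if every person talks for one hour."""
    return mb_per_person * people

total_mb = daily_total_mb()
print(uncompressed_mb_per_hour())   # MB for one uncompressed hour
print(total_mb / 1e6)               # TB, decimal units
print(total_mb / 1024**2)           # TB, binary units
```

Either way the answer is in the low thousands of TB per day, not 2 TB.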


You're right and that makes far more sense. I apologize for this idiotic calculation. I should have known better, because just capturing call signalling on one company's network was taking 1TB compressed a day. To be fair I did add a disclaimer :\.


You're right, but that's only the calls one way. It should really be multiplied by 2.

http://www.wolframalpha.com/input/?i=6+MB+*+300+million


See my comment about two-way. It wouldn't matter much as it's rare that both parties are speaking at the same time.

It's also possible that the compression techniques for long-term storage are vastly superior to realtime codecs. The lowest-bitrate realtime voice codecs run at 300-600 bits per second (they sound like shit), which is up to 213x compression (so an hour would be under a meg).

http://www.wolframalpha.com/input/?i=600bps*1h*300000000

81TB a day. Again, this is assuming one hour of calls for 300 million people.

I did a quick search and found this snippet: "A Telephia survey said that Americans average 13 talking hours a month – with the 18-24 age group averaging 22 hours."[1]

So that is under half an hour a day average. So, let's assume 300bps (lowest realtime voice codec I'm aware of), half hour a day, I'll stick to 300M people and we get:

http://www.wolframalpha.com/input/?i=300bps*30minutes*300000...

20TB.

So maybe I was only an order of magnitude off. Still pretty sloppy of me.

1: http://www.accuconference.com/blog/Cell-Phone-Statistics.asp...
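
The same estimates as a small sketch, in case anyone wants to plug in other bitrates or talk times. The function name `daily_terabytes` is made up for this example; the inputs (300 and 600 bps codecs, 64 kbps uncompressed, 300 million people) are the ones used above, and TB means 10¹² bytes.

```python
UNCOMPRESSED_BPS = 64_000  # 8 KB/s phone audio expressed in bits per second

def daily_terabytes(bits_per_sec, minutes_per_day, people=300_000_000):
    """Total storage per day in decimal TB for the given codec and talk time."""
    total_bits = bits_per_sec * minutes_per_day * 60 * people
    return total_bits / 8 / 1e12

def compression_ratio(bits_per_sec):
    """Ratio of uncompressed phone audio to the given codec bitrate."""
    return UNCOMPRESSED_BPS / bits_per_sec

print(daily_terabytes(600, 60))   # one hour a day at 600 bps
print(daily_terabytes(300, 30))   # half an hour a day at 300 bps
print(compression_ratio(300))     # versus 64 kbps uncompressed
print(600 * 3600 / 8 / 1000)      # KB for one hour at 600 bps
```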


Actually, it should probably be the logarithm of the population. The calls don't need to be recorded twice.


There are conspiracy theories out there that the whole flooding of the facilities, which led to a supply shortage, was just a cover-up for the NSA's order. I don't buy into such FUD, but until a few months ago I wouldn't have bought into the whole NSA thing either. So who knows.


I used to work at Western Digital. This is possibly the easiest conspiracy theory to debunk, since anyone who lived in Bangkok can confirm that the floods were very much real.


The idea was not that they made up the floods; the theory is that they made up the supply shortage and used the floods as a cover-up. Now, as you said, you worked for WD, so maybe you can confirm that their production really did take a huge hit. If so, well and good. As I said, I am not a fan of the theory either.


Ok, that makes slightly more sense. But still, all WD drives are assembled either in Thailand or in Malaysia. Something like 2/3 of all WD drives are made in Thailand. During the floods, production in Thailand was shut down completely for a month or so. The supply disruption was very much real.


I am sure a lot of it is tape. IBM sells a 900 PB tape library [1]. They could store the actual phone calls on tape, and then only put the necessary metadata for finding it on hard drives.

  [1] http://www-03.ibm.com/systems/storage/tape/ts3500/


That's good PR, too: you can tell the press that "the only thing on our hard drives is metadata" without (technically) lying.


In 1977, NSA were planning to have 157 GB of storage: http://cryptome.org/2013/03/cryptologs/cryptolog_28.pdf


Forget statements from Seagate or WD. Just look at their bottom line.


The NSA has quite a bit of in-house hardware manufacturing capability, but I don't know if they use it to make storage media.


The NSA probably purchases hardware from multiple vendors, through multiple fronts.


"How much storage" is an interesting question, but it says almost nothing about feasibility. If you have good-enough word-spotting as a front-end to storage, you can target your storage and analysis in a way that takes multiple orders of magnitude off your storage requirements. Then storage becomes like a "surveillance TiVO:" It saves the threads of conversations you want to watch, plus some historical buffer to cover the time it takes to make decisions about what you want to watch.

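The "surveillance TiVo" idea above can be sketched as a rolling buffer: keep a short history of recent chunks, and once something is flagged, save that history plus everything that follows. The class name `RollingRecorder`, the chunk granularity, and the `flagged` input are all assumptions for illustration; the flagging itself would come from whatever word-spotting front-end is in use.

```python
from collections import deque

class RollingRecorder:
    """Keep the last `history_chunks` chunks until something is flagged."""

    def __init__(self, history_chunks):
        # Old chunks fall off the left end automatically once the buffer is full.
        self.history = deque(maxlen=history_chunks)
        self.saved = []
        self.watching = False

    def feed(self, chunk, flagged):
        """Process one chunk; `flagged` is the word-spotter's verdict for it."""
        if flagged and not self.watching:
            # First hit: promote the buffered history to permanent storage.
            self.saved.extend(self.history)
            self.history.clear()
            self.watching = True
        if self.watching:
            self.saved.append(chunk)
        else:
            self.history.append(chunk)

rec = RollingRecorder(history_chunks=2)
for chunk, flagged in [("a", False), ("b", False), ("c", False),
                       ("d", True), ("e", False)]:
    rec.feed(chunk, flagged)
print(rec.saved)
```

Unflagged conversations only ever occupy the small rolling buffer, which is where the orders-of-magnitude saving would come from.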


I might have to scrape this...



