Big data is the latest Big Concept to sweep the business world. The term describes collections of data sets so large -- scary large, as I like to think about it -- that they become difficult to process using traditional applications. Dedicated server farms are necessary to collect, store, and process terabytes to petabytes of information in order to spot patterns, extract pertinent results, and do everything else one might want.
Such crazy big data was once the sole playground of physicists simulating nuclear weapons or analyzing the results generated by multi-billion-dollar particle colliders. Weather and climate simulations were also a big favorite, with meteorologists pushing for greater fidelity to predict how much snow will fall and how long it will last.
Today, commerce is a big driver for big data. Walmart processes upwards of a million customer transactions every hour. Amazon and eBay both have huge data warehouses geared to search, consumer recommendations, and the all-important purchase process. FedEx wouldn't exist in its current form without big data, which lets it keep minute track of every package from door to door and work out the best way to optimize its delivery methods on a daily basis.
But the applications listed above have only begun to scratch the surface of what big data will be in the next decade. There are at least three fields that are going to add to the ranks of big data: health, voice, and video.
DNA sequencing continues to get faster and cheaper every year, in some respects handily beating Moore's Law in its rate of improvement over the past decade. The human genome takes up around 8 gigabytes of data. Isolating genetic diseases or risk factors among a population may require comparing the DNA of 10,000 to 25,000 people. Multiply 8 GB by 25,000 and you get about 200 terabytes of genome data alone, which in practice means a very large number of servers, multi-petabyte disk arrays, and building LANs that are hard-pressed to keep up at 10 Gbps when the serious analysis kicks in.
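The multiplication above is easy to sketch. The snippet below is only a back-of-the-envelope estimate: it takes the 8 GB per genome figure from the text at face value, uses decimal units (1 TB = 1,000 GB), and ignores any working copies or intermediate files an analysis would create. The function name `cohort_storage_tb` is made up for illustration.

```python
# Back-of-the-envelope storage estimate for a genome cohort.
# Assumption: 8 GB per genome, as quoted in the text; decimal units.
GENOME_GB = 8


def cohort_storage_tb(people: int, genome_gb: float = GENOME_GB) -> float:
    """Return the raw storage needed for `people` genomes, in terabytes."""
    return people * genome_gb / 1000


for people in (10_000, 25_000):
    print(f"{people:,} genomes -> {cohort_storage_tb(people):,.0f} TB")
```

At the low end of the range that works out to 80 TB, and at the high end to 200 TB, before a single comparison has been run.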
Voice, with a few exceptions, has been treated as disposable garbage over the past 15 years. Calls are made, people scribble a few notes, which may (or may not) be accurate, and then rounds of email are sent to clarify what was said and meant in a person-to-person or conference call.
Large call centers have been at the forefront of treating voice as information. They record calls, convert speech into text, and then pound away at all that text with voice analytics packages to identify best (and worst) practices among customer agents, distill competitive intelligence from the daily inbound call flow, and provide a contact record in case of customer disputes. Take the number of calls an agent handles per hour, multiply by the number of agent seats, and then multiply again by the hours per year the contact center takes calls to get a whopping big amount of raw data that needs to be processed.
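To put a rough number on that multiplication, here is a minimal sketch. Every input is a hypothetical placeholder rather than a figure from any real contact center: 10 calls per agent per hour, 500 seats, 4,000 operating hours a year, and an average call of five minutes. The 64 kbps bitrate is the rate of uncompressed telephone-grade audio (G.711); compressed recordings would be smaller.

```python
# Rough estimate of the audio a contact center records in a year.
# All example inputs below are hypothetical placeholders.


def annual_call_audio(calls_per_agent_hour, agent_seats, hours_per_year,
                      avg_call_seconds, bitrate_kbps=64):
    """Return (number of calls per year, recorded audio in terabytes)."""
    # The multiplication described in the text: calls x seats x hours.
    calls = calls_per_agent_hour * agent_seats * hours_per_year
    # Size of one recording: seconds x kilobits per second, converted to bytes.
    bytes_per_call = avg_call_seconds * bitrate_kbps * 1000 / 8
    return calls, calls * bytes_per_call / 1e12


calls, terabytes = annual_call_audio(
    calls_per_agent_hour=10, agent_seats=500,
    hours_per_year=4000, avg_call_seconds=300)
print(f"{calls:,} calls, about {terabytes:.0f} TB of audio per year")
```

With those made-up inputs the center records 20 million calls and roughly 48 TB of audio a year, all of which has to be turned into text before the analytics can start.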
One spin on voice as information is a concept called HyperVoice. Take phone calls, conference calls, speeches, and the audio tracks of videos, and index everything into small sound bites, searchable by keyword. You could look through all the speeches and presentations of Steve Jobs and see where he talks about brilliant architecture, then cross-reference it to architectural presentations made by anyone else in the world. It would essentially be Google for the spoken word, regardless of the medium.
Finally, video and imagery are going to get a good shot in the arm, with the introduction of 4K UHD TVs in the consumer world on one hand and small satellites on the other. The consumer electronics industry needs the next big thing to move TV sets. With the 3-D effort having failed, UHD TV is the only game in town, so there's going to be a cycle of filming more content at higher resolution, followed by distributing it via broadcast and broadband.
Two Silicon Valley companies -- Planet Labs and Skybox Imaging -- are in the process of building a "cloud" of satellites to take pictures of the Earth's surface on a daily basis. Planet Labs is putting around 32 small satellites into orbit to take pictures with a resolution of 3-5 meters, while Skybox will sell pictures at a resolution of around 1 meter using a constellation of 24 or so satellites.
Both companies are going to gather massive amounts of imagery and video over time, allowing customers to observe changes on Earth, follow the daily movement of traffic on land and sea, and tell the difference between car and truck traffic, for example.
Regardless of the type of data, any big data center will need the fastest LAN connections possible. The starting point should be all-fiber 10 GigE connections at the very least, with 100 GigE -- or faster -- preferable when the data sets start moving around.
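A quick sketch shows why the faster link matters. It estimates how long it takes to move a 200 TB data set, roughly the 25,000-genome example at 8 GB apiece, over a single link. This is a best case under stated assumptions: the link is fully saturated the whole time, protocol overhead and disk speed are ignored, and units are decimal. The function name `transfer_hours` is made up for illustration.

```python
# Best-case time to move a data set across one network link.
# Assumptions: link fully saturated, no protocol overhead, decimal units.


def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Return the hours needed to send `size_tb` terabytes at `link_gbps`."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # bits / (bits per second)
    return seconds / 3600


for gbps in (10, 100):
    print(f"200 TB over {gbps} GigE: {transfer_hours(200, gbps):.1f} hours")
```

Even in this idealized case the 10 GigE link needs about 44 hours, close to two days, while 100 GigE brings it down to under four and a half hours.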