About 36% of organizations indicate they have or will have a big data initiative by the end of 2013. This means that more than a third of the organizations polled effectively have told Nemertes: “Our data growth rates were bad before, but we expect them to be much worse as we go forward.” We heard this from companies in many industries, including retail, media, financial services, and manufacturing.
The data comes from all sorts of sources, including:
- Traditional record-oriented relational databases
- Audio, image, or video objects
- Well-structured text data such as XML documents
- Real-time sources like sensor data feeds
- Messy, partially structured, text-based data from social media, office tools, or other sources
Consider these examples:
- Shifting to digital imaging in a hospital can turn a 1-kilobyte set of chart notes on an X-ray into a 100,000-kilobyte (roughly 100-megabyte) digital X-ray image plus accompanying map-linked annotations. These images will be used immediately, but also analyzed and re-analyzed over time for research and diagnostic purposes, and retained past the patient’s death.
- Production data in a factory, flowing from sensors on each production line, may begin to flow into a big data environment at a rate of 1 kilobyte per sensor per second for hundreds or thousands of sensors. That data will not only be folded into analysis continuously but also be retained for the next seven years before being deleted.
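To put the factory example in perspective, a back-of-the-envelope sizing shows how quickly retention requirements compound. The figures below (1,000 sensors, decimal units, no compression) are illustrative assumptions, not numbers from any specific deployment:

```python
# Rough sizing for the sensor-data example above.
# Assumptions (hypothetical): 1,000 sensors, 1 KB/sensor/second,
# decimal units (1 TB = 1e12 bytes), no compression or deduplication.
SENSORS = 1_000
BYTES_PER_SENSOR_PER_SEC = 1_000          # 1 kilobyte
SECONDS_PER_YEAR = 365 * 24 * 3600
RETENTION_YEARS = 7

bytes_per_year = SENSORS * BYTES_PER_SENSOR_PER_SEC * SECONDS_PER_YEAR
total_bytes = bytes_per_year * RETENTION_YEARS

print(f"{bytes_per_year / 1e12:.1f} TB per year")                    # ~31.5 TB
print(f"{total_bytes / 1e12:.1f} TB over {RETENTION_YEARS} years")   # ~220.8 TB
```

Even this modest per-sensor rate accumulates to hundreds of terabytes over the retention window, which is exactly why stewardship policy has to be set before the data starts flowing.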
Clearly, as the lid comes off and ever-larger volumes and varieties of data pour in from more sources, IT needs to address information stewardship (IS) with some urgency. IS means putting policies in place to guide the storage, management, and protection of each byte of data from acquisition through final disposition (that is, archiving or deletion).
No matter how good you are at managing storage and matching technologies to needs, infinite storage is not possible in the confines of the data center (or anywhere else, for that matter, but the limits are a lot closer in a given data center!). So, IT has to extend its approach to IS to include making appropriate use of cloud resources to supplement or replace in-house resources. Certainly, lots of CIOs tell us they are looking hard at cloud as one possible solution to long-term storage for data that cannot simply be deleted or archived. Some are even using it or considering it for primary storage in a big data analytics environment (along with cloud compute for the actual analysis). What many of them still lack, however, is the detailed information they require to make a fully informed choice on the matter. Anyone can compare the base acquisition costs of in-house disk storage to the cost of getting block storage in a cloud. Few have complete and detailed data on the actual cost per terabyte of owning and operating storage once it is on board.
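The gap described above, between comparing base acquisition costs and knowing the true cost per terabyte of owned storage, can be sketched as a simple model. Every figure here is a hypothetical placeholder; the point is the shape of the calculation, not the numbers:

```python
# Sketch of the comparison in the text: amortized ownership cost per TB
# vs. cloud block-storage pricing. All dollar figures are hypothetical
# placeholders -- substitute your organization's actual numbers.
def in_house_cost_per_tb_year(hw_cost, usable_tb, lifespan_years,
                              annual_power_cooling, annual_admin_maint):
    """Amortized hardware plus yearly operating costs, per usable TB."""
    amortized_hw = hw_cost / lifespan_years
    yearly_total = amortized_hw + annual_power_cooling + annual_admin_maint
    return yearly_total / usable_tb

def cloud_cost_per_tb_year(price_per_gb_month):
    """Simple cloud block-storage cost; ignores egress and request fees."""
    return price_per_gb_month * 1_000 * 12   # 1 TB = 1,000 GB, 12 months

in_house = in_house_cost_per_tb_year(
    hw_cost=120_000, usable_tb=100, lifespan_years=4,
    annual_power_cooling=6_000, annual_admin_maint=20_000)
cloud = cloud_cost_per_tb_year(price_per_gb_month=0.05)

print(f"in-house: ${in_house:,.0f}/TB/year, cloud: ${cloud:,.0f}/TB/year")
```

Note how much of the in-house figure comes from operating costs rather than the hardware price tag; that is precisely the data most organizations lack when they compare only acquisition costs against a cloud price sheet.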
The bottom line is that IT is in a bind. It can clearly see the writing on the wall with respect to keeping all storage in-house, but it lacks the understanding of its own cost structures needed to know when and what to move into the cloud. Even while exploring and piloting use of cloud resources, it has to be building this crucial knowledge to be ready for what comes next.
Does your business have a big data initiative? What are your data management priorities? What resources do you need to manage your data more effectively? Tell us in the comments below!
John Burke is a Principal Research Analyst at Nemertes Research. He has written this guest post for the Networking Exchange Blog.