The Pathologies of Big Data

Abstract
Scale up your datasets enough and all your apps will come undone. What are the typical problems and where do the bottlenecks generally surface?

Comments
An account of the issues inherent to big data from an engineering perspective. Drawing on the constraints of current computer hardware and of search and retrieval procedures, the author works through back-of-the-envelope calculations and simple code examples to make his point: it is easier to get data into a database than to get it back out. The pathologies of big data are primarily those of analysis; transaction processing and storage are, by comparison, solved problems. What makes big data big, according to the author, are repeated observations over time and space. The prevailing database model of today, the relational database, ignores the ordering of rows in tables, and this becomes an issue once the data outgrows main memory. As datasets grow, algorithms that exploit the efficiency of sequential access become more important, because random access carries a high computational cost. One proposed strategy is distributed computing: distributing the analysis, not just the storage (which already is distributed), over multiple computer systems, with the organization of the data taking that distribution into account if processing is to remain efficient. Finally, the author offers a definition of big data: data whose size forces one to look beyond the tried-and-true methods prevalent at the time.
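
As a rough illustration of the kind of back-of-the-envelope arithmetic summarized above, the sketch below shows how repeated observations over time and space inflate a dataset. The sensor fleet, reporting rate, and record size are hypothetical figures chosen for the illustration, not numbers taken from the article.

```python
# Hypothetical example: repeated observations over time and space.
# All figures below are illustrative assumptions, not from the article.
sensors = 1_000_000            # one million devices, each at a fixed location
readings_per_day = 24 * 60     # one observation per minute
bytes_per_reading = 16         # timestamp + location id + measured value

daily = sensors * readings_per_day * bytes_per_reading
yearly = daily * 365

print(f"per day : {daily / 1e9:.1f} GB")    # ~23.0 GB
print(f"per year: {yearly / 1e12:.1f} TB")  # ~8.4 TB
```

A modest schema and a modest observation rate still land in the terabyte range within a year, which is the point of the author's calculations: the repetition, not the record width, is what makes the data big.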

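The cost gap between sequential and random access can be demonstrated with a small timing experiment. The sketch below is a minimal illustration, not code from the article: it sums the same values once in storage order and once through a shuffled index list. Even in RAM the random pass is measurably slower because of cache misses (Python interpreter overhead masks part of the gap); on disk the ratio is larger still.

```python
import array
import random
import time

N = 10_000_000  # ten million 8-byte integers, ~80 MB in RAM

data = array.array("q", range(N))

# Sequential pass: the access pattern memory and disks are optimized for.
start = time.perf_counter()
total = sum(data)
seq = time.perf_counter() - start

# Random pass: same work, but each access lands on an unpredictable address.
indices = list(range(N))
random.shuffle(indices)
start = time.perf_counter()
total = sum(data[i] for i in indices)
rnd = time.perf_counter() - start

print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s  slowdown: {rnd / seq:.1f}x")
```
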
Citation
Jacobs, A., "The Pathologies of Big Data," ACM Queue: Databases, Jul. 6, 2009, pp. 1–12.

Expertise Level
Introductory

Professional Field
Information Science

Link to Document