Hadoop

Abstract: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering [...]


MACOSPOL (Mapping Controversies on Science for Politics)

Abstract: In modern societies, collective life is assembled through the superposition of scientific and technical controversies. The inequities of growth, the ecological crisis, the bioethical dilemma and all other major contemporary issues occur today as tangles of human and non-human actors, politics and science, morality and technology. Because of this growing hybridization complexity, getting involved [...]


Top 10 algorithms in data mining

Abstract: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we [...]


Privacy-Preserving Data Mining

Abstract: A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question. Since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We [...]


The Pathologies of Big Data

Abstract: Scale up your datasets enough and all your apps will come undone. What are the typical problems and where do the bottlenecks generally surface? Comments: An account of issues inherent to big data from an engineering perspective. Informed of constraints in computer hardware and search procedures, the author makes some back-of-the-envelope calculations with simple [...]


Live Linked Open Sensor Database

Abstract: There are millions of sensors being deployed all over the world. Data generated by these sensors is provided in different formats and interfaces and is rarely associated with semantics that describe its meaning. The heterogeneity and lack of semantic descriptions pose a big barrier for accessing sensor data and combining it with other data [...]