19 June 2025

Research pick: Hadoop, up, and away - "Optimisation of distributed storage technology for large-scale data based on Hadoop technology"

A new approach to the storage and processing of large-scale data, called Hadoop-OptiStor, is discussed in the International Journal of Reasoning-based Intelligent Systems. The optimisation framework is built on top of the widely used open-source system Hadoop and could deliver a more responsive and efficient backbone for big data infrastructure.

Apache Hadoop is an open-source framework for scalable, distributed computing. It can store and process large datasets across computer clusters using the MapReduce programming model. It was originally developed for inexpensive, off-the-shelf machines, but is now used to support high-end systems. Importantly, Hadoop can handle hardware failures automatically, ensuring reliable operation in large-scale environments.
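The MapReduce model at the heart of Hadoop can be sketched in a few lines. The word-count job below is a local, illustrative toy that mimics the map, shuffle, and reduce stages; it does not use the actual Hadoop API, and all function names are assumptions for the sketch.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as Hadoop does
    # between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big clusters", "data moves to compute"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real cluster, the map and reduce calls run in parallel across many machines, and the framework restarts any task whose node fails, which is how Hadoop handles hardware failures automatically.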

Despite its benefits, Hadoop's traditional mechanisms, such as its approach to data replication for reliability and its task scheduling strategy, have limitations when it comes to huge data volumes. Processing delays, inefficient resource allocation, and data traffic bottlenecks are becoming more common in many real-world deployments.

Hadoop-OptiStor addresses these issues by reworking how Hadoop handles three core operations: data distribution, replica management, and task scheduling. The developers have carried out a practical deployment within a major internet company. Their new model achieved a 30% reduction in task execution time, a 20% decrease in system load, and a 40% increase in data throughput. These results reflect a more balanced use of computational resources and a notable gain in overall system efficiency.
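The paper's headline numbers come from rebalancing work across the cluster. The exact algorithms behind Hadoop-OptiStor are not reproduced here, so the sketch below only illustrates one common idea in balanced task scheduling: greedily assigning each task to the currently least-loaded node. All names and costs are invented for the example.

```python
import heapq

def schedule(tasks, node_names):
    # Min-heap of (current load, node name), so the least-loaded
    # node is always at the top.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    for task, cost in tasks:
        load, name = heapq.heappop(heap)   # pick the idlest node
        assignment[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return assignment

tasks = [("t1", 5), ("t2", 3), ("t3", 4), ("t4", 2)]
plan = schedule(tasks, ["node-a", "node-b"])
```

Keeping per-node load even in this way is one route to the kind of reduced system load and higher throughput the deployment reports, though the published framework is considerably more sophisticated.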

Additionally, Hadoop-OptiStor uses machine learning to anticipate data access patterns and optimise scheduling. It can learn from historical usage data and make proactive adjustments, further reducing delays and inefficiencies.
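As a toy stand-in for that learned predictor, one simple heuristic is to rank data blocks by recent access frequency and flag the hottest ones for extra replicas or pre-placement. The block names and cut-off below are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def hot_blocks(access_log, top_k=2):
    # Count how often each block was read recently and return the
    # top_k most frequently accessed ones.
    counts = Counter(access_log)
    return [block for block, _ in counts.most_common(top_k)]

log = ["b1", "b2", "b1", "b3", "b1", "b2"]
hot = hot_blocks(log)  # blocks worth replicating more aggressively
```

A production system would replace this frequency count with a trained model over historical usage data, as the paper describes, but the proactive step is the same: act on predicted demand before the requests arrive.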

Industries that rely on real-time data analysis, such as healthcare, finance, transportation, and energy, need infrastructure that can handle both the scale and speed of modern data streams. Hadoop-OptiStor offers a practical route to improving performance without making existing systems redundant. Its suitability for low-latency, high-throughput scenarios also makes it well-aligned with the needs of Internet of Things (IoT) environments, where devices continuously generate and respond to data.

'Optimisation of distributed storage technology for large-scale data based on Hadoop technology', Int. J. Reasoning-based Intelligent Systems, Vol. 17, No. 7, pp.11–20.
