Paper Title

Comparisons of Algorithms in Big Data Processing

Authors

Amirali Daghighi, Jim Q. Chen

Abstract

Parallel computing is the foundation of the MapReduce framework in Hadoop. Each data chunk is replicated on 3 servers to increase data availability and reduce the probability of data loss. Hence, the 3 servers that store the data chunk of a Map task on their disks are the fastest servers to process that task; they are called local servers. All other servers in the same rack as a local server are called rack-local servers. They are slower than local servers because the data chunk associated with the Map task must be fetched through the top-of-rack switch. All remaining servers are called remote servers. They are the slowest because they must fetch the data from a local server in another rack, so the data is transmitted through at least 2 top-of-rack switches and a core switch. Note that the number of switches in the data transfer path depends on the internal network structure of the data center. The First-In-First-Out (FIFO) and Hadoop Fair Scheduler (HFS) algorithms do not take the rack structure of data centers into account, so they are known to be neither heavy-traffic delay optimal nor even throughput optimal. Recent advances in scheduling for data centers that consider both the rack structure and the heterogeneity of servers have resulted in the state-of-the-art Balanced-PANDAS algorithm, which outperforms the classic MaxWeight algorithm. In both Balanced-PANDAS and MaxWeight, the processing rates of local, rack-local, and remote servers are assumed to be known. However, because traffic changes over time and the processing rates are subject to estimation errors, it is not realistic to assume that the processing rates are known. In this work, we study the robustness of the Balanced-PANDAS and MaxWeight algorithms to inaccurate estimates of the processing rates. We observe that Balanced-PANDAS is less sensitive than MaxWeight to the accuracy of the processing rates, which makes it more appealing for use in data centers.
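The three locality levels described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `classify`, the server and rack identifiers, and the replica placement are all hypothetical, and the topology is assumed to be a simple mapping from each server to a single rack.

```python
from enum import Enum


class Locality(Enum):
    LOCAL = "local"            # server stores a replica of the task's data chunk
    RACK_LOCAL = "rack-local"  # server shares a rack with a replica holder
    REMOTE = "remote"          # server is in a rack that holds no replica


def classify(server, replicas, rack_of):
    """Return the locality of `server` for a Map task whose data chunk
    is stored on the servers in `replicas`.

    `rack_of` maps each server name to its rack name (assumed topology).
    """
    if server in replicas:
        return Locality.LOCAL
    replica_racks = {rack_of[s] for s in replicas}
    if rack_of[server] in replica_racks:
        return Locality.RACK_LOCAL
    return Locality.REMOTE


# Hypothetical data center: 3 racks with 2 servers each.
rack_of = {
    "s1": "r1", "s2": "r1",
    "s3": "r2", "s4": "r2",
    "s5": "r3", "s6": "r3",
}
# Hypothetical placement of the 3 replicas of one data chunk.
replicas = {"s1", "s3", "s4"}

for server in sorted(rack_of):
    print(server, classify(server, replicas, rack_of).value)
```

Under this assumed placement, `s2` is rack-local because it shares rack `r1` with `s1`, while `s5` and `s6` are remote because rack `r3` holds no replica. A scheduler such as Balanced-PANDAS or MaxWeight would then use a different processing rate for each of the three levels; those rates are the quantities whose estimation error the paper studies.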
