Paper Title

Mapping Datasets to Object Storage System

Authors

Xiaowei Chu, Jeff LeFevre, Aldrin Montana, Dana Robinson, Quincey Koziol, Peter Alvaro, Carlos Maltzahn

Abstract

Access libraries such as ROOT and HDF5 allow users to interact with datasets using high-level abstractions, such as coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage system interfaces and are generally unable to fully benefit from modern fast storage devices. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph's extensible object model, avoiding re-implementation or even modification of these access libraries as much as possible. These programmable storage extensions, coupled with our distributed dataset mapping techniques, enable: 1) offloading access library operations to storage system servers, 2) independent evolution of access libraries and storage systems, and 3) full use of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities for server-local optimizations tailored to how each storage server is built. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations.
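To make the idea of mapping a slicing operation onto objects more concrete, the following is a minimal sketch in Python. It is not the authors' implementation and uses no Ceph, ROOT, or HDF5 API. It assumes, purely for illustration, that a dataset is partitioned into fixed-size chunks, that each chunk is stored as one object, and that objects follow a hypothetical `<dataset>.chunk.<i>.<j>` naming scheme. The function name `slice_to_objects` and all of its parameters are likewise hypothetical.

```python
from itertools import product


def slice_to_objects(dataset, shape, chunk, start, stop):
    """Map the half-open slice [start, stop) to the chunk objects it touches.

    Assumptions (illustrative only): fixed-size chunks, one object per
    chunk, and a made-up object naming scheme. Returns a list of
    (object_name, local_ranges) pairs, where local_ranges holds one
    (lo, hi) pair per dimension, relative to the start of that chunk.
    """
    if not (len(shape) == len(chunk) == len(start) == len(stop)):
        raise ValueError("all arguments must have the same rank")
    for lo, hi, extent in zip(start, stop, shape):
        if not (0 <= lo < hi <= extent):
            raise ValueError("slice must be non-empty and inside the dataset")

    # Per dimension, the range of chunk indices that the slice overlaps.
    chunk_ranges = [
        range(lo // c, (hi - 1) // c + 1)
        for lo, hi, c in zip(start, stop, chunk)
    ]

    result = []
    for idx in product(*chunk_ranges):
        local = []
        for i, lo, hi, c in zip(idx, start, stop, chunk):
            base = i * c  # global coordinate where this chunk begins
            local.append((max(lo, base) - base, min(hi, base + c) - base))
        # Hypothetical naming scheme, not a Ceph convention.
        name = f"{dataset}.chunk." + ".".join(str(i) for i in idx)
        result.append((name, tuple(local)))
    return result


if __name__ == "__main__":
    # Rows 40..59 and columns 0..9 of a 100 x 100 dataset in 50 x 50 chunks.
    for name, local in slice_to_objects(
        "events", (100, 100), (50, 50), (40, 0), (60, 10)
    ):
        print(name, local)
```

In an offloading design of the kind the abstract describes, each `(object_name, local_ranges)` pair would correspond to a sub-request that could be evaluated on the server holding that object, so that only the selected cells travel back to the client. How such sub-requests are dispatched and recombined is outside the scope of this sketch.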
