Paper Title

Graceful Forgetting II. Data as a Process

Author

de Cheveigné, Alain

Abstract

Data are rapidly growing in size and importance for society, a trend motivated by their enabling power. The accumulation of new data, sustained by progress in technology, leads to a boundless expansion of stored data, in some cases with an exponential increase in the accrual rate itself. Massive data are hard to process, transmit, store, and exploit, and it is particularly hard to keep abreast of the data store as a whole. This paper distinguishes three phases in the life of data: acquisition, curation, and exploitation. Each involves a distinct process, that may be separated from the others in time, with a different set of priorities. The function of the second phase, curation, is to maximize the future value of the data given limited storage. I argue that this requires that (a) the data take the form of summary statistics and (b) these statistics follow an endless process of rescaling. The summary may be more compact than the original data, but its data structure is more complex and it requires an on-going computational process that is much more sophisticated than mere storage. Rescaling results in dimensionality reduction that may be beneficial for learning, but that must be carefully controlled to preserve relevance. Rescaling may be tuned based on feedback from usage, with the proviso that our memory of the past serves the future, the needs of which are not fully known.
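
The two requirements named in the abstract, summary statistics and an ongoing process of rescaling, can be illustrated with a minimal Python sketch. This is not the paper's algorithm. The names `Summary` and `RescalingStore`, the choice of statistics (count, sum, sum of squares, minimum, maximum), and the policy of merging adjacent pairs of the oldest blocks are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Summary:
    """Summary statistics for one block of samples (hypothetical structure)."""
    n: int        # number of raw samples covered by this block
    total: float  # sum of the samples
    sq: float     # sum of the squared samples
    lo: float     # minimum sample
    hi: float     # maximum sample

    @classmethod
    def of(cls, x: float) -> "Summary":
        """Summarize a single raw sample."""
        return cls(1, x, x * x, x, x)

    def merge(self, other: "Summary") -> "Summary":
        """Combine two blocks; temporal resolution is lost, the statistics are not."""
        return Summary(
            self.n + other.n,
            self.total + other.total,
            self.sq + other.sq,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self) -> float:
        return self.total / self.n


class RescalingStore:
    """Keeps at most `budget` blocks, coarsening the oldest ones as data arrive."""

    def __init__(self, budget: int) -> None:
        if budget < 2:
            raise ValueError("budget must be at least 2")
        self.budget = budget
        self.blocks: List[Summary] = []  # ordered oldest first

    def append(self, x: float) -> None:
        self.blocks.append(Summary.of(x))
        if len(self.blocks) > self.budget:
            self._rescale()

    def _rescale(self) -> None:
        # Assumed policy: merge adjacent pairs in the older half of the store,
        # so older data are kept at progressively coarser resolution.
        half = len(self.blocks) // 2
        old, recent = self.blocks[:half], self.blocks[half:]
        merged = [
            old[i].merge(old[i + 1]) if i + 1 < len(old) else old[i]
            for i in range(0, len(old), 2)
        ]
        self.blocks = merged + recent
```

In this sketch storage stays bounded no matter how many samples arrive, while the count, sum, and extrema over the whole history remain exact. What is given up is resolution in the older blocks, which is the kind of controlled loss the abstract describes. A real curation process would presumably choose both the statistics and the merge policy according to expected future use.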
