Paper title
Foundations of a live data exploration environment
Paper authors
Paper abstract
Context: A growing amount of code is written to explore and analyze data, often by data analysts who do not have a traditional programming background, such as journalists. Inquiry: The way such data analysts write code differs from the way software engineers do. They use few abstractions, work interactively, and rely heavily on external libraries. We aim to capture this way of working and build a programming environment that makes data exploration easier by providing instant live feedback. Approach: We combine theoretical and applied approaches. We present the data exploration calculus, a formal language that captures the structure of code written by data analysts. We then implement a data exploration environment that evaluates code instantly during editing and shows previews of the results. Knowledge: We formally describe an algorithm for providing instant previews for the data exploration calculus that allows the user to modify code in an unrestricted way in a text editor. Supporting interactive editing is tricky, as any edit can change the structure of the code and fully recomputing the output would be too expensive. Grounding: We prove that our algorithm is correct and that it reuses previous results when updating previews after a number of common code edit operations. We also illustrate the practicality of our approach with an empirical evaluation and a case study. Importance: As data analysis becomes an ever more important use of programming, research on programming languages and tools needs to consider new kinds of programming workflows appropriate for those domains and conceive new kinds of tools that can support them. The present paper is one step in this important direction.
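The abstract does not spell out the preview algorithm, so the following is only a minimal sketch of the general idea of reusing previous results across edits, not the paper's method. It assumes expressions are nested tuples and caches each result under the structure of the expression that produced it. The `PreviewCache` class, the tuple encoding, and the three "library" functions are all hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for an external library; these names are illustrative only.
LIBRARY: Dict[str, Callable[..., Any]] = {
    "range": lambda n: list(range(n)),
    "double": lambda xs: [2 * x for x in xs],
    "total": lambda xs: sum(xs),
}


class PreviewCache:
    """Caches results keyed by the structure of the expression that produced them."""

    def __init__(self) -> None:
        self.results: Dict[tuple, Any] = {}
        self.evaluations = 0  # number of library calls actually executed

    def evaluate(self, expr: tuple) -> Any:
        # Assumed encoding: ("const", value) or ("call", name, arg_expr, ...).
        # Constants must be hashable so the whole expression can be a dict key.
        if expr in self.results:
            return self.results[expr]  # structurally identical code: reuse the result
        if expr[0] == "const":
            value = expr[1]
        elif expr[0] == "call":
            args = [self.evaluate(arg) for arg in expr[2:]]
            value = LIBRARY[expr[1]](*args)
            self.evaluations += 1
        else:
            raise ValueError(f"unknown expression kind: {expr[0]!r}")
        self.results[expr] = value
        return value


cache = PreviewCache()

# Code before the edit: total(double(range(4)))
before = ("call", "total", ("call", "double", ("call", "range", ("const", 4))))
first = cache.evaluate(before)

# Code after the edit: the user deletes the "double" step, leaving total(range(4))
after = ("call", "total", ("call", "range", ("const", 4)))
second = cache.evaluate(after)

print(first, second, cache.evaluations)
```

In this sketch the first evaluation runs three library calls. After the edit only the new `total` call runs, because the unchanged `range(4)` subexpression is found in the cache. A real environment would also need to handle variable bindings and parsing of partially edited text, which this sketch leaves out.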