Paper Title
Occam's Razor for Big Data? On Detecting Quality in Large Unstructured Datasets
Paper Authors
Paper Abstract
Detecting quality in large unstructured datasets requires capacities far beyond the limits of human perception and communicability. As a result, data science is trending towards increasingly complex analytic solutions to cope with this problem. This trend towards analytic complexity poses a severe challenge to the principle of parsimony, or Occam's razor, in science. This review article combines insights from physics, computational science, data engineering, and cognitive science to review the specific properties of big data. Problems in detecting data quality without abandoning the principle of parsimony are then highlighted on the basis of specific examples. Computational building block approaches to data clustering can help to process large unstructured datasets in minimal computation time, and relatively simple unsupervised machine learning algorithms can rapidly and parsimoniously extract meaning from large sets of unstructured image or video data. The article then reviews why we still largely lack the expertise to exploit big data wisely, whether to extract relevant information for specific tasks, recognize patterns, generate new information, or store and further process large amounts of sensor data; examples illustrating why subjective views and pragmatic methods are needed to analyze big data contents are presented. The review concludes with a discussion of how cultural differences between East and West are likely to affect the course of big data analytics and the development of increasingly autonomous artificial intelligence aimed at coping with the big data deluge in the near future.