Paper Title

Leveraging Data Preparation, HBase NoSQL Storage, and HiveQL Querying for COVID-19 Big Data Analytics Projects

Author

Baïna, Karim

Abstract

Epidemiologists, scientists, statisticians, historians, data engineers, and data scientists are working to find descriptive models and theories that explain the spread of COVID-19, or to build predictive analytics models that learn the apex of the evolution curves for COVID-19 confirmed cases, recovered cases, and deaths. In the CRISP-DM life cycle, the data preparation phase alone consumes 75% of the time, placing considerable pressure and stress on the scientists and data scientists who build machine learning models. This paper aims to help reduce data preparation effort by presenting detailed schema designs and technical data preparation scripts for formatting and storing Johns Hopkins University COVID-19 daily data in the HBase NoSQL data store, and for enabling COVID-19 data querying with HiveQL in a relational, SQL-like style.
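To make the data preparation step concrete, the following is a minimal Python sketch of reshaping a wide, one-column-per-day CSV into one cell per region and day, the kind of layout a row-key store such as HBase expects. It is not the paper's schema or script: the row key format (`region#YYYYMMDD`), the column family `cases`, the qualifier `confirmed`, the function name `to_hbase_cells`, and the sample rows are all assumptions made for illustration. The header layout follows the publicly distributed Johns Hopkins time-series files.

```python
import csv
import io
from datetime import datetime

# Illustrative sample in a wide layout: four fixed columns, then one column per day.
# The numbers are placeholders, not real measurements.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Morocco,31.7917,-7.0926,0,0
Hubei,China,30.9756,112.2707,444,444
"""

FIXED_COLUMNS = {"Province/State", "Country/Region", "Lat", "Long"}


def to_hbase_cells(csv_text, family="cases", qualifier="confirmed"):
    """Yield (row_key, column, value) triples, one per region and day.

    The row key and column names are hypothetical design choices.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Build a region identifier: country, plus province when present.
        region = row["Country/Region"]
        if row["Province/State"]:
            region += "/" + row["Province/State"]
        for column_name, value in row.items():
            if column_name in FIXED_COLUMNS:
                continue
            # Normalise dates such as 1/22/20 to a sortable form, 20200122.
            day = datetime.strptime(column_name, "%m/%d/%y").strftime("%Y%m%d")
            yield (f"{region}#{day}", f"{family}:{qualifier}", value)


cells = list(to_hbase_cells(SAMPLE))
for cell in cells:
    print(cell)
```

Placing the date after the region in the row key keeps each region's days adjacent and in chronological order under lexicographic sorting, which suits range scans over one region. A different key order would favour scans over one day across all regions.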
