COSIME: FeFET based Associative Memory for In-Memory Cosine Similarity Search

Paper Title

COSIME: FeFET based Associative Memory for In-Memory Cosine Similarity Search

Authors

Liu, Che-Kai, Chen, Haobang, Imani, Mohsen, Ni, Kai, Kazemi, Arman, Laguna, Ann Franchesca, Niemier, Michael, Hu, Xiaobo Sharon, Zhao, Liang, Zhuo, Cheng, Yin, Xunzhao

Abstract

In a number of machine learning models, an input query is searched across the trained class vectors to find the closest feature class vector in the cosine similarity metric. However, computing cosine similarities between vectors on von Neumann machines involves a large number of multiplications, Euclidean normalizations, and division operations, incurring heavy hardware energy and latency overheads. Moreover, due to the memory wall problem present in conventional architectures, frequent cosine similarity-based searches (CSSs) over the class vectors require a lot of data movement, limiting the throughput and efficiency of the system. To overcome these challenges, this paper introduces COSIME, a general in-memory associative memory (AM) engine based on the ferroelectric FET (FeFET) device for efficient CSS. By leveraging the one-transistor AND gate function of FeFET devices, a current-based translinear analog circuit, and winner-take-all (WTA) circuitry, COSIME can realize parallel in-memory CSS across all the entries in a memory block and output the closest word to the input query in the cosine similarity metric. Evaluation results at the array level suggest that the proposed COSIME design achieves 333X and 90.5X latency and energy improvements, respectively, and realizes better classification accuracy when compared with an AM design implementing approximated CSS. The proposed in-memory computing fabric is evaluated for a hyperdimensional computing (HDC) problem, showing that COSIME can achieve on average 47.1X and 98.5X speedup and energy efficiency improvements, respectively, compared with a GPU implementation.
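For readers unfamiliar with the operation the abstract refers to, the following is a minimal software sketch of a cosine similarity search over stored class vectors. It is a plain reference computation and not the COSIME circuit itself. The function names `cosine_similarity` and `nearest_class`, the binary example vectors, and the choice to return 0.0 for an all-zero vector are all assumptions made for illustration. The final `max` step plays the role that the abstract assigns to the winner-take-all stage: it picks the entry with the highest score.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having zero similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_class(query, class_vectors):
    """Return (index of best match, list of all similarity scores)."""
    scores = [cosine_similarity(query, c) for c in class_vectors]
    # Software analogue of a winner-take-all step: keep the highest score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical binary class vectors and query, chosen only for illustration.
class_vectors = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
query = [1, 0, 1, 0]

best_index, scores = nearest_class(query, class_vectors)
print(best_index, [round(s, 3) for s in scores])
```

Each score in this sketch costs one dot product, two norms, and a division, and the loop visits every stored vector in turn. That per-entry arithmetic and the data movement it implies are the overheads the abstract attributes to von Neumann machines.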
