论文标题
关于内核化和计算无根协议森林的思考
Reflections on kernelizing and computing unrooted agreement forests
论文作者
论文摘要
系统发育树是用于建模物种演化的叶片标记树。在这里,我们探讨了内核化(即减少数据)对计算两条未根的二元系统发育树之间TBR距离的NP硬性问题的实际影响。在文献中,这个问题是最大的一致性森林问题,在文献中是更加闻名的,在该问题中,将两棵树划分为最少数量的常见,非重叠的子树。我们已经实施了两个众所周知的还原规则,即缩小和链条减少,以及五个最近的五个理论上更强的减少规则,并比较了有和没有更强大规则的减少。我们发现新规则产生了较小的实例,因此具有明显的实际附加值。在许多情况下,它们还会导致TBR距离以受控方式减小。接下来,我们将所达到的减少与已知的最差理论界限分别比较了15k-9和11k-9的最差理论界限,这是两个还原树的叶子的数量,其中k是TBR距离,在这两种情况下都观察到了实践的降低。作为我们实验框架的副产品,我们获得了对TBR距离实际计算的许多新见解。例如,我们发现,通过随机对叶标签的某些精心构造的分区进行随机采样,可以有效地获得TBR距离上非常强的下限,并确定似乎特别具有挑战性的实例。还原规则已在我们的新求解器Tubro中实施,该求解器将内核化与整数线性编程(ILP)方法相结合。 Tubro还结合了许多其他功能,例如群集降低和实用的上限启发式,它可以利用从还原规则的正确性证明来简化ILP。
Phylogenetic trees are leaf-labelled trees used to model the evolution of species. Here we explore the practical impact of kernelization (i.e. data reduction) on the NP-hard problem of computing the TBR distance between two unrooted binary phylogenetic trees. This problem is better-known in the literature as the maximum agreement forest problem, where the goal is to partition the two trees into a minimum number of common, non-overlapping subtrees. We have implemented two well-known reduction rules, the subtree and chain reduction, and five more recent, theoretically stronger reduction rules, and compare the reduction achieved with and without the stronger rules. We find that the new rules yield smaller reduced instances and thus have clear practical added value. In many cases they also cause the TBR distance to decrease in a controlled fashion. Next, we compare the achieved reduction to the known worst-case theoretical bounds of 15k-9 and 11k-9 respectively, on the number of leaves of the two reduced trees, where k is the TBR distance, observing in both cases a far larger reduction in practice. As a by-product of our experimental framework we obtain a number of new insights into the actual computation of TBR distance. We find, for example, that very strong lower bounds on TBR distance can be obtained efficiently by randomly sampling certain carefully constructed partitions of the leaf labels, and identify instances which seem particularly challenging to solve exactly. The reduction rules have been implemented within our new solver Tubro which combines kernelization with an Integer Linear Programming (ILP) approach. Tubro also incorporates a number of additional features, such as a cluster reduction and a practical upper-bounding heuristic, and it can leverage combinatorial insights emerging from the proofs of correctness of the reduction rules to simplify the ILP.