SVM Computation in a Distributed Database

The so-called support vector machines (SVM) have become a very important tool for the classi cation problem. Computing an SVM amounts to solving a certain optimization problem. The SVM optimization problem is posed with respect to a set of labeled examples given explicitly. In real-life databases, the data is often distributed over various tables. Even if the data is given in a single table, there are often external sources of data that can improve the accuracy of a classifier if incorporated in the classifier. For example, a given table providing attributes of individuals that have to be classified may include the town where the individual resides but no attributes of that town. An external source may provide various attributes of towns or transaction that took place in various towns, which may be relevant to the classification of individuals. Thus, it is desirable to build a classifier that takes some of these attributes or transactions into account. This hypothesis calls for joining the tables on the town column.

To apply a standard SVM algorithm when attributes are distributed over tables, one has to first join the tables. However, joining tables explicitly may not be possible due to the size of the product. Thus, the question is whether it is possible to obtain an SVM for the join without generating the table explicitly. Here, we show how this can be done for the join of two tables. In general, the size of the join of two tables can be quadratic in the terms of the sizes of the joined tables.

By: Nimrod Megiddo

Published in: RJ10524 in 2014

rj10524.pdf

Questions about this service can be mailed to reports@us.ibm.com .