Ensemble Learning with Annealing (QBoost)

Ensemble Learning with Annealing (QBoost)#

QBoost is one of the ensemble learning with quantum annealing(QA). Ensemble learning is a method that prepares a large number of weak predictors and combines the predictions of each of the predictors to obtain a final accurate prediction result.

Overview and Principle#

QBoost is an algorithm whose goal is to accurately identify the nature of the input signal $\boldsymbol{x}$ . Let us consider the problem of which of the two values $\pm 1$ should be assigned to the input signal. As an example, imagine a task where $\boldsymbol{x}$ represents image data and we want to identify whether what we see in the image is a dog or a cat. In ensemble learning, the goal is to achieve better prediction accuracy (boosting) by using multiple predictors. Here, we have many predictors that do not perform well (weak predictors). Note that by poorly performing, it means that they often do not produce correct outputs for their inputs. Let the output of these predictors be $c_i (\boldsymbol{x}) \in \{ -1, 1\} \ (i=0, 1, \dots, N-1)$ . The basic idea is that by summing the outputs of several weak predictors, we can make better predictions. This can be expressed as:

(1)#

C(\boldsymbol{x}) = \mathrm{sgn} \left( \sum_{i=0}^{N-1} w_i c_i (\boldsymbol{x}) \right)

$w_i \in \{0, 1\}$ indicates whether the $i$ th predictor is used or not. Let us identify which predictor gives better performance with as few weak predictors as possible. We will use supervised learning to find the optimal $\{w_i\}$ pairs. We have a large number of $(\boldsymbol{x}^{(d)}, y^{(d)}) \ (d= 0, 1, \dots, D-1)$ of supervised data ( $D \gg 1$ ), and we adjust $\{w_i\}$ to reproduce them as closely as possible. In other words, we aim to minimize the following Hamiltonian for $\{w_i\}$ :

(2)#

H(\boldsymbol{w}) = \sum_{d=0}^{D-1} \left( \frac{1}{N} \sum_{i=0}^{N-1} w_i c_i (\boldsymbol{x}^{(d)}) - y^{(d)}\right)^2 + \lambda \sum_{i=0}^{N-1} w_i

Through this minimization of the Hamiltonian, the difference from the supervised data $y^{(d)}$ is made as small as possible. If we use the right-hand side of equation (1) as it is, it cannot come down to the Ising model because it is not a quadratic form of $w_i$ due to the sign function. Therefore, the problem is made to minimize the square of the difference between the $1/N$ times the sign function argument $\sum_i w_i c_i$ and the supervised data $y^{(d)}$ . The coefficient of $1/N$ is to adjust the maximum value of $\sum_i w_i c_i(\boldsymbol{x})$ to be $N$ so that the difference between $y^{(d)}= \pm 1$ is not too large. The term with $\lambda (>0)$ applied represents the regularization term to efficiently construct a relatively small number of weak predictors without setting too many $w_i$ to 1.

Modeling by JijModeling#

Variables#

Let us define the variables used in equation (2) as follows:

import jijmodeling as jm

# set problem
problem = jm.Problem('QBoost')
# define variables
c = jm.Placeholder('c', dim=2)
N = c.shape[0].set_latex('N')
D = c.shape[1].set_latex('D')
w = jm.Binary('w', shape=(N))
y = jm.Placeholder('y', shape=(D))
lam = jm.Placeholder('lam')
i = jm.Element('i', (0, N))
d = jm.Element('d', (0, D))

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import jijmodeling as jm
      3 # set problem
      4 problem = jm.Problem('QBoost')

ModuleNotFoundError: No module named 'jijmodeling'

c = jm.Placeholder('c', dim=2) defines $c$ in equation (2). From the size of that list, we define the number of weak predictors $N$ and the number of supervised data $D$ as N, D, respectively. Using them, we define the binary variable w and the binary value y of the supervised data to be used for optimization. We define $\lambda$ in equation (2) as lam and the subscripts i, d, used in equation (2).

Objective Function#

Let us implement equation (2). We start with the first term.

# set objective function 1: minimize the difference
sum_i = jm.Sum(i, w[i]*c[i, d]) / N
problem += jm.Sum(d, (sum_i-y[d])**2)

We write $\sum_{i=0}^{N-1} w_i c_i / N$ as sum_i. We also implement the second term as an objective function. Note that w[:] implements $\sum_i w_i$ in a concise way.

# set objective function 2: minimize the number of weak classifiers
problem += lam * w[:]
problem

\begin{alignat*}{4}\text{Problem} & \text{: QBoost} \\\min & \quad \sum_{ d = 0 }^{ D - 1 } \left( \frac{ \sum_{ i = 0 }^{ N - 1 } w_{i} \cdot c_{i,d} }{ N } - y_{d} \right) ^ { 2 } + lam \cdot \sum_{ \bar{i}_{0} = 0 }^{ N - 1 } w_{\bar{i}_{0}} \\& w_{i_{0}} \in \{0, 1\}\end{alignat*}

Instance#

Let us set up the task to be executed. We use the decision stump (decision stock: one-layer decision tree) from scikit-learn as the weak predictor. We also use the scikit-learn cancer identification dataset.

import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier as DTC

def prediction_from_train(N, X_train, y_train, X_test):
    # set the number of ensembles to be taken out for one sample in the bootstrap sampling
    sample_train = 40
    # set model
    models = [DTC(splitter="random", max_depth=1) for i in range(N)]
    for model in models:
        # extract randomly
        train_idx = np.random.choice(np.arange(X_train.shape[0]), sample_train)
        # make decision tree with variables
        model.fit(X=X_train[train_idx], y=y_train[train_idx])
    y_pred_list_train = []
    for model in models:
        # execute prediction with model
        y_pred_list_train.append(model.predict(X_train))
    y_pred_list_train = np.asanyarray(y_pred_list_train)
    y_pred_list_test = []
    for model in models:
        # execute with test data
        y_pred_list_test.append(model.predict(X_test))
    y_pred_list_test = np.array(y_pred_list_test)
    return y_pred_list_train, y_pred_list_test

# load data
cancer_data = datasets.load_breast_cancer()
# set the number of train data
num_train = 200
# add noise to feature
noisy_data = np.concatenate((cancer_data.data, np.random.rand(cancer_data.data.shape[0], 30)), axis=1)
# convert (0, 1) label to (-1, 1)
labels = (cancer_data.target-0.5) * 2
# divide the dataset to train and test
X_train = noisy_data[:num_train, :]
X_test = noisy_data[num_train:, :]
y_train = labels[:num_train]
y_test = labels[num_train:]
# set the number of classifiers
N = 20
# predict from train data using decision tree classifier
y_pred_list_train, y_pred_list_test = prediction_from_train(N, X_train, y_train, X_test)
# set lambda (coefficient of 2nd term)
lam = 3.0
instance_data = {'y': y_train, 'c': y_pred_list_train, 'lam': lam, 'y_train': y_train, 'y_test': y_test, 'y_pred_list_test': y_pred_list_test}

For demonstration purposes, we use the data with the addition of noisy features. The prediction_from_train function is used to create weak predictors and the output $c_i (\boldsymbol{x}^{(d)})$ from those predictors. Here the number of weak predictors is 20 and the number of supervised data is 200. The value of $\lambda$ in equation (2) is set to 3.0.

Undetermined Multiplier#

Since there are no constraints, the dictionary to set the undefined multiplier is left empty.

# set multipliers
multipliers = {}

Conversion to PyQUBO by JijModeling Transpiler#

JijModeling has executed all the implementations so far. By converting this to PyQUBO, it is possible to perform combinatorial optimization calculations using OpenJij and other solvers.

from jijmodeling.transpiler.pyqubo import to_pyqubo

# convert to pyqubo
pyq_model, pyq_cache = to_pyqubo(problem, instance_data, {})
qubo, bias = pyq_model.compile().to_qubo(feed_dict=multipliers)

The PyQUBO model is created by to_pyqubo with the problem created by JijModeling and the instance_data we set to a value as arguments. Next, we compile it into a QUBO model that can be computed by OpenJij or other solver.

Optimization by OpenJij#

This time, we will use OpenJij’s simulated annealing to solve the optimization problem. We set the SASampler and input the qubo of the QUBO model into that sampler to get the result of the calculation.

import openjij as oj

# set sampler
sampler = oj.SASampler()
# solve problem
response = sampler.sample_qubo(qubo, num_reads=5)

Decoding and Displaying the Solution#

Decode the returned results to facilitate analysis.

# decode solution
result = pyq_cache.decode(response)

From the weak predictors selected by simulated annealing, let us look at the classification accuracy of the test data.

from sklearn import metrics

y_pred_list_train = instance_data['c']
y_pred_list_test = instance_data['y_pred_list_test']
y_train = instance_data['y_train']
y_test = instance_data['y_test']
accs_train_oj = []
accs_test_oj = []
for solution in result.record.solution['w']:
    idx_clf_oj = solution[0][0]  # [0, 1, 4, 5, 6, 7, 8, 11, 13, 14, 15, 16, 17, 18, 19]
    y_pred_train_oj = np.sign(np.sum(y_pred_list_train[idx_clf_oj, :], axis=0))
    y_pred_test_oj = np.sign(np.sum(y_pred_list_test[idx_clf_oj, :], axis=0))
    acc_train_oj = metrics.accuracy_score(y_true=y_train, y_pred=y_pred_train_oj)
    acc_test_oj = metrics.accuracy_score(y_true=y_test, y_pred=y_pred_test_oj)
    accs_train_oj.append(acc_train_oj)
    accs_test_oj.append(acc_test_oj)    
print('Accuracy of QBoost is {}'.format(max(accs_test_oj)))

Accuracy of QBoost is 0.9024390243902439

We also check the average accuracy for the all 20 weak predictors.

accs = []
for i in range(20):
    y_pred_test = np.sign(np.sum(y_pred_list_test[[i], :], axis=0))
    acc = metrics.accuracy_score(y_true=y_test, y_pred=y_pred_test)
    accs.append(acc)
accs_ave = sum(accs) / 20
print('Average accuracy of all 20 weak predictors is {}'.format(accs_ave))

Average accuracy of all 20 weak predictors is 0.8269647696476966

The classification accuracy on the test data is better than using all of the weak predictors.