Quantum is a Latin word and in physics it means the smallest possible discrete unit of any physical entity such as energy or mass. Quantic particles exhibit wave-particle duality and quantum theory deals with finding the probability of a quantum particle at a given point in space, while classical particles can be found at an exact location. Classical computers use bits that are either 0 or 1. But quantum computers use quantum bits called qubits. One of the peculiar properties of a qubit is that it can be in a superposition of both states and . This ability to store quantum states simultaneously is the base for quantum parallelism and offers exponentially compact representation of data. In addition to superposition, entanglement and interference of quantum states are also resources that provide the speedup for quantum computing as classical computers can do none like these.
Machine learning has found wide applications in many different areas in life today and often outperforms humans in accuracy in solving complex problems. Google’s AlphaGo is a very good example that attracts the world’s attention and generates unprecedented excitement to the AI research and development and brings hope that AI one day could really improve the quality of human life. However, certain application domains remain out of reach due to the difficulty of the problems and limitations of classical computers. In recent years, quantum machine learning has become a matter of interest because of its potential as a possible solution to these unresolvable challenges through reducing the computational complexity and improving generalization performance.
In a recent review [ 1 ], different scenarios dealing with classical and quantum machine learning are discussed with different possible combinations: using classical machine learning to analyze classical data, using quantum machine learning to analyze both classical and quantum data, and using classical machine learning to analyze quantum data. Notable examples of these can be seen: quantum control using classical reinforcement learning, learning unitaries with optimal strategies, making quantum computers more reliable with help of machine learning and speedup in various learning algorithms. As a result, quantum computing and machine learning can co-evolve and will enable technologies for each other.
One of the challenges mentioned in this paper [ 1 ] is the difficulty in extracting the learned information without destroying the information itself, as the laws in quantum mechanics, such as the no-cloning theorem, make this task extremely hard. These difficulties motivate us to think more broadly what it means for a quantum system to learn about its surroundings. Besides using quantum algorithms for data analysis, quantum machine learning can also investigate more fundamental questions about the concept of learning from the perspective of quantum theory.
K-nearest neighbor (KNN) algorithm is a simple and intuitive supervised machine learning algorithm, whose learning model is made of storing the training dataset and can be used for either classification or regression. In general, the aim of a supervised learning algorithm is to infer the input-output relation via learning from training data. To make a prediction for a new test data point, the algorithm uses the K closest data points (K nearest neighbors) in the training dataset using some predefined distance. Since this algorithm uses only the training data points in the prediction and does not need any iterations of going through the training data to build a model before the prediction, it is called instance-based learning or lazy learning.
It is easy to see that the performance of this algorithm depends critically on the choice of K. A small value of K could allow the noise in the data to have a higher influence while a large value requires more CPU time. Most data analysis practitioners suggest selecting where N is the number of training data points.
Euclidean distance is a common choice for continuous data points and Hamming distance is a good choice for discrete data points. It is obvious that the selection of a distance metric plays a critical role in the performance of this algorithm, which determines how to find the nearest neighbors. The second question is how to use the nearest neighbors to make a prediction after they have been identified. For this end, K is typically chosen as an odd number in case majority votes are employed for final prediction. Therefore, it makes decision based on the entire training data set or in the best case a subset of them.
One extension to the majority votes is not to give 1 vote to all the neighbors. A common strategy is to assign a weight to each neighbor based on its distance. For instance under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be queried, implying that closer neighbors have a higher vote than the farther ones. This approach has avoided the hard task to choose K as the value of K is implicitly hidden in the weights [ 2 ].
Further, the famous “curse of dimensionality” can also be seen in this algorithm. Imagine we have 1000 training data points uniformly distributed in the unit hypercube and our test data point is at the origin. In 1-dimensional space, it takes about a distance of 3/1000 = 0.003 on average to get 3 nearest neighbors. In 2-dimensional space, it takes about a distance of and in n-dimensional space, it takes about a distance of in each direction as the training data points become sparsely distributed when the dimension of the space increases.
Another feature of this algorithm is non-parametric since it does not make any assumptions about the distribution of data in contrast some other algorithms may assume a Gaussian distribution of the given data for example. This makes this algorithm more robust and versatile as in the real world most of the practical data does not follow the typical theoretical assumptions. For example, beyond classification and regression, it can be used in density estimation, since being non parametric allows it to do estimation for arbitrary distributions. Because it is lazy, this algorithm does not use the training data points to do any generalization but only memorizing the training instances. (Figure 1 & Figure 2)
The two primary and prominent costs of nearest neighbor algorithm are: storage of all training data points and CPU time to compute the distances of the query data point to all training data points, which are huge challenge in big data. Quantum nearest neighbor algorithms offer solutions to these two issues quantum mechanically and beautifully. Due to superposition, quantum algorithms can store all training data points of exponentially large size as a linear size. And because of entanglement and interference, they can compute the distances at once.
As quantum machine learning is an emerging research field, it is constructive and enlightening to investigate the actual work of these new algorithms. Along in this direction, we have finished a few papers for this end [ 3 - 5 ]. In [ 3 ], we create some artificial datasets to visualize the working of a distance-based quantum classifier and extend their quantum circuit from binary classification to a multiclass classification. In [ 4 ], the training of an AI agent for its decision making is compared on an ion trap quantum system and on a superconducting quantum system, and discover that latter is more accurate than the former and tends to underestimate the values for the agent to make a decision when compared with the ion trap system. In [ 5 ], a quantum neuron is created with a nonlinear activation function like a sigmoid function
Figure 1. A diagram to show the work of KNN. One test data point (a star in the center) should be classified either to the class of circles or to the class of triangles. If K = 3 (solid line circle) it is assigned to the class of triangles because there are 2 triangles and only 1 circle inside the inner circle. If we increase the value of K, the prediction outcome may change.
Figure 2. A diagram to show the work of NN in general. One test data point (a star in the center) should be classified either to the class of circles or to the class of triangles. The distance from this test data point to all the training data points are computed and its class membership is determined with the closest data points. As in this Figure, the test point is predicted to be in class of circles.
which is a classical activation function commonly used in classical neurons.
Following the same principle as in [ 3 - 5 ], this current study chooses two related quantum nearest neighbor algorithms [ 2 , 6 ] and uses a simple dataset to demonstrate how they work, reveal their quantum nature, and compare their performances in detail using IBM’s quantum simulator [ 7 ].
We outline the two algorithms from [ 2 , 6 ] used in the current study. Each training data point is represented as n features and its class label , where is an index of the point in a training dataset and . Each training data point is expressed as a quantum state of
. The test data point is represented as and the whole training data points are super-positioned as .
2.1. One Quantum Nearest Neighbor Algorithm from [ 2 ]
This algorithm is the quantum version of the classical weighted nearest neighbor algorithms and is inspired from the quantum associative memory in [ 8 ]. The idea is to super position all the training data points into one quantum state and then compute the Hamming distance of each training point to the test point into the amplitude of each training point in superposition. After this, measuring the class qubit reveals the appropriate class with highest probability. Here is an outline of the algorithm.
Step one: superposition all training points into one quantum state
Step two: add one ancilla qubit to this state
Step three: apply the Hadamard gate to the ancilla qubit
Step four: to get the Hamming distance, apply X gate the state and CNOT to with is the control and is the target
Step five: apply the unitary operator
to add the Hamming distance of each training point . The state after this operation is
where is the Hamming distance between and .
Step six: apply another Hadamard gate on the ancilla qubit
Step seven: from step six, it is clear that if the test point is close to the training points then we have a higher probability to measure the ancilla qubit in the state , otherwise we should see it in the state . The probability of getting is
At this point of time, there are two paths to take. The first way is to take a “K à all” approach by assigning the class of training points that are on average closer to the test point. The second is measure the class qubit and retrieve the neighbors with a probability weighted by their distance and choose a class from this pool of training points.
Following the first path, we need to measure the class qubit. To better show the different classes appearing weighted by their distance to the test point, is rewritten as
The probability of measuring a class conditioned on measuring the ancilla qubit in is
Following the second path, the test point is assigned the class that is in the majority of its nearest neighbors.
2.2. One Quantum Nearest Neighbor Algorithm from [ 6 ]
Inspired by the algorithm introduced in section 2.1, a quantum KNN algorithm is proposed in [ 6 ]. In addition to the well-known parameter K, it also requires another parameter t as we will explain next.
Steps one and two are the same as those outlined in section 2.1.
Here the CNOT gate uses as control and as target, and as a result the meaning of the is reversed, i.e., if two bits are the same then , otherwise . The reason for this reversion is explained in the next step.
Step four: compute the Hamming distance with and label the test point according to a threshold value t. This operation can be done with this unitary operator U:
where is a set that contains indexes p with Hamming distance between and .
Recall that the meaning is reversed, so the Hamming distance ≤t implies . If K is chosen according to , and let , then the condition Hamming distance ≤ t can be rewritten as
From this inequality, if the initial , then the condition of Hamming distance ≤ t can be determined by whether the sum of overflows or not.
Step five: measuring the ancilla qubit in the state of can find the training points whose Hamming distance is less than t. In this algorithm there are two parameters, K and t, to select.
To compare the two classifiers from [ 2 , 6 ] on a fair base, we only evaluate them by the nearest neighbors they select rather than the final classification results they produce, since this step is the most critical before a final classification is rendered and can be compared straightforwardly.
3.1. Dataset for Comparing the Two Algorithms
Our aim is to create a dataset that is manageable so our analysis of these two classifiers can provide understanding of how they work, reveal their quantum nature, and compare their performances. For this purpose, we create a test dataset with all 4 bit binary numbers which makes n = 4 consequently and choose four particular numbers of them as training points with the first two training points in class 0 and second two in class 1 (Table 1). Furthermore, in order to establish a benchmark to compare the two quantum classifiers in section 2, we choose K = 2 and find the nearest neighbors for each test point manually based on the Hamming distance, which provide the best possible results (Table 1).
3.2. Performance of the First Classifier in Section 2.1
According to the probability formula in Step six in section 2.1, we calculate the theoretical probability of observing each test point and each training point when measuring the ancilla qubit in state in Table 2. Since there are four training points, the maximum probability for a test point to be seen together with a training point is 0.25. The execution of this algorithm on IBM’s quantum simulator produces the observed probabilities that match the theoretical ones very well (Table 2 and Figure 3).
Each entry in Table 2 represents the theoretical or the experimental probability of observing one test point and one training point together when measuring ancilla qubit in state . To render a better illustration of these two ways of computing their probabilities, we plot them in curves in Figure 3 and we
Table 1. This table contains two sub tables. On the left is a sub table for Hamming distances between test points and training points. On the right is a sub table for the nearest neighbors of each test point selected by the Hamming distance and when K = 2, where 1 means a nearest neighbor, 0 means otherwise. This sub table displays the best possible results which will be used as a benchmark to compare the two algorithms in the next two sections.
Table 2. This table contains the probabilities computed from the theory of the algorithm in section 2.1 in the left sub table and the actual observed probabilities when running the algorithm on IBM’s simulator with shots = 8192 in the right sub table.
can visually see that the two results match perfectly, giving the random nature of reading these values on a quantum simulator.
For the sake of comparing this classifier with the one in section 2.2, we set K = 2 in this experiment to satisfy the condition of here n = 4, as required by the classifier in section 2.2. The algorithm in section 2.1 offers to two different path ways to get the final classification based on the measured probabilities, while the one in section 2.2 produces the final classification based on whether or not there is an overflow in the addition of the Hamming distances to the quantum register that holds the value of a, which depends on the value of t. For this reason, we choose to compare these two classifiers based on the nearest neighbors they generate rather than the final predication. To select the nearest neighbors for the algorithm in section 2.1, for each test point we choose the two training points with top two probabilities as nearest neighbors since K = 2. From the theory shown in Table 2, there are two equal probabilities in some cases. If this happens, we choose three nearest neighbors instead of two.
Now we have converted the probabilities of seeing a test point and a training point together generated by the classifier in section 2.1 into the nearest neighbors of a test point with the assumption that K = 2 (Table 3). With this groundwork, we can move forward to execution of the algorithm in section 2.2.
Figure 3. This figure plots the actual numerical values in Table 2 to visualize how closely the experimental results match the theoretical results. The x-axis represents the 16 test points as defined in Table 1.
3.3. Performance of the Second Classifier in Section 2.2
Since n = 4, we choose K = 2 to satisfy the condition . Let t = 1, 2, 3 respectively and the algorithm finds the nearest neighbors accordingly when running it on IBM’s quantum simulator (see Table 4). Recall from section 2.2 that the selection of nearest neighbors in this algorithm is to check if there is an overflow or not when we add the Hamming distances to the quantum register that hold the value of a(t). So only in this sense, the nearest neighbor selection process in this algorithm is deterministic.
Quantum computing is emerging as potential solution to tackle machine learning, optimization, search and other challenges that are beyond the capability of the classical computers. In this study, we compare two similar quantum nearest neighbor algorithms on a simple dataset using IBM’s quantum simulator. Our analysis suggests that the one from [ 2 ] exhibits flexibility without the need to choose K and allows the selection of K to be built into the probability of selecting a nearest neighbor. The one from [ 6 ]
Table 3. This table displays the nearest neighbors for each test point selected by the algorithm in section 2.1 when K = 2, where 1 means a nearest neighbor, 0 means otherwise. The results in this table match 100% with the best possible results in Table 1.
Table 4. This table displays the nearest neighbors for each test point selected by the algorithm in section 2.2 when K = 2 and t = 1, 2, 3, respectively, where 1 means a nearest neighbor, 0 means otherwise. Compared with the best possible results in Table 1, when t = 1, there are 29 mismatches, t = 2 match perfectly, t = 3 have 16 mismatches. The algorithm is run on IBM’s simulator with 1000 shots.
requires the choice for K and has an additional parameter t to set up and as a result, its performance heavily depends on the value of t. Both have their runtime independent of dataset size N but dependent on the number of features n for each data point, which offers a huge advantage in processing big data [ 2 , 6 ]. The quantum nature of these two classifiers is better revealed when testing them on a manageable dataset proposed in this study. The classifier from [ 2 ] demonstrates its perfect results to match the best possible results, in contrast, the results of second one from [ 6 ] vary according to different values of the parameter t, assuming both choose K = 2.
In addition to the papers that we have cited in this paper so far, we are also inspired by and have benefited from reading these ones [ 9 - 17 ].
We thank IBM Quantum Experience for use of their quantum computers and simulators.