Traditional computers have reached their limits of processing the ever growing data today. Quantum computing created from computer science and quantum physics emerges as a new computation paradigm, holding the promise to generate the next computing revolution. While classical computers can only store or process one of the two bits “0” or “1”, quantum computers can make use of the superposition of two quantum states and . As such they can store classically exponential size of data as linear size and can explore many qubits simultaneously. Put it in another way, a quantum computer can store all the binary numbers of the form simultaneously while a classical computer can only store one of them. As a result, a quantum computer can calculate multiple function f(x) values simultaneously, but a classical computer can only do one at a time. However, due to the laws of quantum mechanics, in general we cannot get all these function values by one single measurement. Rather, the key challenge in quantum algorithm design is to be creative in getting more information out through one measurement.
In certain areas such as machining learning, big data, and artificial intelligence, quantum computing has demonstrated dramatic speedups. The peculiar features of quantum computing such as super position, entanglement, and interference of quantum states are generally considered resources for this speed up. Examples of quantum machine learning algorithms can be found in [ 1 - 9 ].
Reinforcement learning is an area of machine learning in which a learning agent can learn from its actions taken in an environment. The nature of this kind of learning is illustrated in how the agent adjusts its actions in order to achieve its goal, say getting the maximum of reward. This is different from a supervised learning where the model learns from data directly. In this sense, we can say that reinforcement learning is learning from its actions and their feedback from the environment, so it learns from data (of interactions with the environment) indirectly (Figure 1). Some of the well-known reinforcement learning algorithms are Q-learning and SARSA [ 10 - 12 ].
A new model of reinforcement learning based on projective simulation is proposed in [ 13 ] and another variant of it, reflecting projective simulation, is shown to gain a quadratic speed up in the agent’s deliberation time when it needs to adapt to rapidly changing environments [ 14 ].
As quantum machine learning is an emerging new field, it is very necessary to understand the characteristics of different quantum computing implementations and how they affect the performance of quantum algorithms. IBM recently released the Quantum Experience to make quantum computing available to the public. It allows users to connect to IBM’s quantum processor via the IBM Cloud to learn the nature of quantum computing and to create and test different quantum algorithms on a real quantum processor [ 15 ]. One recent work [ 16 ] investigates the impact on a quantum classifier caused by the routinely used swap operation between two qubits on IBM’s 5Q because of its star topology (Figure 2). Some other work using IBM’s 5Q can be found in [ 17 , 18 ], and a good textbook on quantum computing is [ 19 ].
Figure 1. An illustration of the general idea of reinforcement learning: the learning of an agent is through its interactions with its environment. It takes an action and then the environment gives the feedback to the agent, and then agent learns to do better next time.
Figure 2. A schematic demonstration of the five qubits in IBM’s 5Q chip, an image taken from
2. Reflecting projective simulation
2.1. General Description
In projective simulation, a learning agent receives a sensory input from its environment, that is, a percept (from some set of percepts ) and, based on the percept, it takes an action from the possible set of actions This learning model uses a memory for its previous sequences of percept-action events, helping the agent to simulate future action before real action is taken. In a sense, the reflection on the percept-action history or memory helps the agent to choose a better action.
The process of the internal memory could be modeled as a Markov chain, and therefore can be conducted by discrete-time stochastic diffusion processes. They can be realized in a variety of physical systems, say in a quantum computing system. The deliberation process of the agent is based on these diffusion processes, in which the relevant Markov chain is diffused a particular number of times until a desired action is output. The choice of the action is dictated by the probability distribution from a Markov chain realized by the diffusion process.
The projective simulation learning also allows for many additional structures, which can be some percept-specific flags that are subsets of actions assigned to each percept to represent the agent’s short-term memory, a feature that significantly improves the performance of the model [ 14 ].
In reflecting projective simulation (RPS), the reflection process of the agent is defined as many repetitions of the diffusion processes, where the agent takes its actions based on a specific probability distribution that can be updated during the learning process. This can helps the agent to react to the changing environment. In another word, there are two kinds of time quantities 1/δ and 1/ε, where the first measures time needed to generate the specified distribution in the agent’s internal memory and the second is the time to sample a desirable action from it.
The work in [ 20 ] creates a quantum algorithm for an ion trap quantum processor that reduces the deliberation time to , compared with the 1/(δε) time in the classical case. It studies a simplified version of RPS that uses rank-one Markov chains. In this special case, the entire Markov chain can converge in one step (δ = 1). So the time efficient issue of learning only has to deal with the time to sample the correct actions from the converged probability distribution which serves as like a policy for the agent in Q-learning or SARSA. The general algorithm can be modified as follows. First assume there are n flagged actions out of N actions ( ) and use to denote the initial probability distribution for these flagged actions, and to represent the final stationary probability distribution for these
flagged actions. In the initialization stage, the state is prepared and the optimal number of k diffusion steps is carried out [ 21 ], with k = round where is
the probability to sample a flagged action from the final stationary distribution. The reflection over these actions can be defined as:
After running the diffusion step a few time, a sample is taken from the distribution and if the sampled action is marked with a flag then the agent will take it as its action, otherwise the algorithm starts over again. The diffusion process consists of two reflections, over the flagged actions and over the distribution. The findings from [ 20 ] suggest that their quantum RPS (Q-RPS) agents requires an average of samples until obtaining a flagged action while classical agents need O (1/ε) samples.
The goal of Q-RPS algorithm is to increase the probability of obtaining a flagged action such that while maintaining the relative probabilities of the flagged actions according to the initial distribution, i.e., for any . Therefore, the main focus of the study in [ 20 ] is how to increase and maintain the same ratios of and given the initial distribution of . One way to illustrate the application of Q-RPS is through a simple example. Next we will explain a toy game that demonstrates the situations where this quantum learning agent can used.
2.2. Invasion Game
Let us introduce a simple game called invasion (see Figure 3). It has two parties, an attacker (A) and a defender (D). The goal of the attacker A is to enter the other side of the wall by going through a hole in the wall, which are placed at equal distances. The defender D will try to block a hole and thereby prevent A from invasion.
Assume that the attacker A uses a strategy that is unknown to the defender D, but, before each move, A shows some symbol that indicates its next move. In this case, it could be an arrow pointing right or left, indicating the direction of the subsequent move. The meaning of the symbols is priori unknown to D, but the symbols can be perceived and distinguished by D. In order for D to win the game, it needs to learn the meaning of the symbols during the game time. Here we also assume that the meaning of the symbols could be changed after some time, say to switch the meaning of the arrows so the right arrow means going to left. In summary, there are at least two learning tasks for D. One is to learn the meaning of the symbols and the other is to relearn the meaning if they get changed.
The main purpose of our work is to empirically study the quantum advantage for the defender D if it chooses to use some quantum algorithms for its learning as already illustrated in [ 20 ].
2.3. Quantum Circuit Design for Q-RPS
The ion trap quantum system in [ 20 ] uses two qubits to represent four action states: , , , and , two of which are flagged: and . The initial probability distribution of is prepared in this state:
here , is the Pauli matrix Y for qubit j. , is the Pauli matrix Z for qubit j. Given , and can be obtained via and #Math_43# respectively.
The reflection over the flagged actions is defined as:
and the reflection over the initial probability distribution of is represented as:
where is the unitary matrix for the controlled not gate. The whole Q-RPS is encoded in the following quantum circuit (see Figure 4).
Figure 3. An illustration of the invasion game [ 13 ]: the Attacker agent A tries to go to the other side of the wall by entering one of the holes in the wall, while the Defender agent D tries to guess A’s next move from learning the meaning of a symbol shown.
Figure 4. An implementation of the circuit in [ 20 ] for the Q-RPS in IBM’s quantum composer, where θ1 = 2.0394 and θ2 = 1.5694 as used in the experiments in [ 20 ]. The first two U3 gates prepare the input states, the U1 gate serves as refA and the gates after U1 make up refα.
There is a randomness in quantum computing, therefore, a quantum algorithm has to repeat multiple times in order to get a stable reading of its results. IBM’s quantum experience calls each such time a shot. Since the experimental results from [ 20 ] is based on 1600 shots, we decide to run the Q-RPS algorithm on IBM’s 5Q using 1000 shots or multiples of 1000 to set up a fair base for comparison. Another reason is that a maximum of 1024 shots is the limit with our current credit. We use 1000 shots on IBM’s simulator to match the same number of shots on IBM’s 5Q, and also the maximally allowed 8192 shots on the simulator to get the best possible results. Because quantum computing involves randomness, all our experiments are conducted with various shots on IBM’s 5Q or IBM’s quantum simulator and the averaged results are reported here.
3.1. Running the Experiments in [ 20 ] on IBM’s 5Q Processor and IBM’s Quantum Simulator
The major task of the work in [ 20 ] is to elucidate the two characteristic features of rank-one Q-RPS: 1) given , how much can be increased while maintaining the same ratio value during the whole process (Table 1) 2) how does the ratio of depend on that of in the process of increasing (Table 2). The first experiment is to fix and let vary so the optimal k value moves from 1 to 7 and the second fix to two values so the optimal k = 1 in one case and k = 3 in the other case and let change. Our purpose here is to run the same experiments as those reported in [ 20 ] on their ion trap quantum processor, but on IBM’s 5Q computer. Hopefully different behaviors of these kinds of quantum hardware systems can be observed. The findings in [ 20 ] imply that their ion trap system tends to overestimate the values it computes for the Q-RPS agent. Using IBM’s simulator as a baseline estimator, Table 1 shows that the values generated by IBM’s 5Q (1000 shots) are closer to the true values of than those by the ion trap system (1600 shots) from [ 20 ] but the theoretical analysis indicates that the two ratios are supposed to be the same. However, due to the noise and decoherence of quantum computing, it is not observed here. To visualize the quantum hardware difference between the ion trap system and IBM’s 5Q, we also display the bar charts of the two (Figures 5-7). Numerically, there is no clear difference between 1000 shots and 10x1000 shots for IBM’s 5Q, implied by Table 1 and Figures 5-7.
Table 2 and Figure 6 and Figure 7 highlight that IBM’s 5Q is more accurate in the case for k = 1 and underestimate in the case for k = 3, remembering that the ion trap system shows overestimate in both cases [ 20 ]. More specifically, the values from IBM’s 5Q are closer to the values than those from the
Figure 5. Bar chart to visualize the key numerical values in Table 1. The theoretical εb value is 1, so all other εb values should get close to 1.
Table 1. Experiment that changes input εa and see how well ion trap, 5Q, or simulator can increase εa. In this case γa remains 1.
Table 2. Experiment that changes ra and see how well rb can keep close to ra. The theory suggests that they should be the same.
Figure 6. Bar chart to visualize the key numerical values in Table 2 for k = 1. The values in the x-axis represent the seven different inputs in Table 2 for k = 1. This plot shows ion trap tends to overestimate and IBM’s 5Q is more accurate.
Figure 7. Bar chart to visualize the key numerical values in Table 2 for k = 3. The values in the x-axis represent the six different inputs in Table 2 for k = 3. This plot shows ion trap tends to overestimate and IBM’s 5Q tends to underestimate.
ion trap (k = 1 in this case), and the values from IBM’s 5Q show underestimate while those from ion trap display overestimate.
Here we show the results from IBM simulator with 1000 shots to illustrate the best possible outcome for the results from IBM 5Q with 1000 shots. The case of and is a difficult learning task as IBM’s simulator runs 10 times of the maximum shots of 8192 and can only get 1.98 with the target of 1.9994. Also, with 10 × 1000 shots, it gets 1.90002.
It would be informative if we could run the experiment for the case of and in Table 2 on IBM’s simulator to see the whole process of values approaching to its final number 1.98 using various shots from 20 to 8192. The result of this run is shown in Figure 7 with values along with their standard deviations. We also run the same experiment on IBM’s 5Q with various shots from 20 to 1000, see Figure 8. It is interesting to see that std = 0.0981 for IBM’s 5Q and std = 0.1204 for IBM’s simulator at 1000 shots, and the real device’s std value is even smaller than that of the simulator. In order for the simulator to reduce its std value to std = 0.0974, it needs to run up to 2000 shots. The values in Figure 8 and Figure 9 are all based 10 repetitions of each run since we need to take a standard deviation of these values.
Figure 8. Curve for rb and curve for its standard deviation for input a00 = 0.03357, a01 = 0.01679 from IBM simulator with different shots 20 - 8192.
Figure 9. Curve for rb and curve for its standard deviation for input a00 = 0.03357, a01 = 0.01679 from IBM 5Q with different shots 20 - 1000.
To summarize our results in this section, we can say that using IBM’s simulator as a baseline, IBM 5Q is more accurate than the ion trap quantum system in calculating the probabilities for the flagged actions for the agent and also tends to underestimate if any.
3.2. Comparing the Quantum Circuits with and without Pi Rotation on IBM’s 5Q Processor and IBM’s Quantum Simulator
From the geometric meaning of , we can simplify the reflection equation in (3) by removing the pi rotation:
We first show that these two circuits are equivalent empirically by running the same experiment using IBM’s quantum simulator with maximum 8192 shots, which show that they produce the same outcome. Then we test them on IBM’s 5Q. For this purpose we select one of the hard learning tasks in experiment shown in Table 1, which requires k = 7 to achieve the optimal result.
It also allows us to run the experiment for various k values before and after k = 7 to gain the whole picture of their work. Surprisingly enough, our simplified circuit outperforms the original one. Note that in this run, we use the same and for each k in the range of 1 to 7, showing the gradual approaching to the limit values of and .
For each k from 1 to 7, the two circuits produce the same results on IBM’s simulator, minus the randomness of quantum computing. The reason to choose maximum of 8192 shots in this run is to ensure the randomness effects of quantum computing is reduced to a possible minimum. Table 3 shows that our simplified circuit is equivalent to the original one, and the natural question is how they differ on a real quantum device? For this end, we first run them on IBM’s simulator to establish a baseline line for subsequent comparison (Figure 10), then run them on IBM’s 5Q, and the results are displayed in Figures 11-13, which indicate that the performance of the simplified circuit is much closer to the best possible result of the original circuit in [ 20 ]. To get a bigger picture, we intentionally run our experiment from k = 1 up to k = 9 to reveal their performance before and after the optimal value k = 7.
Table 3. Experiment to show that the performances of the original circuit and our simplified one on IBM’s simulator are the same for input a00 = 0.0055 and a01 = 0.0055 which is used in Table 1.
Figure 10. Bar chart to visualize the key numerical values in Table 3, which serves as a baseline for the performance comparison in Figures 11-13. In this plot, the values in the x-axis are the k values from 1 to 9. The optimal k value is 7, but here we show many k values before and after 7 to reveal the whole picture. This plot also shows that the optimal k value is 7.
Figure 11. Run the original circuit on IBM’s 5Q with 10 × 1000 shots. In this plot, the values in the x-axis are the k values from 1 to 9. The bar charts in this figure are quite different from the simulator in Figure 10.
Figure 12. This figure is the same as Figure 11 with one extra: we have added the result from the ion trap in [ 20 ] for k = 7 here to visually show the behaviors. We place this result at position 10 and we should not think it is for k = 10. It is the run for k = 7 from [ 20 ]. The goal of this experiment is to increase εb value, so we can see that at k = 7, IBM’s 5Q is better than that of the ion trap at position 10.
Figure 13. This is the same figure as Figure 11 but for the results from our simplified circuit. We can see that at each k, the 50 - 50 ratio of b00 and b01 is well maintained and k = 6 is the best result. This is expected as the round off formula for k in section 2.1 can be off by one. The patterns of the bar charts in this figure are similar to the best possible results from the simulator in Figure 10, which suggests that the results in this figure from our simplified circuit are better than those from the original one in [ 20 ].
As we are having more data and facing more complex problems today than yesterday, reaching the limits of the capability of classical computers, the call for advancement of quantum computation hardware and quantum algorithms is so clear and urgent. Quantum machine learning has become a matter of interest recently due to its great potential as a possible solution to some hard computing challenges. Research has shown that artificial intelligence, and in particular machine learning can benefit from quantum computing.
In a typical reinforcement learning, a learning agent needs to learn from its interactions with its environment. But the environment could change in the real world. Consequently, the agent needs to react to this change. How a quantum processor can speed up the decision making for the agent in this case is the focus of the study in [ 20 ], which also suggests that their ion trap system tends to overestimate. Our study investigates how IBM’s 5Q computer enables the learning of this AI agent.
Our work uses IBM’s simulator as a baseline and demonstrates that IBM’s 5Q is more accurate than the ion trap system and tends to underestimate if any, during the decision making of this agent. Furthermore, our analysis reveals that the quantum circuit used in [ 20 ] could be simplified. When tested on IBM’s 5Q, our simplified circuit performs better than the ion trap system on one of the hard learning tasks reported in Table 1.
Quantum computing relies on multiple repetitions of the same experiment to get a high accurate answer. In this direction, our analysis seems to demonstrate that the current quantum computing technologies do not provide highly accurate calculation when needed, compared with the accuracy that the classical computing offers today.
As more and more quantum machine learning algorithms are being developed and tested on different quantum hardware systems, it is desirable to learn the influence of quantum hardware on the performance of these quantum algorithms. Our work helps to understand the potential of a quantum computer in the field of artificial intelligence and has developed new insight into how the features and constraints of quantum hardware affect the quantum algorithms.
We are grateful to Theeraphot Sriarunothai, one of the authors of [ 20 ], for valuable discussions.
 Schuld, M., Fingerhuth, M. and Petruccione, F. (2017) Implementing a Distance-Based Classifier with a Quantum Interference Circuit. A Letters Journal of Exploring the Frontiers of Physics, 119, 60002.
 Schuld, M., Sinayskiy, I. and Petruccione, F. (2014) Quantum Computing for Pattern Classification. In: Palm, D.-N. and Park, S.-B., Eds., Pacific Rim International Conference on Artificial Intelligence, Spring International Publishing, Switzerland, 208-220.
 Schuld, M., Sinayskiy, I. and Petruccione, F. (2014) Quantum Walks on Graphs Representing the Firing Patterns of a Quantum Neural Network. Physical Review A, 89, Article ID: 032333.
 Cai, X.-D., Wu, D., Su, Z.-E., Chen, M.-C., Wang, X.-L., Li, L., Liu, N.-L., Lu, C.-Y. and Pan, J.-W. (2015) Entanglement-Based Machine Learning on a Quantum Computer. Physical Review Letters, 114, Article ID: 110504.
 Ganger, M., Duryea, E. and Hu, W. (2016) Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing, 4, 159-176.
 Paparo, G.D., Dunjko, V., Makmal, A., Martin-Delgado, M.A. and Briegel, H.J. (2014) Quantum Speedup for Active Learning Agents. Physical Review X, 4, Article ID: 031002.
 Alsina, D. and Latorre, J.I. (2016) Experimental Test of Mermin Inequalities on a Five-Qubit Quantum Computer. Physical Review A, 94, Article ID: 012314.
 Sriarunothai, T., Wolk, S., Giri, G.S., Friis, N., Dunjko, V., Briegel, H.J. and Wunderlich, C. (2017) Speeding-Up the Decision Making of a Learning Agent using an Ion Trap Quantum Processor.
 Grover, L.K. (1996) A Framework for Fast Quantum Mechanical Algorithms. In: Proceedings of the 28th Annual ACM Symposium on Theory of Computing, ACM, New York, 212-219.