Bayesian knowledge tracing (BKT)

Ingredients

The key definitions below largely follow [Corbett1995].

knowledge component (KC): A skill, a rule, or a principle that the student is supposed to learn through the prepared activities.
lesson: A sequence of activities, indexed by \(n = 1, 2, ...\), whose goal is to teach a KC.
\(p(L_n)\): The probability that the student knows the KC after step \(n-1\) and before step \(n\). Defined this way, it is the prior probability at step \(n\), as opposed to the posterior probability defined below (Eq. (2)). The ideal outcome of the lesson is that the sequence \(p(L_n)\) converges to 1. In that case, the prior and posterior probabilities become indistinguishable and together define the knowledge state.

Clearly, one must make some estimate of the initial knowledge, \(p(L_1)\). A BKT model has four parameters, including \(p(L_1)\):

Symbol	Meaning	Definition
\(p(L_1)\)	Initial knowledge	The probability that the student already knows the KC before the lesson.
\(p(T)\)	Transition	The probability of learning the KC at a given step, if it was not known before.
\(p(G)\)	Guess	The probability of guessing correctly without knowing the KC.
\(p(S)\)	Slip	The probability of making a mistake despite knowing the KC.

All four parameters can be taken to be the same for every student, or some or all of them can be allowed to vary by student. For instance, it is reasonable to let the initial knowledge \(p(L_1)\) depend on the individual student, while \(p(T)\) is taken to be independent of the student if it depends mostly on the quality of the task.
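As a minimal sketch of one such choice, the following keeps \(p(T)\), \(p(G)\), and \(p(S)\) shared across students and lets only \(p(L_1)\) vary. The names `BKTParams`, `params_for`, and `p_init_by_student` are hypothetical, introduced here for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class BKTParams:
    p_init: float     # p(L_1): initial knowledge
    p_transit: float  # p(T): transition
    p_guess: float    # p(G): guess
    p_slip: float     # p(S): slip


def params_for(student: str, shared: BKTParams,
               p_init_by_student: Dict[str, float]) -> BKTParams:
    """Return the parameters for one student.

    Assumption of this sketch: only p(L_1) is student-specific;
    the other three parameters are shared by all students.
    """
    if student in p_init_by_student:
        return replace(shared, p_init=p_init_by_student[student])
    return shared
```

Other splits (for example, student-specific \(p(S)\)) would follow the same pattern with a different overridden field.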

Inference chain

The following inference chain is what makes it possible to trace the student's knowledge:

\[p(L_1) \rightarrow p\left(\left.L_1\right|\text{evidence}\right) \rightarrow p(L_2) \rightarrow p\left(\left.L_2\right|\text{evidence}\right) \rightarrow \cdots \tag{1}\]

The core mechanism of this inference chain is the posterior probability that follows from Bayes’ theorem:

\[p\left(\left.L_n\right|\text{evidence}\right) = \frac{p\left(\text{evidence}\left|L_n\right.\right)\cdot p(L_n)} {p(\text{evidence})}. \tag{2}\]

Here, evidence refers to either “getting the correct answer at step \(n\)” or “getting an incorrect answer at step \(n\).” These two events can be referred to with symbols \(C_n\) and \(I_n\), respectively. Their probabilities satisfy the sum rule:

\[p(C_n) + p(I_n) = 1. \tag{3}\]

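By the definition of \(p(S)\), a student who knows the KC answers correctly with probability \(p(C_n|L_n) = 1 - p(S)\) and incorrectly with probability \(p(I_n|L_n) = p(S)\). Substituting these into (2), and using (3) to write \(p(I_n) = 1 - p(C_n)\), gives the following two equations: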

\[\begin{split}p\left(\left.L_n\right|C_n\right) &= \frac{(1-p(S))\cdot p(L_n)}{p(C_n)},\\ p\left(\left.L_n\right|I_n\right) &= \frac{p(S)\cdot p(L_n)}{1-p(C_n)}.\end{split}\]

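The probability of making the correct choice follows from the law of total probability: the student either knows the KC and does not slip, or does not know it and guesses correctly:

\[p(C_n) = p(L_n) \cdot \left(1 - p(S)\right) + \left(1 - p (L_n)\right) \cdot p(G). \tag{4}\]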

So the posterior probability can be calculated from the prior probability, provided that the values of \(p(S)\) and \(p(G)\) are known.
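The posterior update can be sketched in a few lines of Python. This is an illustration only; the function and argument names (`p_correct`, `posterior`, `p_know`, `p_slip`, `p_guess`) are assumptions introduced here, not part of [Corbett1995].

```python
def p_correct(p_know: float, p_slip: float, p_guess: float) -> float:
    """Eq. (4): marginal probability of a correct answer at step n."""
    return p_know * (1.0 - p_slip) + (1.0 - p_know) * p_guess


def posterior(p_know: float, correct: bool,
              p_slip: float, p_guess: float) -> float:
    """Posterior p(L_n | evidence) computed from the prior p(L_n).

    Assumes the observed outcome has non-zero probability under the
    model; otherwise the division below is undefined.
    """
    p_c = p_correct(p_know, p_slip, p_guess)
    if correct:
        return (1.0 - p_slip) * p_know / p_c
    return p_slip * p_know / (1.0 - p_c)
```

For example, with a prior of 0.5, \(p(S) = 0.1\), and \(p(G) = 0.2\), Eq. (4) gives \(p(C_n) = 0.45 + 0.10 = 0.55\), so a correct answer raises the estimate to \(0.45 / 0.55 \approx 0.82\), while an incorrect answer lowers it to \(0.05 / 0.45 \approx 0.11\).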

To close the chain, we must specify how to go from the posterior probability at step \(n\) to the prior probability at step \(n+1\). This is where the parameter \(p(T)\) comes in:

\[p(L_{n+1}) = p\left(\left.L_n\right|\text{evidence}\right) + \left(1 - p\left(\left.L_n\right|\text{evidence}\right)\right) \cdot p(T). \tag{5}\]

Equations (2), (3), (4), and (5), together with the definitions of the four parameters, thus completely specify the Bayesian inference chain for the student's knowledge.
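Put together, the chain in Eq. (1) amounts to a short loop over the observed answers. The following is a minimal sketch under the assumption that all four parameters are fixed and known; `bkt_trace` and its argument names are hypothetical.

```python
from typing import Iterable, List


def bkt_trace(observations: Iterable[bool], p_init: float, p_transit: float,
              p_slip: float, p_guess: float) -> List[float]:
    """Return the priors [p(L_1), p(L_2), ..., p(L_{N+1})] for N answers.

    Each observation is True for a correct answer (C_n) and False for
    an incorrect one (I_n).
    """
    p_know = p_init
    priors = [p_know]
    for correct in observations:
        # Eq. (4): marginal probability of a correct answer.
        p_c = p_know * (1.0 - p_slip) + (1.0 - p_know) * p_guess
        # Eq. (2), specialised to the two kinds of evidence.
        if correct:
            post = (1.0 - p_slip) * p_know / p_c
        else:
            post = p_slip * p_know / (1.0 - p_c)
        # Eq. (5): posterior at step n -> prior at step n + 1.
        p_know = post + (1.0 - post) * p_transit
        priors.append(p_know)
    return priors
```

A call such as `bkt_trace([True, True, False, True], p_init=0.3, p_transit=0.1, p_slip=0.1, p_guess=0.2)` returns five values: the initial estimate followed by one updated prior per answer.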

Convergence

The ideal outcome \(p(L_n) \rightarrow 1\) is a point to which the inference chain, Eq. (1), can converge: setting \(p(L_n) = 1\) in Eq. (4) gives \(p(C_n) = 1 - p(S)\), both posteriors then equal 1, and Eq. (5) returns \(p(L_{n+1}) = 1\). Near convergence, we get

\[p(C_n) \approx 1 - p(S),\quad p(L_n|\text{evidence}) \approx p(L_n),\quad p(L_{n+1}) \approx p(L_n), \tag{6}\]

where in the second expression, “evidence” can be either \(C_n\) or \(I_n\). Thus, according to this model, the pattern of evidence near convergence is determined by \(p(S)\) alone.
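The approximations in Eq. (6) can be checked numerically by setting \(p(L_n) = 1 - \epsilon\) and measuring how far each quantity is from its limiting value. The sketch below is illustrative only; `near_convergence_gaps` and `eps` are names introduced here.

```python
from typing import Dict


def near_convergence_gaps(eps: float, p_transit: float,
                          p_slip: float, p_guess: float) -> Dict[str, float]:
    """Absolute errors of the approximations in Eq. (6) at p(L_n) = 1 - eps."""
    p_know = 1.0 - eps
    p_c = p_know * (1.0 - p_slip) + (1.0 - p_know) * p_guess  # Eq. (4)
    post_c = (1.0 - p_slip) * p_know / p_c          # posterior given C_n
    post_i = p_slip * p_know / (1.0 - p_c)          # posterior given I_n
    next_c = post_c + (1.0 - post_c) * p_transit    # Eq. (5) after C_n
    next_i = post_i + (1.0 - post_i) * p_transit    # Eq. (5) after I_n
    return {
        "p_correct": abs(p_c - (1.0 - p_slip)),
        "posterior_given_correct": abs(post_c - p_know),
        "posterior_given_incorrect": abs(post_i - p_know),
        "next_prior_after_correct": abs(next_c - p_know),
        "next_prior_after_incorrect": abs(next_i - p_know),
    }
```

Expanding the posterior for \(I_n\) to first order gives \(p(L_n|I_n) \approx 1 - \epsilon\,(1 - p(G))/p(S)\), so the approximation after an incorrect answer holds only when \(\epsilon\) is small compared with \(p(S)\); when \(p(S)\) is itself very small, a single incorrect answer can pull the estimate well away from 1.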

For the future

What does perturbation theory near convergence look like?

Is any other convergence possible? That is, is there a non-trivial fixed point of the mapping \(p(L_n) \rightarrow p(L_{n+1})\), regardless of the evidence (in some average sense)? In a simple-minded mathematical sense, the answer is no. It nevertheless seems possible that the evidence fluctuates up and down while \(p(L_n)\) stays at the same value on average. Such semi-convergence might occur if the student is not really trying, but is choosing answers at random out of boredom or fatigue.
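One way to make the simple-minded answer explicit is to average over the evidence, assuming that the evidence is generated with the model's own probabilities \(p(C_n)\) and \(p(I_n)\). The expectation symbol \(\mathbb{E}\) is introduced here for this sketch:

\[\begin{split}\mathbb{E}\left[p\left(\left.L_n\right|\text{evidence}\right)\right] &= p(C_n)\cdot p\left(\left.L_n\right|C_n\right) + p(I_n)\cdot p\left(\left.L_n\right|I_n\right)\\ &= (1-p(S))\cdot p(L_n) + p(S)\cdot p(L_n) = p(L_n),\\ \mathbb{E}\left[p(L_{n+1})\right] &= p(L_n) + \left(1 - p(L_n)\right)\cdot p(T).\end{split}\]

Under this assumption, the averaged map has a fixed point only where \(\left(1 - p(L_n)\right)\cdot p(T) = 0\), that is, only at \(p(L_n) = 1\) when \(p(T) > 0\). A student who answers at random produces evidence whose frequencies differ from \(p(C_n)\), so this argument does not by itself rule out the semi-convergence described above.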