Parameters
The key part of a BKT application is estimating the best parameters.
The four parameters, \(p(L_1), p(G), p(S)\) and \(p(T)\), must be optimized in order to ensure the quality of the model. By definition, the best parameters are those that make the model perform the best in predicting the outcome of student work in a task related to the acquired knowledge.
Slip parameter
This parameter, \(p(S)\), is critical, since when \(p(L_n)\) approaches 1, this is the single parameter that matters the most, as discussed in Section Convergence.
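As a short worked equation, assuming the standard BKT prediction rule for a correct response at step \(n\) (the source does not restate it here), the limit makes the role of the slip parameter explicit:

\[
p(\text{correct}_n) = p(L_n)\,\bigl(1 - p(S)\bigr) + \bigl(1 - p(L_n)\bigr)\,p(G)
\;\longrightarrow\; 1 - p(S)
\quad \text{as } p(L_n) \to 1 .
\]

Under this assumption, once mastery is nearly certain the predicted success rate is set by \(p(S)\) alone, and \(p(G)\) drops out.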
Important as it is, its meaning is open to some interpretation.
[Baker2010] raises the following points:
• Recently, there has been work towards contextualizing the guess and slip parameters (Baker, Corbett, & Aleven, 2008a, 2008b)
• Do we really think the chance that an incorrect response was a slip is equal when
– Student has never gotten action right; spends 78 seconds thinking; answers; gets it wrong
– Student has gotten action right 3 times in a row; spends 1.2 seconds thinking; answers; gets it wrong
Also, [Gobert2013] makes two interesting points about the slip parameter. First, slips seem to occur more readily for students who initially struggled and then attained mastery. Second, the slip parameter may partly reflect differences in how students perceive their own knowledge, even after the learning software has declared mastery.
For the future
In an ideal model, one would expect the slip tendency to decrease as more time is spent on a task. Perhaps an exponential function of time is appropriate.
However, one cannot rule out a distraction factor. So, after a certain time, the slip parameter may bottom out, or even go up slightly.
If the lesson continues after mastery is acquired, \(p(L)\) might not change much, but \(p(S)\) may keep decreasing; one might call this a hardening or a tempering of knowledge. It is not enough to acquire knowledge. One must also apply the acquired knowledge to different problems, gain experience, and make it more rounded. Doing so, I think, mainly reduces \(p(S)\).
These thoughts suggest that \(p(S)\) must be made a function of time and of the number of activities “during the hardening period.” We may regard spending time or trying various applications as an anti-slip hardening process.
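One way to picture such a slip function is the minimal sketch below. Everything in it is an assumption made for illustration: the exponential form, the floor `s_floor` (standing in for the point where slip bottoms out), the scale names `tau` and `N`, and all default values. It does not model the possible slight rise caused by distraction.

```python
import math

def slip_probability(t, k, s0=0.10, s_floor=0.02, tau=30.0, N=5.0):
    """Hypothetical p(S) after t seconds of thought and k practice activities.

    s0      : slip probability at the start of the hardening period (assumed)
    s_floor : value at which slip bottoms out (assumed)
    tau     : time scale of thought-driven hardening, in seconds (assumed)
    N       : scale of exercise-driven hardening, in activities (assumed)
    """
    # Both effects shrink the excess slip above the floor exponentially.
    decay = math.exp(-t / tau) * math.exp(-k / N)
    return s_floor + (s0 - s_floor) * decay

# Slip falls with more thinking time and with more completed activities.
quick = slip_probability(t=1.2, k=0)
careful = slip_probability(t=78.0, k=0)
practiced = slip_probability(t=1.2, k=10)
```

The multiplicative form is only one choice; an additive combination of the two decays would serve equally well as a starting point.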
Optimization
Now, it has been stated already that the parameters (four, or more if the slip parameter is modeled further) must be optimized. In the literature, various approaches seem to have been tried, including a brute-force approach: build a four-dimensional grid, evaluate \(\chi^2\) at every grid point, and take the set of parameters that minimizes \(\chi^2\). This is equivalent to minimizing the “residual” as explained in [BakerWWW].
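The brute-force idea can be sketched generically as follows. This is an illustration, not the procedure of [BakerWWW]: the function name `grid_search`, the use of cell midpoints as grid values, and the `objective` callable (which would return \(\chi^2\) for a given parameter tuple) are all assumptions.

```python
from itertools import product

def grid_search(objective, n_params=4, steps=10):
    """Evaluate `objective` on a regular grid over (0, 1)^n_params.

    Returns the grid point with the smallest objective value and that value.
    The cost is steps ** n_params evaluations.
    """
    # Cell midpoints keep every probability strictly between 0 and 1.
    grid = [(i + 0.5) / steps for i in range(steps)]
    best = min(product(grid, repeat=n_params), key=objective)
    return best, objective(best)
```

In a BKT fit, `objective` would take the tuple \((p(L_1), p(G), p(S), p(T))\) and return the sum of squared differences between the observed grades and the model predictions.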
Much effort appears to go into this front. This is understandable, since a multi-parameter least-squares fit is often a rather ill-defined process because of local minima. It seems reasonable that this type of approach would work fine, as long as the model is reasonable and the parameter ranges are narrow, so that the answers are roughly known from the beginning.
In any case, the following approach might be tried as an improvement on the brute-force approach of [BakerWWW].
Define \(D(n)\) as the data to be fit: the grade given to the student response at step \(n\). In the simple BKT model, \(D(n)\) is either 1 or 0. However, it could in general take any value from 0 to 1, end points included.
Now, the theory function \(T(n; a_i, D)\) can be calculated as a function of the parameters. This function will give a value ranging from 0 to 1, end points included. Here, the \(a_i\) are fit parameters with \(i=1,2,3,4\): they are \(p(L_1), p(G), p(S)\) and \(p(T)\), respectively.
Note that \(T\) depends not only on \(n\) and \(a_i\), but also on \(D\), the data itself. So it is not a conventional function; it is a functional of \(D\). Does the Levenberg-Marquardt theory continue to work in this case?
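To make the dependence of \(T\) on \(D\) concrete, here is a minimal sketch of the theory function, assuming the standard BKT recursion (prediction, Bayesian update on the observed grade, then the learning transition). The names `bkt_theory` and `chi2` are hypothetical, grades are restricted to 0 or 1, and \(\chi^2\) is taken to be an unweighted sum of squares.

```python
def bkt_theory(params, data):
    """Return [T(1), T(2), ...]: predicted probability of a correct answer.

    params = (p_L1, p_G, p_S, p_T); data = list of grades D(n), each 0 or 1.
    T(n) depends on D(1..n-1) because each grade updates the estimate p(L_n).
    """
    p_L, p_G, p_S, p_T = params
    predictions = []
    for d in data:
        # Prediction made before the n-th response is seen.
        predictions.append(p_L * (1.0 - p_S) + (1.0 - p_L) * p_G)
        # Posterior probability of mastery given the observed grade.
        if d == 1:
            num = p_L * (1.0 - p_S)
            den = num + (1.0 - p_L) * p_G
        else:
            num = p_L * p_S
            den = num + (1.0 - p_L) * (1.0 - p_G)
        posterior = num / den if den > 0 else p_L
        # Learning transition after the practice opportunity.
        p_L = posterior + (1.0 - posterior) * p_T
    return predictions

def chi2(params, data):
    """Unweighted sum of squared residuals between D(n) and T(n)."""
    return sum((d - t) ** 2 for d, t in zip(data, bkt_theory(params, data)))
```

For a fixed data set the recursion is a smooth function of the four parameters as long as they stay strictly between 0 and 1, which is what a gradient-based fit needs in practice.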
Let us assume that the standard Levenberg-Marquardt theory works fine; in fact, we need not worry much about the theory aspect, since all we want to achieve is the minimization of \(\chi^2\). Then, we can call the Levenberg-Marquardt algorithm to fit \(T\) to \(D\), since the algorithm simply searches for a minimum of \(\chi^2\) by interpolating between Gauss-Newton steps and steepest descent.
Try random initial values for the parameters and make a map of the converged results, which should expose any local minima.
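The multi-start procedure might look like the sketch below. It is a simplified, pure-Python damped least-squares loop in the Levenberg style (damping term \(\lambda I\)), not a production implementation; a library routine would normally be used instead. The clipping of parameters to \([0.01, 0.99]\), the forward-difference Jacobian, the damping schedule, and all function names are assumptions, and the residuals assume the standard BKT recursion with 0/1 grades.

```python
import random

def bkt_residuals(params, data):
    """Residuals D(n) - T(n) for the standard BKT recursion (assumed form)."""
    p_L, p_G, p_S, p_T = params
    res = []
    for d in data:
        res.append(d - (p_L * (1 - p_S) + (1 - p_L) * p_G))
        if d == 1:
            num, den = p_L * (1 - p_S), p_L * (1 - p_S) + (1 - p_L) * p_G
        else:
            num, den = p_L * p_S, p_L * p_S + (1 - p_L) * (1 - p_G)
        post = num / den
        p_L = post + (1 - post) * p_T
    return res

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def levenberg_marquardt(residuals, start, n_iter=100, lo=0.01, hi=0.99):
    """Minimize sum(residuals(p)**2) with damped Gauss-Newton steps."""
    p, lam, h, n = list(start), 1e-2, 1e-6, len(start)
    r = residuals(p)
    cost = sum(v * v for v in r)
    for _ in range(n_iter):
        # Forward-difference Jacobian, stored column by column.
        cols = []
        for j in range(n):
            q = p[:]
            q[j] += h
            cols.append([(a - b) / h for a, b in zip(residuals(q), r)])
        # Damped normal equations: (J^T J + lam * I) step = -J^T r.
        A = [[sum(x * y for x, y in zip(cols[i], cols[j])) + (lam if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        g = [-sum(x * y for x, y in zip(cols[i], r)) for i in range(n)]
        step = solve(A, g)
        trial = [min(hi, max(lo, a + s)) for a, s in zip(p, step)]
        r_trial = residuals(trial)
        c_trial = sum(v * v for v in r_trial)
        if c_trial < cost:   # accept: move toward Gauss-Newton
            p, r, cost, lam = trial, r_trial, c_trial, max(lam / 10, 1e-6)
        else:                # reject: move toward steepest descent
            lam *= 10
            if lam > 1e8:
                break
    return p, cost

def multi_start(data, n_starts=20, seed=0):
    """Fit from random starting points; return (start, fit, chi2) triples."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        start = [rng.uniform(0.05, 0.95) for _ in range(4)]
        fit, cost = levenberg_marquardt(lambda p: bkt_residuals(p, data), start)
        results.append((start, fit, cost))
    return results
```

Plotting or tabulating the returned triples gives the map of converged results: starts that end at different parameter sets with different \(\chi^2\) values indicate distinct local minima.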
For the future
The above approach may be modified to include anti-slip hardening. In this case, \(D\) must be regarded as \(D(n,t)\), where \(t\) is shorthand for all the time information recorded during the lesson.
The fitting procedure above will not change; only the computation of the theory function \(T(n; a_i, D, t)\) becomes more complicated. To model the hardening process, we must add input parameters, so there will be more than four \(a_i\). If we assume that
• thought-invoked hardening (the more time the student spends, the less slip) is parameterized by one time-scale parameter, \(\tau\),
• exercise-driven hardening (the more problems the student solves, the less slip) is parameterized by one scale parameter, \(N\),
• and the threshold for \(p(L)\) is given by some number close to 1 (hardening kicks in only once mastery is nearly achieved),
then we will have seven parameters in total, not four. Within the Levenberg-Marquardt algorithm, this is perfectly doable, while the brute-force method will suffer greatly as the number of parameters increases.
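As a rough worked example of that scaling, assume a grid with \(g\) points per parameter (the value \(g = 10\) is chosen only for illustration). The number of \(\chi^2\) evaluations for a full grid search is then

\[
g^{4} = 10^{4} = 10{,}000
\quad \text{versus} \quad
g^{7} = 10^{7} = 10{,}000{,}000 ,
\]

so the three extra parameters multiply the brute-force cost by \(g^{3} = 1000\).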