Here I will briefly discuss the concept behind the learning robot and its environment, based on the variable structure stochastic learning automaton (VSLA). This will enable the robot to learn in a way similar to how a child learns.
First of all, the robot can choose from a finite number of actions (e.g. drive forwards, drive backwards, turn right, turn left). Initially, at time t = n = 1, one of the possible actions α is chosen by the robot at random with a given probability p. This action is then applied to the random environment in which the robot "lives", and the response β from the environment is observed by the robot's sensor(s).
The feedback β from the environment is binary, i.e. it is either favorable or unfavorable for the given task the robot should learn. We define β = 0 as a reward (favorable) and β = 1 as a penalty (unfavorable). If the response from the environment is favorable (β = 0), the probability pi of choosing that action αi for the next time step t = n + 1 is increased according to the updating rule Τ; an unfavorable response (β = 1) decreases it.
After that, another action is chosen and the response of the environment is observed. When a certain stopping criterion is reached, the algorithm stops and the robot has learned some characteristics of the random environment.
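To make this loop a bit more concrete, here is a minimal Python sketch (not the actual robot code): the environment interaction and the updating rule Τ are passed in as functions, and a fixed number of steps stands in for the stopping criterion.

```python
import random

def learn(p, apply_action, update, steps=5000):
    """Sketch of the VSLA loop: pick an action by its probability,
    observe the binary response, update the probabilities with T."""
    r = len(p)
    for n in range(steps):
        i = random.choices(range(r), weights=p)[0]  # choose action alpha_i
        beta = apply_action(i)                      # 0 = reward, 1 = penalty
        p = update(p, i, beta)                      # apply updating rule T
    return p
```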
We define furthermore:
α = {α1, α2, ..., αr} is the finite set of r actions/outputs of the robot. The output (action) applied to the environment at time t = n is denoted by α(n).
β = {0, 1} is the binary set of inputs/responses from the environment. The input (response) applied to the robot at time t = n is denoted by β(n). In our case, the values of β are chosen to be 0 or 1: β = 0 represents a reward and β = 1 a penalty.
P(n) = {p1(n), p2(n), ..., pr(n)} is the finite set of probabilities that a certain action α(n) is chosen at time t = n.
Τ is the updating function (rule) according to which the elements of the set P are updated at each time t = n. Hence

P(n + 1) = Τ[α(n), β(n), P(n)],

where the i-th element of the set P(n) is

pi(n) = Pr{α(n) = αi} with i = 1, 2, ..., r,

and

p1(n) + p2(n) + ... + pr(n) = 1.
C = {c1, c2, ..., cr} is the finite set of penalty probabilities, where ci = Pr{β(n) = 1 | α(n) = αi} is the probability that action αi will result in a penalty input from the random environment. If the penalty probabilities are constant, the environment is called a stationary random environment.
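As a stand-in for the real world the robot lives in, a stationary random environment can be simulated with fixed penalty probabilities ci. The class below is purely illustrative (names and values are made up), not part of the robot firmware.

```python
import random

class StationaryEnvironment:
    """Toy stationary random environment: each action alpha_i has a
    fixed penalty probability c_i."""
    def __init__(self, penalty_probs):
        self.c = penalty_probs

    def respond(self, i):
        # Return beta = 1 (penalty) with probability c_i, otherwise beta = 0 (reward).
        return 1 if random.random() < self.c[i] else 0
```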
The updating functions (reinforcement schemes) are categorized based on their linearity. The general linear scheme is given by:
If α(n) = αi,

for a favorable response (β(n) = 0):
pi(n + 1) = pi(n) + a·[1 − pi(n)]
pj(n + 1) = (1 − a)·pj(n) for all j ≠ i,

for an unfavorable response (β(n) = 1):
pi(n + 1) = (1 − b)·pi(n)
pj(n + 1) = b/(r − 1) + (1 − b)·pj(n) for all j ≠ i,

where a and b are the learning parameters with 0 < a, b < 1.
If a = b, the scheme is called the linear reward-penalty scheme, which is the earliest scheme considered in mathematical psychology.
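Written out in Python, the general linear scheme above looks roughly like this (a = b gives the linear reward-penalty scheme; the default values for a and b are only examples):

```python
def update(p, i, beta, a=0.1, b=0.1):
    """General linear reinforcement scheme for the chosen action alpha_i."""
    r = len(p)
    if beta == 0:    # reward: increase p_i, scale the others down
        q = [(1 - a) * pj for pj in p]
        q[i] = p[i] + a * (1 - p[i])
    else:            # penalty: decrease p_i, redistribute to the others
        q = [b / (r - 1) + (1 - b) * pj for pj in p]
        q[i] = (1 - b) * p[i]
    return q
```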
For simplicity we consider the random environment to be stationary and use the linear reward-penalty scheme. It can be seen immediately that the limits of a probability pi for n → ∞ are either 0 or 1, so the robot learns to choose the optimal action asymptotically. It should be noted that the automaton does not always converge to the correct action, but the probability of converging to the wrong one can be made arbitrarily small by making the learning parameter a small.
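Putting the three sketches together gives a small simulation of the learning behaviour. The four penalty probabilities below are made up just to show that the action with the smallest ci ends up with the highest probability.

```python
env = StationaryEnvironment([0.7, 0.2, 0.9, 0.5])  # four actions, action 1 is best
p = learn([0.25, 0.25, 0.25, 0.25], env.respond, update, steps=5000)
print(p)  # p[1] should typically be the largest value by now
```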