
The problem. The mine field.

I took the idea for this problem from the book "Machine Learning" by Tom M. Mitchell, where he discusses reinforcement learning in chapter 13.

Consider the following hypothetical situation, illustrated in Figure 1.

Figure 1. The mine M is determined by the attribute D, which is the level of destruction, and by the states A and Pa. The state A is true when the mine is active and false otherwise. Pa is a probability function of t whose value is the probability that A is true at timestamp To; simply put, the probability of being blown up when one steps on M.

An agent L must go through the field F, starting at point E and ending at point G. L is defined by three states: i,j (its location on F) and C (its current condition after the last damage). F is composed of cells with coordinates i,j (i,j in [0..5]). There are mines spread all over the place, i.e., for every i,j there exists one and only one mine Mij. A mine is determined by the attribute D, which is the level of damage it could inflict on L if it explodes. A mine also has two states, A and Pa. The state A is true when the mine is active and false otherwise, and Pa is a probability function defined on t whose value is the probability that A is true at timestamp To. If L steps on M while M is active, the consequence is damage of level D.
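The definitions above can be sketched in code. This is only a minimal sketch under stated assumptions: the text does not say how damage changes C, so subtracting D from C is an assumption, and the class names, the helper step_on, and the constant activation probability 0.1 are hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable

SIZE = 6  # cells have coordinates i, j in 0..5


@dataclass
class Mine:
    D: float                    # level of damage if the mine explodes
    Pa: Callable[[int], float]  # probability that the mine is active at time t

    def is_active(self, t: int, rng: random.Random) -> bool:
        # Sample the state A at time t according to Pa(t).
        return rng.random() < self.Pa(t)


@dataclass
class Agent:
    i: int     # row of the current cell on F
    j: int     # column of the current cell on F
    C: float   # current condition


def step_on(agent: Agent, mine: Mine, t: int, rng: random.Random) -> bool:
    """Apply the consequence of standing on `mine` at time t.

    Returns True if the mine exploded. Assumption: an explosion
    lowers the condition C by exactly D.
    """
    if mine.is_active(t, rng):
        agent.C -= mine.D
        return True
    return False


# Hypothetical field: one mine per cell, all with D = 1 and a constant Pa.
field = [[Mine(D=1.0, Pa=lambda t: 0.1) for _ in range(SIZE)]
         for _ in range(SIZE)]
```

A time-dependent Pa can be modeled by passing any function of t, which is the part of the problem that makes waiting on a cell a meaningful choice.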

The information available to L is its own situation, the locations of E and G, the fact that there is one mine per cell, the current time given by a clock, and an idea of the probability distribution of the activation of the mines.

The mission is to cross the field from E to G while keeping the condition C as high as possible. Observe that the more time L spends on the field, the more damage it is exposed to. In the next unit of time, the agent can move to any adjacent cell or even stay on the same cell (waiting for better "weather"). We take the time L needs to jump one cell as one unit of time, so every cell can be assigned a time cost: the time L would need to reach G from that cell if there were no mines at all, taking the shortest path. In the picture, L is in a cell with time cost 2. Observe also that when L selects a path, its decision has to account for the time the path takes, because the whole situation can change in the meantime (remember the relation between Pa and t).
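The time cost of a cell can be computed directly. The sketch below assumes that "any adjacent cell" includes the diagonals, so the shortest mine-free path to G takes max(|di|, |dj|) steps; if only horizontal and vertical moves were allowed, the cost would be |di| + |dj| instead. The function names and the position chosen for G in the usage note are hypothetical, since the picture is not reproduced here.

```python
SIZE = 6  # cells have coordinates i, j in 0..5


def time_cost(cell, goal):
    """Units of time needed to reach `goal` from `cell` with no mines.

    Assumption: diagonal jumps are allowed, so one jump can reduce
    both coordinate differences at once.
    """
    di = abs(cell[0] - goal[0])
    dj = abs(cell[1] - goal[1])
    return max(di, dj)


def moves(cell):
    """Cells L can occupy in the next unit of time.

    Includes the current cell itself (staying put) and every
    neighbouring cell that lies inside the field.
    """
    i, j = cell
    return [(i + di, j + dj)
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
            if 0 <= i + di < SIZE and 0 <= j + dj < SIZE]
```

For example, with G placed hypothetically at (5, 5), the cell (3, 4) has time cost 2, and every cell returned by moves((3, 4)) has a time cost between 1 and 3.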

Well, L has a lot of information, but time is always running out and the "body" can take no more. What to do?
