Although this task may look rather straightforward – use A* or a similar algorithm – real life can make the problem much more complex: you have to find an exit from a maze when you do not know where it is.
You can only observe the maze (or grid world, a term widely used in the RL community) at your current location, and to some extent at places you have already visited. You do not know the size of the maze or where the exit is located. This makes the A* family of algorithms useless.
The grid world's environment provides you with some reward, such as -1 for every step and +100 for reaching the exit, and it can also suddenly move you or affect you in some other way (consider a sudden gust of wind which moves your helicopter, or a road slope which affects how your car moves).
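Such an environment can be sketched in a few lines. The sizes, reward values, and the "wind" probability below are illustrative assumptions matching the description above, not the actual implementation:

```python
import random

class GridWorld:
    """Minimal maze environment: -1 reward per step, +100 at the exit.
    With probability `wind`, the agent is pushed one extra cell down,
    mimicking a disturbance such as a sudden gust of wind."""

    def __init__(self, width=5, height=5, exit_pos=(4, 4), wind=0.1, seed=0):
        self.width, self.height = width, height
        self.exit_pos = exit_pos
        self.wind = wind
        self.rng = random.Random(seed)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        # Actions: 0 = up, 1 = right, 2 = down, 3 = left.
        dx, dy = [(0, -1), (1, 0), (0, 1), (-1, 0)][action]
        x = min(max(self.pos[0] + dx, 0), self.width - 1)
        y = min(max(self.pos[1] + dy, 0), self.height - 1)
        if self.rng.random() < self.wind:  # environment disturbs the agent
            y = min(y + 1, self.height - 1)
        self.pos = (x, y)
        if self.pos == self.exit_pos:
            return self.pos, 100, True     # exit found, episode over
        return self.pos, -1, False         # step penalty, keep going
```

Note that the agent only ever sees its current position from `step`; it never gets the maze layout or the exit coordinates, which is exactly what rules out A*.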
We created a rather simple solver for these kinds of worlds using reinforcement learning. It is still exponential in the size of the maze, and it requires quite a lot of steps to converge, but this is only the beginning.
Every path the agent takes is folded into a global value map using Bellman's equation; it takes about 100k random episodes to complete (the arrow shows the steepest way toward the discovered, most rewarded destination point). Since every episode runs from the beginning to the end, following either the steepest descent learned so far or some random exploration path, this is equivalent to the Monte-Carlo search method. It is rather slow to converge, and it should be replaced with the temporal-difference TD(lambda) algorithm, which we expect to be about 100 times faster.
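The difference between the two approaches can be sketched as follows. This is our own illustrative code, not the solver itself: the Monte-Carlo rule can only update values after a complete episode, while a one-step TD update (TD(0), the lambda = 0 case of TD(lambda)) bootstraps from the Bellman target on every single transition:

```python
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """Update the value table from a *complete* episode of
    (state, reward) pairs. The return G is only known once the run
    has finished, which is why convergence needs so many episodes."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                 # discounted return from this state
        V[state] += alpha * (G - V[state])     # move estimate toward the return

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One-step temporal-difference update: the Bellman target
    bootstraps from V[next_state], so every transition improves the
    estimate immediately instead of waiting for the episode to end."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```

TD(lambda) interpolates between these two extremes with eligibility traces; the speedup we expect comes from propagating reward information along the path after each step rather than after each full run.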