Genetic Algorithm in Python to Generate File Converters
By Maxime Biais, Monday 5 January 2009 at 16:51 :: Python :: #72 :: rss

In an old post I wrote about a metaheuristic: particle swarm optimization (PSO). It was the simplest code to demonstrate what PSO looks like. Today I'm writing about genetic algorithm, another metaheuristic inspired by evolution of species. You can read the description of the algorithm on the genetic algorithm article on Wikipedia.
I applied the algorithm to a problem that is not really one: trying to help lazy programmers to write file converters. I had to write file converters to unify all (more or less) formated input files into one kind of CSV file. Each of these converters is made using the right combination of filtering / regexp matching / line splitting.
So I wrote a program based on genetic algorithm where an individual is composed by 3 "genes":
- a list of filters (remove consecutive spaces, replace tabs by spaces, ...)
- a list of basics regexp (int, float, date, line separator, ...)
- a list of cleaning functions (remove empty cell, merge cells, ...)
The fitness of an individual is calculated using a sample file to parse that goes through the genes (filters / regexp / cleaning functions). The result of this process is a CSV file that can be compared to the expected sample result (that I wrote manually). The comparison uses the SequenceMatcher of the difflib Python stantard module, it returns the fitness: a float. A fitness close to 1.0 means the results is close to the expected CSV format. When the fitness is exaclty 1.0, that means the converter works perfectly for the sample.
Here is a sample file I wanted to convert:
08/12 Hello world 06/12 Paris 121231231 - 22,29 08/12 Something something ... 04/12 1111 1111 1111 1111 - 14,35 08/12 something else - 12,96 26/11 Vir AAAAA AAAAAA 2008 264,51
into this csv file:
"08/12","Hello world 06/12","Paris 121231231","-22,29" "08/12","Something something ... 04/12","1111 1111 1111 1111","-14,35" "08/12","something else","","-12,96" "26/11","Vir AAAAA","AAAAAA 2008","264,51"
The program prints for each generation the 5 bests individual (I'm using elitism, so the best individual is always kept from a generation to the next one). A sample run:
$ python ga.py tests/sample1.txt tests/sample1.expected result1.pickled
Generation: 0 (mutation rate=10)
0.53772
0.42507
0.30406
0.28598
0.26389
Generation: 1 (mutation rate=10)
0.53772
0.42507
0.32684
0.30406
0.26799
...
Generation: 271 (mutation rate=15)
1.0
0.96025
0.95107
0.91779
0.87886
"08/12","Hello world 06/12","Paris 121231231","-22,29"
"08/12","Something something ... 04/12","1111 1111 1111 1111","-14,35"
"08/12","something else","","-12,96"
"26/11","Vir AAAAA","AAAAAA 2008","264,51"
filters: str_remove_consecutive_spaces, str_remove_somespaces, str_strip, str_remove_consecutive_spaces
regex: ([0-9]{2}/[0-9]{2})(.+?)(
)(.+?)(-?[0-9]+[.,]+[0-9]+)
cleaners: clean_strip, _in, _in, _in
Conclusion
The good:
- It's fun to see your program evolve

- You are lazy and don't want to write _many_ of this kind of file converters
- With a good interface and if the basic functions and genes size are well defined, a non-programmer can create his own parser (I think this is the only argument to use it).
The bad:
- The resulted parsers are not optimal
The ugly:
- Of course it's quite easy to write such genes by hand
- It may never converge to 1.0 and generated parsers of fitness != 1.0 are useless (this may not be the case for other kind of GA applications)
- You need many well chosen primitives to cover a wide set of solutions
Download
gaparser-0.1.tar.bz2: source code and examples.




Comments
1. Monday 5 January 2009 at 19:44, by Bill Tozier
2. Friday 9 January 2009 at 18:05, by Vasudev Ram
3. Saturday 10 January 2009 at 02:06, by Joseph D. Gutierrez
4. Thursday 29 January 2009 at 15:28, by Astonmartin
5. Thursday 29 October 2009 at 22:07, by Mark Essel
Write your comment