biais.org

Monday 5 January 2009

Genetic Algorithm in Python to Generate File Converters

In an old post I wrote about a metaheuristic: particle swarm optimization (PSO). It was the simplest code to demonstrate what PSO looks like. Today I'm writing about genetic algorithm, another metaheuristic inspired by evolution of species. You can read the description of the algorithm on the genetic algorithm article on Wikipedia.

I applied the algorithm to a problem that is not really one: trying to help lazy programmers to write file converters. I had to write file converters to unify all (more or less) formated input files into one kind of CSV file. Each of these converters is made using the right combination of filtering / regexp matching / line splitting.

So I wrote a program based on genetic algorithm where an individual is composed by 3 "genes":

  • a list of filters (remove consecutive spaces, replace tabs by spaces, ...)
  • a list of basics regexp (int, float, date, line separator, ...)
  • a list of cleaning functions (remove empty cell, merge cells, ...)

The fitness of an individual is calculated using a sample file to parse that goes through the genes (filters / regexp / cleaning functions). The result of this process is a CSV file that can be compared to the expected sample result (that I wrote manually). The comparison uses the SequenceMatcher of the difflib Python stantard module, it returns the fitness: a float. A fitness close to 1.0 means the results is close to the expected CSV format. When the fitness is exaclty 1.0, that means the converter works perfectly for the sample.

Here is a sample file I wanted to convert:

08/12   	 Hello world 06/12
Paris 121231231 	  	- 22,29
08/12 	Something something ... 04/12
1111 1111 1111 1111 	  	- 14,35
08/12 	something else
	  	- 12,96
26/11 	Vir AAAAA
AAAAAA 2008	  	264,51

into this csv file:

"08/12","Hello world 06/12","Paris 121231231","-22,29"
"08/12","Something something ... 04/12","1111 1111 1111 1111","-14,35"
"08/12","something else","","-12,96"
"26/11","Vir AAAAA","AAAAAA 2008","264,51"

The program prints for each generation the 5 bests individual (I'm using elitism, so the best individual is always kept from a generation to the next one). A sample run:

$ python ga.py tests/sample1.txt tests/sample1.expected result1.pickled
Generation: 0 (mutation rate=10)
0.53772
0.42507
0.30406
0.28598
0.26389
Generation: 1 (mutation rate=10)
0.53772
0.42507
0.32684
0.30406
0.26799

...

Generation: 271 (mutation rate=15)
1.0
0.96025
0.95107
0.91779
0.87886
"08/12","Hello world 06/12","Paris 121231231","-22,29"
"08/12","Something something ... 04/12","1111 1111 1111 1111","-14,35"
"08/12","something else","","-12,96"
"26/11","Vir AAAAA","AAAAAA 2008","264,51"

filters: str_remove_consecutive_spaces, str_remove_somespaces, str_strip, str_remove_consecutive_spaces
regex: ([0-9]{2}/[0-9]{2})(.+?)(
)(.+?)(-?[0-9]+[.,]+[0-9]+)
cleaners: clean_strip, _in, _in, _in
Conclusion

The good:

  • It's fun to see your program evolve ;)
  • You are lazy and don't want to write _many_ of this kind of file converters
  • With a good interface and if the basic functions and genes size are well defined, a non-programmer can create his own parser (I think this is the only argument to use it).

The bad:

  • The resulted parsers are not optimal

The ugly:

  • Of course it's quite easy to write such genes by hand
  • It may never converge to 1.0 and generated parsers of fitness != 1.0 are useless (this may not be the case for other kind of GA applications)
  • You need many well chosen primitives to cover a wide set of solutions
Download

gaparser-0.1.tar.bz2: source code and examples.

Wednesday 17 September 2008

Ruby For a Python Programmer

I'm looking for a website comparing Ruby and Python idioms. I'm a Python programmer and I always use the same idioms to write programs (list and dict comprehension, loops on slices, ...). I found a good resource that describes some similarities and differences between Ruby and Python but I didn't find any idioms list in Ruby, so if you know one or if you are a Ruby programmer, please post a comment.

Wednesday 14 November 2007

Natural Language Tokenizer That Keeps Track Of Token Locations

I'm using nltk for a personal project. It's a great library providing many tools for natural language processing. It provides different kinds of tokenizers but these tokenizers only cut string into substring without keeping track of location or other useful metadata. I needed to have tokens location (line and column number of the token) in the original text so I wrote this simple tokenizer imitating the function nltk.wordpunct_tokenize:

import re
 
def wordpunct_tokenize_position(stream):
    """
    Tokenize and store location of tokens from a stream or a string
    >>> list(wordpunct_tokenize_position('nltk is great'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('great', (0, 8))]
    >>> list(wordpunct_tokenize_position('nltk\\nis\\ngreat'))
    [('nltk', (0, 0)), ('is', (1, 0)), ('great', (2, 0))]
    >>> list(wordpunct_tokenize_position('nltk is nltk'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('nltk', (0, 8))]
 
    """
    if isinstance(stream, basestring):
        sourceiterable = stream.splitlines() # not an iterator
    else:
        sourceiterable = stream.readlines()
    regex = re.compile(r'(\w+|[^\w\s]+)')
    for line_number, line in enumerate(sourceiterable):
        for match in regex.finditer(line):
            yield match.group(1), (line_number, match.start())
 
if __name__ == "__main__":
    import doctest
    doctest.testmod()

Monday 12 November 2007

Reinteract, A Better interactive Python (in GTK)

Reinteract is a good alternative to Python shell or IPython, watch the screencast. Excepting graphical and "GUI objects" inclusion, it's easy to reproduce the behavior with emacs + mode-python and the py-execute-region function.

You can get it from the git repository:

git clone git://git.fishsoup.net/reinteract

Blog post announce

Thursday 18 October 2007

How to write a resume OR CV OR Vitae to be spotted by Google HR

I got a script running on this server that try to look for unusual referrers in my apache access.log, recently it found this line:

65.57.245.11 - - [xx/xx/xxxx:xx:xx:xx +0200] "GET /blog/data
/cv-maxime_biais-en.pdf HTTP/1.0" 200 54736 "http://www.googl
e.com/custom?q=intitle:(resume+OR+CV+OR+Vitae)+%22C%2B%2B%22+
%22Software+(Engineer+OR+Architect+OR+Programmer)%22+Python&h
l=en&client=pub-6116397745508461&cof=FORID:1%3BAH:left%3BCX:S
earch4Candidates%3BL:http://www.google.com/coop/images/google
_custom_search_sm.gif%3BLH:55%3BLP:1%3BGFNT:%23666666%3BDIV:%
23cccccc%3B&cx=003801465318207546779:5c5132mwnqy&adkw=AELymgW
0pVdk7NwNw7cXNIpOpKa3-mkuy3nVruyomR-uq_cmYbfti5RND-XpsliIWGfC
Ed_NEB3f--XthNt0VqWxOO4eTnZ5-Nc_XrqU_tmkWrslLS52BtRbBtRAgj_Dn
MTXeOGVw5_qjTNkBNF60wvDmXltsGdSSw&start=250&sa=N" "Mozilla/5.
0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7) Gecko/20070
914 Firefox/2.0.0.7"

So what ?:

  • The IP: 65.57.245.11 is the gateway of Google mountain view employees.
  • The referrer is a custom search engine and it searches for webpages that contains (resume OR CV OR Vitae) in title, and C++ and Python and (Software Engineer OR Architect OR Programmer) in everything.

Try the referer url.

Now, you know what to write in your resume title if you want Google HR to find you.

Jablit documentation

A small but sufficient documentation to configure, run and use jablit.

Tuesday 16 October 2007

Jablit: text mode jabber (google talk) client

I was looking for a good text mode jabber client to talk with my friends / distant co-worker. I only found centericq but I don't like the interface, so I begin to write a very small text mode client : jablit. The early first release is not documented but i'm using all day long, it's stable and perfectly fits my needs ( I hope it will fit your ;) ).

Wednesday 5 September 2007

Viterbi algorithm variant in Python

A variant of the Viterbi Algorithm returning the less energy path of a matrix. Example:

 
M = [[4, 1, 3],
     [3, 4, 2],
     [2, 1, 4]]

This matrix M is projected to the following virtual trellis:

In this example, the path' energy from top to bottom [1, 1, 1] (passing the node 1 then 4 then 1) is 6 ( = abs(1-4) + abs(4-1)).

def viterbi(matrix):
    """Viterbi algorithm : find the best 8-connected path in a matrix.
    
    Example, here the best path from top to bottom is to follow the "4"s 
    (diagonal), the energy of the path is 0
    >>> viterbi([[4, 1, 3], \
                 [3, 4, 2], \
                 [2, 1, 4]])
    [0, 1, 2]
    
    On the next example, 2 best paths, only the first found is returned
    >>> viterbi([[1, 2, 2], \
                 [1, 1, 2], \
                 [1, 2, 1]])
    [0, 0, 0]
    
    A basic example. Best path energy: 1
    >>> viterbi([[1, 0], \
                 [4, 2]])
    [0, 1]
    """
    # init
    h = len(matrix) - 1
    w = len(matrix[0])
    k = 0
    S = {}      #Survivors
    L = {}      #Accumulate Length
    for i in range(w):
        L[(k, i)] = 0
 
    # recursion
    for k in range(h):
        for x in range(w):
            for dx in range(-1, 2):
                if (x + dx) < 0 or (x + dx) >= w:
                    continue
                tmpl = L[k, x] + abs(matrix[k][x] - matrix[k+1][x+dx])
                if (not (k+1, x+dx) in L) or (tmpl < L[(k+1, x+dx)]):
                    L[(k+1, x+dx)] = tmpl
                    S[(k+1, x+dx)] = x
 
    # look for the smallest
    tmp, idx = -1, -1
    for i in range(w):
        if L[(h, i)] < tmp or tmp == -1:
            tmp = L[(h, i)]
            idx = i
 
    # create the inversed path
    res = [idx]
    for k in range(h, 0, -1):
        idx = S[(k, idx)]
        res.append(idx)
 
    # reverse the path
    res.reverse()
    return res

Notes:

Monday 25 June 2007

How to write Caml extensions for Python

I wanted to write a simple function in Caml, compile it and load it as a Python module, but it was a bit more complex that I thought. I tried to use pycaml or py2caml to avoid writing C code; without success. Then I read the Extending Python With C chapter in the Python doc and the Interfacing C with Objective Caml chapter from OCaml manual.

There is 4 main steps to write this example module:

  1. Write the Caml function you want to use in your Python program (in the example file mop.ml)
  2. Write the C code to wrap the Caml code (mop_mlwrapper.c)
  3. Write the C code to declare the Python module (mop_module.c)
  4. Compile all and link everything together (not the easiest when you're not familiar with Caml compilation process)

Caml code:

let square a =  a * a;;

(* declare square function as name "square" to call it from C code *)
let _ = Callback.register "square" square;;

C code to wrap the Caml function:

#include "caml/callback.h"
 
void initcaml(void)
{
  char *dummy[] = {0};
 
  // Initialize the Caml interpreter
  caml_startup(dummy);
}
 
int mlmop(int a)
{
  int res;
  static value *f = NULL;
  value r_caml;
 
  // Get a pointer on the caml code "square"
  f = caml_named_value("square");
  if (f == NULL)
    return -1;
  // Run the Caml function pointed by f
  r_caml = callback(*f, Val_int(a));
  // Convert the return value to an integer
  res = Int_val(r_caml);
  return res;
}

C code, Python extension:

#include "Python.h"
 
extern int mlmop(int);
extern void initcaml();
 
static PyObject *
testmop(PyObject *self, PyObject *args)
{
  int arg, res;
 
  if (!PyArg_ParseTuple(args, "i", &arg))
    return NULL;
 
  res = mlmop(arg);
  // convert to a Python int
  return Py_BuildValue("i", res);
}
 
static PyMethodDef MopMethods[] = {
  {"testmop", testmop, METH_VARARGS, "test function calling caml code"},
  {NULL, NULL, 0, NULL}        /* Sentinel */
};
 
 
// function needed to initialize the module and call the caml init
PyMODINIT_FUNC
initmop(void)
{
  (void) Py_InitModule("mop", MopMethods);
  initcaml();
}

The Makefile:

  1. Compile the ml file in object file (-output-obj option)
  2. Compile both C files in object files
  3. Link all object files in a shared library. The library must be linked with dependency libraries : libcamlrun, libunix, libm, libcurses
PY_PREFIX=/usr
PY_VERSION=2.4

LIBS = -L/usr/lib/ocaml/3.09.2/ -L$(PY_PREFIX)/lib/python$(PY_VERSION)/config -lpython$(PY_VERSION) -lunix  -lcamlrun -lm -lcurses 
INCS = -I$(PY_PREFIX)/include/python$(PY_VERSION) -I/usr/lib/ocaml/3.09.2/

all: mop.so

mop_module.o: mop_module.c
	gcc -ggdb -Wall -c mop_module.c $(INCS) -o mop_module.o

mop_mlwrapper.o: mop_mlwrapper.c
	ocamlc -ccopt -Wall -c mop_mlwrapper.c

mop.o: mop.ml
	ocamlc -o mop.o -output-obj -ccopt -fPIC mop.ml

# Note: under MacOS X replace -shared by -dynlib
mop.so: mop_module.o mop_mlwrapper.o mop.o
	gcc -shared --whole-archive mop_mlwrapper.o mop.o mop_module.o $(LIBS) -o mop.so

test: mop.so
	python$(PY_VERSION) test.py

test2: mop.so
	python$(PY_VERSION) -c "import mop; print 'square(2) =', mop.testmop(2)"

clean:
	rm -rf *o *so *cmi *cmo

After the compilation process, when mop.so is built, you can run import it as a Python module:

import mop
 
for i in range(15):
    print mop.testmop(i)

Here is the archive of the Caml, C and Python code and Makefile.

Next step: the same example with pycaml or py2caml.

Tuesday 3 April 2007

Plotmoi: plot anything in svg

I started a new project called plotmoi, hosted by code.google: http://code.google.com/p/plotmoi/. The goal is to plot any formated or non formated data. After some hours of coding plotmoi can generate interesting svg without pain. For example, I extracted from a database the following table in tests/reallife_data_5.data:

   size   | count 
 ---------+-------
        1 | 73719
       17 | 58136
       21 | 51949
       19 | 47431
       16 | 47027
       15 | 46303
       24 | 44450
       25 | 44388
 ...
 (25673 rows)

Then run plotmoi:

$ python src/plotmoi.py tests/reallife_data_5.data plotmoi_loglog_example.svg

It generated this svg file (log/log scale). bitmap preview:

You can retrieve sources via subversion:

svn checkout http://plotmoi.googlecode.com/svn/trunk/ plotmoi