biais.org

Wednesday 17 September 2008

Ruby For a Python Programmer

I'm looking for a website comparing Ruby and Python idioms. I'm a Python programmer and I always use the same idioms to write programs (list and dict comprehensions, loops on slices, ...). I found a good resource that describes some similarities and differences between Ruby and Python, but I didn't find any list of Ruby idioms, so if you know of one, or if you are a Ruby programmer, please post a comment.

Friday 29 August 2008

EVOL-ution : Darwin's graffiti

Brilliant idea, thanks to Kriebel

Friday 13 June 2008

Russian Word Stress Dictionary

I've been trying to learn Russian for a few weeks. I was looking on the Internet for a Russian-English dictionary that marks the stress on Russian words, because this is the only tool I need to learn spoken Russian by myself. I found the eSpeak Project, which has worked on a Russian dictionary with the stress associated with each word. It's great, but it's annoying to look for a word in a big text file... That's why I wrote a small django+jquery frontend to query the dictionary easily. Also, I don't have a Russian keyboard, so I added a small transliteration tool to the interface.

You can access it here: http://www.biais.org/russian-stress/

EDIT: It doesn't work in IE, Get Firefox

EDIT2: It's now working in IE, anyhow Get Firefox

Note: The dictionary is not perfect but it contains about 220000 entries.
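
The transliteration helper mentioned above could look roughly like the sketch below. This is not the code behind the site: the TRANSLIT table and the transliterate function are hypothetical, and the real tool may well use a different mapping. It only shows the general idea of converting Latin input to Cyrillic by trying the longest key first.

```python
# Hypothetical Latin-to-Cyrillic table, for illustration only.
# The mapping used by the real tool is not shown in the post.
TRANSLIT = {
    "shch": "щ", "zh": "ж", "kh": "х", "ts": "ц", "ch": "ч", "sh": "ш",
    "yu": "ю", "ya": "я", "yo": "ё",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е", "z": "з",
    "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "h": "х",
    "y": "ы",
}
MAX_KEY = max(len(key) for key in TRANSLIT)


def transliterate(text):
    """Convert Latin input to Cyrillic, trying the longest key first."""
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            if chunk in TRANSLIT:
                out.append(TRANSLIT[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character (space, digit, punctuation): keep it as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("privet"))
```

Trying the longest key first matters because "shch" must be read as one letter and not as "sh" followed by "ch".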

Monday 26 May 2008

Screened Emacs Launcher

I'm used to running emacs from my shell, and my mind is not able to switch from the command emacs to emacsclient when I already have a window open. This is why I wrote this simple shell script that:

  • runs emacs (and forces server-start) in a detached screen with a particular id (emax) if this screen doesn't already exist
  • otherwise runs emacsclient (with the -n option: don't wait for the server to return)
[shell]
#!/bin/bash

# Look for an existing screen session named "emax".
if ! screen -list | grep -q emax; then
	echo "screening -- emacs $*"
	# "$@" is quoted so that file names containing spaces survive.
	screen -S emax -d -m emacs -f 'server-start' "$@"
else
	echo "connect to emacs server and detach -- emacs $*"
	emacsclient -n "$@"
fi

I prefer to get a separate emacs instance when I'm writing mail because I can focus on it. If you want a special case like this one, use this script instead:

[shell]
#!/bin/bash

# special case for mutt mail edition
if [[ "$1" =~ "/tmp/mutt" ]]; then
    echo "attached"
    detach=0
else
    echo "detached"
    detach=1
fi

# "$@" is quoted below so that file names containing spaces survive.
if ! screen -list | grep -q emax; then
    if [ "$detach" -eq 1 ]; then
	echo "screening -- emacs $*"
	screen -S emax -d -m emacs -f 'server-start' "$@"
    else
	echo "normal mode -- emacs $*"
	emacs -f 'mail-mode' "$@"
    fi
else
    if [ "$detach" -eq 1 ]; then
	echo "connect to emacs server and detach -- emacs $*"
	emacsclient -n "$@"
    else
	echo "connect to emacs server -- emacs $*"
	emacsclient "$@"
    fi
fi

Note: I also set a zsh alias so that emacs points to this script.

Saturday 5 April 2008

Machine Translation Techniques and Open Source

Today there are two main approaches to Machine Translation (MT):

  • Rule-based MT (used by a number of companies working in the domain: Systran, Reverso, etc.). The only open source software I know of that works with this approach is Apertium.
  • Statistical MT (used by Google and Language Weaver). Moses is an open source implementation of this approach. The training process also relies on other open source tools (for example, GIZA++ is an open source word aligner that Moses needs to prepare the corpus).

Pros and cons of rule-based machine translation

  • It needs rules, dictionaries (general and contextual) and people with the know-how (linguists) to write these rules and fill the dictionaries
  • Translation costs (CPU and memory) are fairly low

Pros and cons of statistical machine translation

  • It needs a big bilingual corpus and computer resources to run the training process
  • The bilingual corpus has to be clean (automatic pre-processing and human checking)
  • Translation costs are high
  • You can translate between any pair of languages, provided you have the corpus

Resources:

Notes: there are other, less used techniques: word-to-word substitution (Linguaphile) and example-based translation (I didn't find an open source implementation of this one). Of course, you can also imagine mixed techniques.
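
To make the simplest of these techniques concrete, here is a minimal sketch of word-to-word substitution. It is not how Linguaphile or any of the tools above work; the FR_EN dictionary and the translate_word_by_word function are made up for the example.

```python
# Toy French-to-English dictionary, made up for illustration only.
FR_EN = {
    "le": "the",
    "la": "the",
    "chat": "cat",
    "souris": "mouse",
    "mange": "eats",
}


def translate_word_by_word(sentence, dictionary):
    """Replace each word by its dictionary entry; unknown words are kept."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(translate_word_by_word("le chat mange la souris", FR_EN))
```

Nothing here reorders words or looks at context, which is exactly what the rules of the first approach and the corpus of the second one are there to handle.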

Monday 24 March 2008

ack: a better grep for programmers

ack is a grep-like tool for programmers. I'm used to running grep -R and find ... -exec grep to search for something in my code or in other people's code. But since I found ack, I have definitely switched to it when I code. ack website.

My favourite features:

  • Color highlighting of search results
  • Searches recursively through directories by default, while ignoring .svn, CVS and other VCS directories
  • Many command-line switches are the same as in GNU grep, so the transition is painless
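
The recursive search that skips VCS directories can be sketched in a few lines of Python. This is only an illustration of the idea, not how ack is implemented (ack is a Perl program); search_tree and the IGNORED_DIRS list are assumptions made for the example.

```python
import os
import re

# Assumed ignore list for the example; ack's own list is longer.
IGNORED_DIRS = {".svn", "CVS", ".git", ".hg"}


def search_tree(root, pattern):
    """Yield (path, line_number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it


if __name__ == "__main__":
    for path, number, line in search_tree(".", r"TODO"):
        print("%s:%d:%s" % (path, number, line))
```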

ack 1.78 is out

Saturday 1 March 2008

Two common database mistakes

A really well explained post about 2 database mistakes

Mistake #1: treating a database as a dumb object store. This is a really popular idea right now- Hibernate does this, as does Ruby on Rails, and a number of other ORM packages take this effective approach. On the other hand, dynamically typed languages are also really popular.

[...]

Mistake #2: file formats (and this includes marshalled data structures), are wire protocols, and need to be designed to be as abstract as possible- to reveal as little about the internal structure of the program as possible (preferrably none at all).

[...]

Tuesday 8 January 2008

StaticGenerator for Django: create static files for lightning fast performance

StaticGenerator is a Python class for Django that makes it easy to create static files for lightning fast performance. It accepts strings (URL), Models (class or instance), Managers, and QuerySets in a simple syntax.

StaticGenerator project page

The benchmark seems very favorable for StaticGenerator against Django cached data.
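
The idea behind this kind of tool can be shown without Django at all: render a page once, write it under the web server's document root, and let the server deliver the file without calling the application. The sketch below is not StaticGenerator's API; publish_static and its parameters are hypothetical, and it assumes that each URL path maps to an index.html file.

```python
import os


def publish_static(web_root, url_path, html):
    """Write html under web_root so that url_path can be served as a file."""
    relative = url_path.strip("/")
    target_dir = os.path.join(web_root, relative) if relative else web_root
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, "index.html")
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(html)
    return target
```

A real implementation would also need to delete or rewrite the file when the underlying data changes, and to reject paths that point outside web_root.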

Wednesday 19 December 2007

CAPTCHA resistance test: results

In this blog post, I wanted to test the spammers' crawlers. Duration of the experiment: 2007-02-08 to 2007-12-18, more than 10 months. Some results:

  • 114 MB of pure spam
  • 17443 mails (average: 150 / day during the last 2 months)
  • 9498 mails in the ceresistan mailbox (with the mailto: link)
  • 7945 mails in the recetansis mailbox (text only)
  • 0 mails in the other mailboxes (fortunately, spammers don't use visual CAPTCHA breakers today)
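
As a quick sanity check on the figures above, the two mailbox counts do add up to the total, and the dates give the exact length of the test. The variable names below are only for the example.

```python
from datetime import date

mailto_mails = 9498   # ceresistan mailbox (mailto: link)
text_mails = 7945     # recetansis mailbox (text only)

total = mailto_mails + text_mails
days = (date(2007, 12, 18) - date(2007, 2, 8)).days

print("total mails:", total)
print("days of test:", days)
print("average per day over the whole test: %.1f" % (total / days))
print("share caught by the mailto: link: %.1f%%" % (100 * mailto_mails / total))
```

The average over the whole test comes out well below the 150 / day quoted for the last two months, which says that the volume grew over time.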

A chart of the number of spam (both mailboxes) received per day during the test:

I forgot to test this one:

  • retancesis (at) biais (dot) org

Wednesday 14 November 2007

Natural Language Tokenizer That Keeps Track Of Token Locations

I'm using nltk for a personal project. It's a great library providing many tools for natural language processing. It provides different kinds of tokenizers, but these tokenizers only split a string into substrings without keeping track of location or other useful metadata. I needed the location of each token (line and column number) in the original text, so I wrote this simple tokenizer imitating the function nltk.wordpunct_tokenize:

import re

def wordpunct_tokenize_position(stream):
    """
    Tokenize and store location of tokens from a stream or a string
    >>> list(wordpunct_tokenize_position('nltk is great'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('great', (0, 8))]
    >>> list(wordpunct_tokenize_position('nltk\\nis\\ngreat'))
    [('nltk', (0, 0)), ('is', (1, 0)), ('great', (2, 0))]
    >>> list(wordpunct_tokenize_position('nltk is nltk'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('nltk', (0, 8))]

    """
    # str on Python 3; the original Python 2 version tested basestring
    if isinstance(stream, str):
        sourceiterable = stream.splitlines() # not an iterator
    else:
        sourceiterable = stream.readlines()  # file-like object
    # a token is either a run of word characters or a run of punctuation
    regex = re.compile(r'(\w+|[^\w\s]+)')
    for line_number, line in enumerate(sourceiterable):
        for match in regex.finditer(line):
            # location is (line number, column), both zero-based
            yield match.group(1), (line_number, match.start())

if __name__ == "__main__":
    import doctest
    doctest.testmod()