biais.org

Saturday 5 April 2008

Machine Translation Techniques and Open Source

Today there are two main approaches to machine translation (MT):

  • Rule-based MT (used by a number of companies working in the domain: Systran, Reverso, etc.). The only open source software I know of that uses this approach is Apertium.
  • Statistical MT (used by Google and Language Weaver). Moses is an open source implementation of this approach. The training process also relies on other open source components (for example, GIZA++ is an open source word aligner that Moses needs to prepare the corpus).
Pros and cons of rule-based machine translation
  • It needs rules, dictionaries (general and contextual) and people with the know-how (linguists) to write these rules and fill the dictionaries.
  • Translation costs (CPU and memory) are fairly low.
Pros and cons of statistical machine translation
  • It needs a large bilingual corpus and computing resources to run the training process.
  • The bilingual corpus has to be clean (automatic preprocessing and human checking).
  • Translation costs are heavy.
  • You can translate between any pair of languages, provided you have a corpus for it.
Resources:

Notes: there are other, less used techniques: word-to-word substitution (Linguaphile) and example-based translation (I didn't find an open source implementation of this one). Of course, you can also imagine mixed techniques.
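
To make the word-to-word substitution idea concrete, here is a minimal sketch in Python. The tiny English-to-French dictionary and the function name are invented for illustration; this is not how Linguaphile or any real system is implemented.

```python
# Toy word-for-word substitution "translator" (illustration only).
# The dictionary below is a made-up English-to-French sample.
TOY_DICTIONARY = {
    "the": "le",
    "cat": "chat",
    "eats": "mange",
    "fish": "poisson",
}


def substitute_words(sentence, dictionary):
    """Replace each known word by its dictionary entry; keep unknown words as-is."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(substitute_words("The cat eats the fish", TOY_DICTIONARY))
# le chat mange le poisson
```

Such a substitution ignores word order, agreement and ambiguity, which is why the rule-based and statistical approaches exist.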

Monday 24 March 2008

ack: a better grep for programmers

ack is a grep-like tool for programmers. I used to run grep -R and find ... -exec grep to search for something in my code or in other people's code. Since I found ack, I have definitely switched to it when I code. ack website.

My favourite features:

  • Color highlighting of search results
  • Searches recursively through directories by default, while ignoring .svn, CVS and other VCS directories
  • Many command-line switches are the same as in GNU grep, so the transition is painless
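
To show what "recursive search that ignores VCS directories" means in practice, here is a rough Python sketch of that behavior. It is not ack's code (ack is a Perl program); the function name and the short ignore list are assumptions made for the example.

```python
import os
import re

# Directories skipped during the walk. This short list is an assumption
# for the example; ack's real ignore list is longer.
IGNORED_DIRS = {".svn", "CVS", ".git", ".hg"}


def search_tree(pattern, root="."):
    """Yield (path, line_number, line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for line_number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, line_number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it

# Example use:
#   for path, line_number, line in search_tree(r"TODO"):
#       print(f"{path}:{line_number}:{line}")
```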

ack 1.78 is out

Saturday 1 March 2008

Two common database mistakes

A really well explained post about two database mistakes:

Mistake #1: treating a database as a dumb object store. This is a really popular idea right now- Hibernate does this, as does Ruby on Rails, and a number of other ORM packages take this effective approach. On the other hand, dynamically typed languages are also really popular.

[...]

Mistake #2: file formats (and this includes marshalled data structures), are wire protocols, and need to be designed to be as abstract as possible- to reveal as little about the internal structure of the program as possible (preferrably none at all).

[...]

Tuesday 8 January 2008

StaticGenerator for Django: create static files for lightning fast performance

StaticGenerator is a Python class for Django that makes it easy to create static files for lightning fast performance. It accepts strings (URL), Models (class or instance), Managers, and QuerySets in a simple syntax.

StaticGenerator project page

The benchmark looks very favorable to StaticGenerator compared with Django's cached data.
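
The general idea of static generation can be sketched in a few lines of Python. This is not StaticGenerator's API: the function publish_static, its parameters and the index.html layout are hypothetical, chosen only to illustrate rendering a page once and storing the result on disk.

```python
import os


def publish_static(url_path, render, web_root):
    """Render the page for url_path once and store it as a static file.

    render is any callable returning the HTML for a URL path (in Django it
    could wrap a view). All names here are hypothetical.
    """
    # "/blog/post/" is stored as "<web_root>/blog/post/index.html".
    relative = url_path.strip("/")
    target_dir = os.path.join(web_root, relative) if relative else web_root
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, "index.html")
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(render(url_path))
    return target
```

Once such a file exists, a front-end web server can serve it without running any Python, which is presumably where the speed comes from.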

Wednesday 19 December 2007

CAPTCHA resistance test: results

In this blog post, I wanted to test spammers' crawlers. Test period: 2007-02-08 to 2007-12-18, more than 10 months. Some results:

  • 114 MB of pure spam
  • 17443 mails (average: 150 / day during the last 2 months)
  • 9498 mails in the ceresistan mailbox (with the mailto: link)
  • 7945 mails in the recetansis mailbox (text only)
  • 0 mails in the other mailboxes (fortunately, spammers don't use visual CAPTCHA breakers today)
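
As a quick sanity check on these figures, the arithmetic can be redone in a few lines of Python. The numbers are the ones reported above; the variable names are mine.

```python
from datetime import date

# Figures reported in the list above.
start, end = date(2007, 2, 8), date(2007, 12, 18)
mailto_mails = 9498      # ceresistan mailbox (mailto: link)
text_only_mails = 7945   # recetansis mailbox (text only)

total_mails = mailto_mails + text_only_mails
days = (end - start).days

print(total_mails)                                  # 17443
print(days)                                         # 313
print(round(total_mails / days, 1))                 # 55.7
print(round(100 * mailto_mails / total_mails, 1))   # 54.5
```

The overall average is about 56 mails per day, well below the 150 per day of the last two months, which suggests the spam rate grew during the test.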

A chart of the number of spam mails (both mailboxes) received per day during the test:

I forgot to test this one:

  • retancesis (at) biais (dot) org

Wednesday 14 November 2007

Natural Language Tokenizer That Keeps Track Of Token Locations

I'm using nltk for a personal project. It's a great library providing many tools for natural language processing. It provides different kinds of tokenizers, but these tokenizers only cut a string into substrings without keeping track of locations or other useful metadata. I needed the location of each token (line and column number) in the original text, so I wrote this simple tokenizer imitating the function nltk.wordpunct_tokenize:

import re
 
def wordpunct_tokenize_position(stream):
    """
    Tokenize and store location of tokens from a stream or a string
    >>> list(wordpunct_tokenize_position('nltk is great'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('great', (0, 8))]
    >>> list(wordpunct_tokenize_position('nltk\\nis\\ngreat'))
    [('nltk', (0, 0)), ('is', (1, 0)), ('great', (2, 0))]
    >>> list(wordpunct_tokenize_position('nltk is nltk'))
    [('nltk', (0, 0)), ('is', (0, 5)), ('nltk', (0, 8))]
 
    """
    # Accept either a string or a file-like object
    # (str is used here because basestring only exists in Python 2).
    if isinstance(stream, str):
        sourceiterable = stream.splitlines()
    else:
        sourceiterable = stream.readlines()
    regex = re.compile(r'(\w+|[^\w\s]+)')
    for line_number, line in enumerate(sourceiterable):
        for match in regex.finditer(line):
            yield match.group(1), (line_number, match.start())
 
if __name__ == "__main__":
    import doctest
    doctest.testmod()
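
Here is a small usage sketch showing what the positions are good for: pointing at a token in the original text. The tokenizer is repeated in compact form so the snippet runs on its own; underline_token is a hypothetical helper written for this example, not part of nltk.

```python
import io
import re


def wordpunct_tokenize_position(stream):
    """Same logic as the tokenizer above, repeated so this snippet runs alone."""
    lines = stream.splitlines() if isinstance(stream, str) else stream
    regex = re.compile(r'(\w+|[^\w\s]+)')
    for line_number, line in enumerate(lines):
        for match in regex.finditer(line):
            yield match.group(1), (line_number, match.start())


def underline_token(text, wanted):
    """Return the first line containing `wanted`, with a caret row under the token."""
    lines = text.splitlines()
    # io.StringIO stands in for an open file to exercise the file-like branch.
    for token, (line_number, column) in wordpunct_tokenize_position(io.StringIO(text)):
        if token == wanted:
            return lines[line_number] + "\n" + " " * column + "^" * len(token)
    return None


print(underline_token("nltk is great\nso is Python", "Python"))
# so is Python
#       ^^^^^^
```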

Tuesday 13 November 2007

Django schema evolution (or schema migration)

django-evolution is an answer to the schema evolution problem for Django models. A quote from the website:

Django Evolution is an extension to Django that allows you to track changes in your models over time, and to update the database to reflect those changes.

Notes:

  • A wiki page on the Django website discussing how to implement schema evolution.
  • Another tool for Django migrations: DbMigration

Monday 12 November 2007

Reinteract, A Better interactive Python (in GTK)

Reinteract is a good alternative to the Python shell or IPython; watch the screencast. Apart from the inclusion of graphics and "GUI objects", it's easy to reproduce the behavior with Emacs + python-mode and the py-execute-region function.

You can get it from the git repository:

git clone git://git.fishsoup.net/reinteract

Blog post announcement

Thursday 18 October 2007

How to write a resume OR CV OR Vitae to be spotted by Google HR

I have a script running on this server that looks for unusual referrers in my Apache access.log. Recently it found this line:

65.57.245.11 - - [xx/xx/xxxx:xx:xx:xx +0200] "GET /blog/data
/cv-maxime_biais-en.pdf HTTP/1.0" 200 54736 "http://www.googl
e.com/custom?q=intitle:(resume+OR+CV+OR+Vitae)+%22C%2B%2B%22+
%22Software+(Engineer+OR+Architect+OR+Programmer)%22+Python&h
l=en&client=pub-6116397745508461&cof=FORID:1%3BAH:left%3BCX:S
earch4Candidates%3BL:http://www.google.com/coop/images/google
_custom_search_sm.gif%3BLH:55%3BLP:1%3BGFNT:%23666666%3BDIV:%
23cccccc%3B&cx=003801465318207546779:5c5132mwnqy&adkw=AELymgW
0pVdk7NwNw7cXNIpOpKa3-mkuy3nVruyomR-uq_cmYbfti5RND-XpsliIWGfC
Ed_NEB3f--XthNt0VqWxOO4eTnZ5-Nc_XrqU_tmkWrslLS52BtRbBtRAgj_Dn
MTXeOGVw5_qjTNkBNF60wvDmXltsGdSSw&start=250&sa=N" "Mozilla/5.
0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.7) Gecko/20070
914 Firefox/2.0.0.7"

So what?

  • The IP 65.57.245.11 is the gateway of Google's Mountain View employees.
  • The referrer is a custom search engine, and it searches for web pages that contain (resume OR CV OR Vitae) in the title, and C++ and Python and (Software Engineer OR Architect OR Programmer) anywhere in the page.
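
The search terms are easy to read once the referrer's query string is decoded. A minimal sketch with Python's standard library, using only the beginning of the referrer (cut after the hl parameter to keep it short):

```python
from urllib.parse import urlsplit, parse_qs

# Start of the referrer from the log line above, cut after the hl parameter.
referrer = (
    "http://www.google.com/custom?q=intitle:(resume+OR+CV+OR+Vitae)+%22C%2B%2B%22+"
    "%22Software+(Engineer+OR+Architect+OR+Programmer)%22+Python&hl=en"
)

# parse_qs turns "+" into spaces and decodes %XX escapes (%22 is ", %2B is +).
params = parse_qs(urlsplit(referrer).query)
search_query = params["q"][0]
print(search_query)
# intitle:(resume OR CV OR Vitae) "C++" "Software (Engineer OR Architect OR Programmer)" Python
```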

Try the referrer URL.

Now, you know what to write in your resume title if you want Google HR to find you.

Jablit documentation

A short but sufficient guide to configuring, running and using jablit.