KodeClutz: 04/01/2012

Wednesday, April 18, 2012

Lessons in 4 years

(This article was written as a piece of advice in the final edition of Entelechy for the year, now that I will be graduating)

Never give advice — a wise man won’t need it, a fool won’t heed it.

So rather think of what follows as a crystallization of my opinions, which may serve you well. The last four years of my life have been a period of discovery and extreme change. In that time I’ve either realised for myself, or read about and agreed with, certain qualities that will prove indispensable.

The first is self control. In college, nothing else matters! If you want to get things done, if you want to stay on the right path (subjectively of course), then nothing will matter more than self control. Not your intelligence, your wealth, your environment or your friends. Only your will-power. You and I are living in a world where addiction is easier than ever to fall prey to. Information bombards us from everywhere, a plethora of online services crave our attention, and in the physical world, chatter, video, liquids and powders demand it too. These distractions will stop you from giving your time to the things that matter the most to you. Those who can stick to their course will fare better.

If you are the sort of person who compulsively says yes to everything, learn to say NO. If you are a doer, people will ask for your help for the most trivial of things, or sometimes for the most important things. Decide if you have the time to devote to it; don’t immediately leave what you are already doing. If it is not worth it, learn to say a firm, direct, NO. Spreading your attention over forty different things means that all of them will remain incomplete.

The second thing that will prove invaluable is self learning. You are in an information technology college, and being autodidacts is more relevant to us than to anyone else. Information Technology moves far too fast, much faster than academia, than law, than social conventions or the speed of light. An inability to teach yourself new things will quickly put you out of the competition.

Being able to learn things on your own also opens a universe of experiences, each of which can be used to replace the boredom that leads to addictions. Self learning is not just watching video tutorials on the Internet; it means being able to evaluate yourself, setting new challenges and goals and keeping yourself on the path to achieve them.

Learning on your own will also help guide you towards your passion(s), because anything on which you are willing to spend time without anyone else telling you to is clearly something you love doing. Choose a job you love, and you will never have to work a day in your life. ¹

Stay fit. You come to college and forget all about play. You spend evenings either poring over books, or playing computer games or wasting time lounging about. In these actions you are setting the pattern for how you will live your life. You might think, ‘I’ll just eat less’ or ‘I don’t mind getting a bit fat’, but few things are as joyful as sweating it out on lush green grass, day after day. Revel not only in the improvement of your mind, but in the miracle of your body. It may not get you a job, but it will ensure that when you walk into the office, heads turn. Finally, like everything else, don’t overdo it. Learn to listen to your body and your body will listen to you.

Have a relentless pursuit to rise above the mediocrity that is encouraged by our social system (‘our’ here applies all over the world, I’m not dissing on India). Are you scared of what you’ve to do to achieve your dream? Start small, a little step here, another one there. Find the right level of challenge. But always aim higher than you know you can go. Have a bucket list, and an ideas list. Both lists will start to fill up like mad and you will never be able to tick all of them off. You are far too small, and the world too big to be able to see it all, but just by breaking out of your comfort zone, you’ll have seen more of it than most people can imagine.

Finally, go and Create!

If your daily life seems poor, do not blame it; blame yourself, tell yourself that you are not poet enough to call forth its riches; for to the creator there is no poverty and no poor indifferent place
                                -- Rainer Maria Rilke

There is nothing more anti-human than the lazy consumerism that is prevalent today. I do not mean you should write a book, or create software or write a song. The smallest creations make a big difference to your happiness. True happiness can be found in a football move created on the ground, or the smile created on another face by your actions. Let your creations, and not your tastes, be what defines you. Tastes only narrow and exclude people. So create. ²

I’ll leave you with 50 more things.

¹ Confucius said that.
² Paraphrased from why the lucky stiff.

Monday, April 16, 2012

Demystifying JSCrush

Some of you may have seen Philip Buchanan’s award-winning Autumn Evening entry for JS1K 2012 Love. While skimming the source I saw that he had used Aivo Paas’s JSCrush to compress the code. The JSCrush website is intriguing, with the source code minified and then passed through JSCrush itself (so it can be submitted to JS1K). The copy of JSCrush that runs on the page, inside the <script> tag, is itself the JSCrushed version. As a weekend project I tried to ‘reverse engineer’ JSCrush and understand it. It took me about 4 hours. What follows is a walk-through of the process.
JSCrush is a very interesting JavaScript program, liberally abusing eval(), global variables and insane levels of nesting to achieve a sort of compression.
Remember that your browser’s web development tools are indispensable for activities like this. I made extensive use of Firebug.

The de-obfuscation

I chose to start with the compressed version in the <script> tag rather than the plain text in the upper text field.
The syntax is clearly wrong for JavaScript, and it is all stuck in a string assigned to _. The part at the end is interesting though (properly formatted).
For every character in $, it splits _ on that character and uses with to make the resulting array the scope. It then pops the last piece off the array, joins the remaining pieces with it and reassigns the result to _. For example:

_ = "HelloRWorldRCrushedRAB"
$='R'

var temp = _.split($) // => temp = ['Hello', 'World', 'Crushed', 'AB']
var last = temp.pop() // => last = 'AB',
                      //    temp = ['Hello', 'World', 'Crushed']
_ = temp.join(last)   // => _ = 'HelloABWorldABCrushed'

Remember this step, it is key to how JSCrush works. These steps are repeated for every character in $, after which the ‘decompressed’ output is the minified source code:
You can see this for yourself by putting a console.log(_) just before the eval(_).
Now we’ve a fair idea of how JSCrush is doing decompression. Compressed scripts are stored in _, decompressed using the loop and then executed using eval().
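To make the loop concrete, here is a readable sketch of the same unrolling written as an ordinary function. The names uncrush, packed and joinChars are mine and do not appear in JSCrush; the real decoder is golfed and uses with and eval() rather than a helper like this.

```javascript
// Readable sketch of the decoding loop (hypothetical names, not the JSCrush source).
// packed    : the compressed text (what JSCrush keeps in _)
// joinChars : the join characters in decoding order (what JSCrush keeps in $)
function uncrush(packed, joinChars) {
  for (var k = 0; k < joinChars.length; k++) {
    var pieces = packed.split(joinChars[k]);
    var segment = pieces.pop();    // the repeated substring was appended last
    packed = pieces.join(segment); // put it back wherever the join character stood
  }
  return packed;
}
```

Calling uncrush("HelloRWorldRCrushedRAB", "R") reproduces the example above and returns "HelloABWorldABCrushed".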
The next thing I did was to un-minify the source (manually):
One change I’ve made is to the call to setTimeout(). I’ve converted its argument to a function to make it easier to read, and directly used the script tag’s innerHTML, since I had the decompressed source in the tag. The JSCrush code generates the textareas and button as part of its run and assumes body.children[9] to be the <script> tag with the JSCrush compressed source. Hence it replaces the eval call with the program source itself so that the inner eval() call in setTimeout() extracts the decompressed source and puts it as the value of the first textarea. It then calls L(), the JSCrush crushing function, to compress the original code back, so that you get the compressed version of JSCrush in the lower text field. Mind-boggling.
The setTimeout() without a time simply causes the code to be executed after the script has finished evaluating completely.

Understanding

Now that we’ve decompressed code, it still has the scars of minification – single letter variable names and no comments. Time to start reading the code. Line 1 is just setting up the HTML for the user. Lines 2-4 are the first interesting piece. The array Q is being populated with all the ASCII characters, in reverse order! The characters \n, \r, \\, ' and " are excluded, as are \0 and DEL, so that Q has 121 characters. Rather than using a readable if statement, Aivo is using the fact that && is ‘short-circuiting’ in JavaScript. Much space saving here.
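As an illustration of the short-circuit trick, here is one way such a table could be built. This is my own sketch under the description above, not the original lines 2-4; the function name buildJoinCandidates is hypothetical.

```javascript
// Sketch: collect ASCII 126 down to 1, skipping \n (10), \r (13), " (34), ' (39) and \ (92).
// NUL (0) and DEL (127) are never visited by the loop.
function buildJoinCandidates() {
  var Q = [];
  for (var code = 126; code > 0; code--) {
    // && stops at the first false comparison, so push() only runs
    // when `code` is none of the excluded characters.
    code !== 10 && code !== 13 && code !== 34 && code !== 39 && code !== 92 &&
      Q.push(String.fromCharCode(code));
  }
  return Q;
}
```

The chain of && does the work of an if statement: the push is simply the last operand, evaluated only when everything before it is true.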
Next we come to the definition of L. Line 12 just removes blank lines, whitespace and single line comments (except those following code). It also escapes backslashes so that the code is ready to be put into a string later. The result is assigned to the variables i and s. Be warned: from here on, in pursuit of a smaller size, variables frequently change their meaning so that they can be reused. s is always going to point to the code, but i is used as a counter all over the place.
Next, B is half the length of the program, m is the empty string. Line 15 is where it starts getting interesting. The pattern:

encodeURI(string).replace(/%../g, 'i')

occurs thrice in the code. Its task is to get the byte length of the string rather than the number of characters that string.length gives. In ASCII there is no difference, but if there are Unicode characters, they may occupy 2-4 bytes each. encodeURI will replace each byte of such a character with a ‘%xx’ code, with xx being the hexadecimal byte value. Replacing every such code with the single letter ‘i’ gets us one ‘i’ for every byte, so that the length of the resulting string is the byte length. This was one of the many clever tricks present in the JSCrush code. They might be well known, but this is the first time I came across it.
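Wrapped in a small helper, the trick looks like this. The name byteLength is my own; JSCrush inlines the expression each time.

```javascript
// UTF-8 byte length of a string via the encodeURI trick.
// Characters that encodeURI leaves alone count as one byte each; every
// escaped byte shows up as a three-character "%xx" code, which the
// replace collapses to a single placeholder character.
function byteLength(text) {
  return encodeURI(text).replace(/%../g, 'i').length;
}
```

For plain ASCII input this agrees with text.length, including characters such as a space or a literal %, which encodeURI escapes to a single %xx code. A two-byte character such as é contributes 2 and a three-byte character such as € contributes 3.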
The initialization in the for loop is only there to save a byte; it does not affect the loop itself in any way. Similarly, the m = c + m update can be moved to the end of the loop body. It builds up the decompression sequence that ends up in $. Since it has no condition, this for loop is really an infinite while loop.
Line 43 is again a trade-off of readability for size. Here it is in a cleaner form:

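c = 0;                      // no join character chosen yet
i = 121;                    // Q.length
while (!c && --i) {         // walk Q from its last entry (index 120) downwards
  if (!~s.indexOf(Q[i]))    // true only when Q[i] does not occur in s
    c = Q[i];               // Q[i] is unused, so it becomes the join character
}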

~ is bitwise NOT. If Q[i] is not found in the source, then indexOf will return -1, ~(-1) is 0 and !0 === true, so that this code is actually saying:

For every character Qi in Q in reverse:
    If the source does NOT have the character in it:
        c = Qi

Or, c is set to an ASCII character that is not present in the program. Initially it will be ASCII 1, then perhaps 2 and so on. This ‘c’ is now the character that will be used to join the pieces obtained in Lines 20-32. This is one round. When all the characters have been used up, compression stops (Line 18).
Lines 12-32 basically try to find long, repetitive strings that can be replaced with a single character, to get the best compression. JSCrush follows a brute-force approach to find these segments. With single variable names, the code is a mess, so here is a cleaned up version which makes things much clearer:
Lines 9-28 try to find segments which repeat at least twice in the code. Longer segments will give better compression, so we try all of them. For segments of length 1, we try every character in the string, for segments of length 2, we try every pair and so on. If a segment repeats, we keep track of its count.
The segmentLengthBound = longestSegmentLength (B=Z) bit is interesting, and it took me some thinking to figure it out. It relies on the following facts:

The longest segment in the current source is longestSegmentLength.
Splitting by something, and then joining by a character not in the source will not lead to creation of longer segments.

So we can restrict segmentLength of the next round to segmentLengthBound.
Lines 32-41 choose the best segment to substitute in this round. The expression (R=o[i])*j-R-j-1 may seem cryptic, until you look a little later in the code where the split and join is done, and you remember how JSCrush works. R * j is the number of bytes we will remove by replacing this segment. But to join the split, we’ll need one character for every repetition, followed by one character to separate the segment suffix itself. The conditional asks if this leads to actual, and better, compression than what we already know of. If no such segment was found, we are done compressing. Otherwise we split by the segment, join the pieces by the join character and tack on the segment at the end. One round done!
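The arithmetic is easy to check with a small sketch. The function names and the sample text below are mine, not taken from JSCrush, and the sketch counts characters rather than bytes for simplicity.

```javascript
// Savings from replacing `count` repetitions of a segment of `length` characters:
// count * length characters disappear, but we pay one join character per
// repetition, one more to separate the suffix, and `length` for the suffix itself.
function segmentSavings(count, length) {
  return count * length - count - length - 1;
}

// One round: split on the segment, join with the unused character,
// then tack the join character and the segment on at the end.
function crushRound(source, segment, joinChar) {
  return source.split(segment).join(joinChar) + joinChar + segment;
}
```

For the text "foo bar foo bar foo" and the segment "foo" (3 repetitions of 3 characters), segmentSavings(3, 3) is 2, and crushRound shrinks the 19-character text to 17 characters. A segment of length 2 that repeats only twice scores -1, so it would not be chosen.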
Once multiple rounds have been done, the script is compressed, and only some trivial things remain. The value of B is now changed to store the quote character (double or single), whichever appears fewer times in the source. Since the compressed program is stored in a string, using the quote that appears fewer times means fewer ‘\’ escapes to add, each of which costs a byte. We then prepare the boilerplate, setting _ to the now compressed source, setting $ to the decompression sequence m and adding the evaluation code. The savings accomplished are announced too.
One trick I picked up in the code is forcing a fixed number of decimal places when displaying a number.

i/S * 100

would give a floating-point percentage with many digits after the decimal point. Instead, multiplying the ratio i/S by 10000 (that is, the percentage by a further 100) moves two more digits into the integral positions, bitwise OR-ing with 0 truncates to an integer, dropping the remaining fractional digits, and then dividing by 100 gets us a percentage with just the two decimal digits we want.
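As a sketch, assuming the goal is a percentage with two decimal places (so the scale factor applied to the raw ratio is 10000), the trick can be written as a helper. The name percentTwoDecimals is hypothetical.

```javascript
// Percentage of `part` in `whole`, truncated (not rounded) to two decimals.
// `| 0` converts to a 32-bit integer, so this only suits small values
// such as percentages.
function percentTwoDecimals(part, whole) {
  return (part / whole * 1e4 | 0) / 100;
}
```

For example, percentTwoDecimals(1, 3) gives 33.33 where 1 / 3 * 100 alone would print a long tail of threes.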

Summary

JSCrush works by:

1. Finding the first unused ASCII character to act as the join.
2. Finding the substring of the program text that gives the best space savings if its repetitions are all replaced by the ASCII character from 1.
3. Splitting the source on 2 and joining the pieces using 1, tacking on 2 to the end. This string replaces the original source.
4. Repeating 1, 2 and 3 until no more savings are possible or we’re all out of ASCII.
5. Wrapping the compressed source into a string, then using the list of join ASCII characters to unroll the string.
6. Unrolling is performed by splitting on every ASCII character used in 3, extracting the original repeated substring 2 from the split and joining the parts.
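Putting the summary together, here is a deliberately unoptimised sketch of the whole scheme. It is my own readable reconstruction, not the JSCrush source: all names are hypothetical, it measures characters instead of bytes, and it skips the final wrapping into a string with eval().

```javascript
// Brute-force search for the repeated substring with the highest savings.
function findBestSegment(source) {
  var best = null;
  var bestSavings = 0;
  for (var len = 2; len <= source.length / 2; len++) {
    for (var at = 0; at + len <= source.length; at++) {
      var segment = source.slice(at, at + len);
      var count = source.split(segment).length - 1; // non-overlapping repeats
      var savings = count * len - count - len - 1;
      if (savings > bestSavings) {
        best = segment;
        bestSavings = savings;
      }
    }
  }
  return best; // null when no substitution would shrink the text
}

function crush(source) {
  var joinChars = '';
  for (var code = 1; code < 127; code++) {
    var joinChar = String.fromCharCode(code);
    // Skip characters that would need escaping or already occur in the text.
    if ('\n\r"\'\\'.indexOf(joinChar) !== -1) continue;
    if (source.indexOf(joinChar) !== -1) continue;
    var segment = findBestSegment(source);
    if (segment === null) break; // no more savings possible
    source = source.split(segment).join(joinChar) + joinChar + segment;
    joinChars = joinChar + joinChars; // newest join character is decoded first
  }
  return { packed: source, joinChars: joinChars };
}

function uncrush(packed, joinChars) {
  for (var k = 0; k < joinChars.length; k++) {
    var pieces = packed.split(joinChars[k]);
    var segment = pieces.pop();    // the repeated substring sits at the end
    packed = pieces.join(segment);
  }
  return packed;
}
```

Running crush on a repetitive program and feeding its packed and joinChars fields to uncrush should return the original text, which is a handy sanity check when experimenting with the search.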

I hope this (long) post was interesting and educational. If you have feedback, a comment would be great.

Friday, April 06, 2012

Playing with Go

With Go 1 being released, I’ve been playing with the language once again. As a long weekend hack, I created a clone of the literate programming Docco tool in Go – quite naturally called Gocco.

This blog post chronicles my feelings about the language, and some rough spots I got stuck in.

I started out with a direct translation of the original CoffeeScript to Go. Go is remarkably unlike C and more like high level languages. Much of the translation was automatic and is not very different from the CoffeeScript source, except the pervasive use of []byte rather than string.

What I loved about Go was the ability to directly import and install packages from various hosting services like GitHub, Google Code and Bitbucket. This combined with the go command makes package management a breeze. What I would really like though is for Go to switch to default local-to-global search path semantics, like npm. You can do it using GOPATH, but making it the default would be better. I am also not sure how you can specify which version of your dependencies to install.

The inbuilt templating support is also a great tool. I didn’t have to hunt around for a templating library, and the syntax was very clean and similar to Jinja2 or Mustache. There seems to be some bug in the ParseFiles method though, as it kept complaining that the template was incomplete or empty, but when I manually read the same file into a string and called Parse, it worked.

The tooling support is excellent, with gofmt and go ensuring everything is ‘standardised’, which makes others’ code much easier to read.

Coming to the slightly wonky/undefined/undocumented/(may actually be my fault) parts…

I still haven’t really figured out Go’s package system. The go/build package says it allows introspection over packages, like figuring out the installation path (Similar to Python’s __file__). But I couldn’t get this to work with my code, especially when it was not installed. I wanted it so I could keep the HTML template and CSS in separate files, and use the package path to figure out where they were kept. I finally got very frustrated and just stuck them into a go file as multi-line strings.

Syntactically, multiple return values are a good way to signal errors without exception handling, but there really should be a way to ignore the second (error) return value. Not being able to do that makes composing functions impossible. Similarly, the unused variable and import errors can get annoying while developing, when there is lots of stub code. There is a workaround but still! :)

Still, it was a lot of fun using Go. I originally wanted to implement Docco in C, but the lack of a rich standard library and type support was scaring me. Now I will reach for Go when it comes to systems programming.

A thanks to Russ Ross, for an excellent Markdown implementation.

Monday, April 02, 2012

Great Indie-an Acts

(This was originally published in Entelechy Edition 34, March 2012)

With Synapse a few weeks old, the constant Raghu Dixit songs are starting to fade away from the hostel corridors. Except when we hear bands live, or are aware of an upcoming concert, Indian artists aren’t given much ear time. Bollywood dominates too much. But a growing independent music scene is flourishing in India, mainly due to rising economic levels and more people willing to pursue their dreams. Here are some great Indian artists that I’ve recently liked. Vishal Shah and Indian Music Revolution are instrumental in introducing some of them.

Advaita

I first heard of Advaita and Swarathma in “Hindi Hein Hum”. Advaita term themselves as a ‘Rock/Eclectic/Organic Fusion’ band from Delhi and are similar to The Raghu Dixit Project. Finding their music is extremely hard, both legally and illegally, because they haven’t launched their music on Flyte, and everywhere else is too expensive. I’ve been resorting to streaming from Spotify, but finally caved in and ordered one of their CDs from Flipkart (low-tech maybe, but also the cheapest). Not much is known about them, but they are an octet (geeky?) who had their breakthrough when John Leckie selected them (and Swarathma) as one of the four Indian bands in the India Soundpad project.

Their debut album is Grounded in Space, released in 2009. Ghir Ghir is an amazing number on this album. Advaita makes very good use of some Indian classical vocals and raagas, while injecting western instruments and sound structures with the tabla and sarangi. In fact, Ghir Ghir has parts which wouldn’t sound out of place coming from Dream Theater.

Perhaps their best song is Mo Funk from their latest album The Silent Sea (2012). Vocals (by Ujwal Nagar) and tones that would make my (stuck-up) school music teacher very happy, layered on a background score that is like trance or house music make this 6 minute piece a must-listen. 4 minutes into the song, western vocalist and acoustic guitarist Chayan Adhikari takes over and culminates the ‘Funk’ part of the song.

Other songs to start with are Tremor (SS) and Drops of Earth (GiS).

Swarathma

I think I’ve already introduced Swarathma in the previous section :) Swarathma is a 6-member ensemble from Bangalore (out of which 2 crack PJs). They’ve featured on The Dewarists Episode 7 with Shubha Mudgal.

One of their SoundPad project compositions – Yeshu, Allah aur Krishna – goes “Sant Kabir aye dharti pe, …, jo socha tha woh reh gaya sapna …” and seems to reflect the band’s philosophy. Techies, B-school grads and other achievers, pursuing their passion and making their dreams come true. Their only album is self-titled and released in 2009.

Leader Jana Kahan Hai Mujhe is about “choices we face at each step in our lives, and we are frequently at a loss when called upon to make a decision”. In contrast to Advaita, Swarathma has restrained instruments, a slow, low chord guiding the vocals, and a splash of tabla here and there.

My other favourite is Ee Bhoomi. Kannada must be a happy language indeed, to produce gems like this one and Raghu Dixit’s Lokada Kalaji. This upbeat song describes the transformation of the Earth to Paradise (… bhoomi swarga …) as you let Swarathma wash over you.

Peter Cat Recording Co.

In a case of last but not the least, PCRC is the band I’ve actually been listening to the longest. They remind me of the 60s and 70s when Rock was being born and could be happy and everyday, not having to force melancholy or abstract ideas to be appealing. I’ve been a fan since before they had a released album, when they were just tracks on SoundCloud, and these guys doubled that by distributing both their albums (Sinema and Wall of Want) free for download. The New Delhi quartet describes their music as “Gypsy Jazz to Ballroom Waltz to Midnight Moonlight car chase music”.

Sinema is their older and better album. All the songs have a grainy distortion throughout as if playing from vinyl, although whether this adds to their charm or not is a personal opinion. To quote Helter Skelter:

That it screams ‘SEX SEX SEX’ right under the ‘Free Download’ link for their album Sinema on the Peter Cat Recording Co. web site should be argument enough for you to go ahead and download it.

The Clown on the 22nd floor sets the mood for the album, with old Hindi movie clips tacked on to the end. The album crafts a series of love affairs that end badly. Suryakant Sawhney’s rounded drawl adds warmth to the songs, while clever lyrics mingle girls and philosophy. Coming from a band obsessed with humour and subtlety, the closer Tokyo Vijaya’s lyrics just go ‘AAAaaaaaaaaaaaaaaaaaaaaaaaaaaaa’ against a dense instrumentation of drums and dragged out guitars, ending the story inconclusively. PCRC is a band that aims big and delivers funny.

Sunday, April 01, 2012

Apache Cassandra: Iterate over all columns in a row

Recently I have been using Cassandra for one of my projects, and one of the needs is to iterate over all columns of a row. Each column holds an individual piece of data, of a type identified by the row id, and the columns keep changing. So I can’t simply use a set of known column names. Using the setRange call on a SliceQuery and setting a large count is also not an option, since Cassandra will try to load the entire set of columns into memory. Instead I’ve written this iterator, which takes a query on which the row key and column family have been set, and loads columns as they are requested. By default it loads 100 columns at a time. You could make it take the count as a parameter, but this works for me for now.

The one ‘problem’ with this is the removal of the last column to ensure that there are no duplicates, while still having a start point for the next query. This is because each column is independent, so you cannot ask a column who its next neighbour is and start the next query from there. If anybody has a tip to make it more elegant, I’d love to hear it.
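The paging idea itself can be sketched without any client library. The following is a minimal JavaScript illustration of the technique, not the iterator from this post: fetchSlice is a hypothetical stand-in for a slice query that returns up to count column names starting at start (inclusive), and makeFetchSlice fakes one over an in-memory sorted array. Neither is a Cassandra or Hector API.

```javascript
// Hypothetical stand-in for a slice query over one row: returns up to
// `count` column names that sort at or after `start`, in sorted order.
function makeFetchSlice(sortedColumns) {
  return function fetchSlice(start, count) {
    var from = 0;
    while (start !== null && from < sortedColumns.length && sortedColumns[from] < start) {
      from++;
    }
    return sortedColumns.slice(from, from + count);
  };
}

// Lazily yields every column, asking for `pageSize` columns per query.
function* iterateColumns(fetchSlice, pageSize) {
  if (pageSize < 2) throw new Error('pageSize must be at least 2');
  var start = null; // null means "begin at the first column"
  for (;;) {
    var page = fetchSlice(start, pageSize);
    var full = page.length === pageSize;
    // On a full page, hold back the last column: it becomes the start of the
    // next query and is yielded from that page instead, so nothing repeats.
    if (full) start = page.pop();
    for (var k = 0; k < page.length; k++) yield page[k];
    if (!full) return; // a short page means the row is exhausted
  }
}
```

With seven columns and a page size of 3, this issues four queries and yields each column exactly once. The page size must be at least 2, otherwise the held-back column would be the only one fetched and the loop would never advance.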
