Saturday, July 31, 2010

Carbonado, Graphs, FUSE, Merge-Join and assorted stuff

This month in tech... before that, here's a testimonial for an open source project that you can't beat:

You could've been rich - My mother

Moving on, I heard about a persistence API called Carbonado on the Voldemort forums. It's an open source project from the Amazon guys. It's a no-frills (read: clean and simple) layer that works with Berkeley DB and JDBC. It's even blessed by the BDB guys as a nicer layer on top of BDB.

Here's a decent presentation on graph algorithms from the Hadoop summit. Not very detailed, more like best practices and hints. And here's a nice illustration of PageRank using Javascript.

There's an interesting thread going on between the Hotspot GC team and an HBase engineer facing some GC problems. Have a look at the new generation sizes they've used for some deployments; they were new to me.

If you want to OD on JVM options, there's a list for that too.

Some folks playing with the userland filesystem in Unix - FUSE. Voldemort, Github and all sorts of funny stuff as Filesystems. Reminds me of GDrive.

NoSQL systems are notorious for not being able to do simple joins. Their answer is Map-Reduce. For running multi-attribute filtering, there's Merge-Join. Google's App Engine, which is like a poor man's data store, suggests the same (slide 30). I am skeptical of such queries that run on a cluster of machines without any indexes, burning CPU on all machines and moving data back and forth. I can't imagine what it does to latency.
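To make the merge-join idea concrete, here is a minimal sketch in Java. It assumes that each attribute filter has already produced an ascending, duplicate-free list of row keys; the class name, the key names and the two filters ("red" and "AWD") are made up for illustration and are not taken from any of the systems mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeJoinSketch {
    // Intersects two ascending, duplicate-free lists of row keys in one pass.
    // Each step advances whichever side currently holds the smaller key.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // key satisfies both filters
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {    // a is behind, move it forward
                i++;
            } else {                 // b is behind, move it forward
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical results of two single-attribute filters.
        List<String> red = Arrays.asList("car03", "car07", "car12", "car20");
        List<String> awd = Arrays.asList("car01", "car07", "car20", "car31");
        System.out.println(intersect(red, awd)); // prints [car07, car20]
    }
}
```

The pass is linear in the combined length of the two lists, but only because both inputs arrive sorted. If they are not, each list has to be sorted or scanned first, which is where the CPU and data movement go.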

Tuesday, July 27, 2010

Overheard in a coffee shop: "A few yrs ago I was young and restless. Now I'm just old and breathless most of the time".

Sunday, July 25, 2010

Apache Cassandra for first timers

I wanted to get a feel for how Apache Cassandra works, so I downloaded and installed (just copied) the files. I decided to run the single node test. Here's what I did on my Windows 7 laptop:

1) Download and unzip the latest Cassandra zip file to some folder - D:\Dump\apache-cassandra-0.6.3
2) Open a command prompt at the main Cassandra directory and type - bin\cassandra.bat
3) That's it! You have a single node server that is running with the defaults. It creates and starts logging to some default location. In my case it was - D:\var\lib\cassandra\

Now, I wasn't too happy with the default Keyspace configuration - this is the "schema". So, I shut down the server, deleted the log directory and modified the configuration file in conf\storage-conf.xml. I simplified the Keyspace to 2 simple sub-sets - a column family called Message and a super column family called Car.

The more I look at the Cassandra column family structure, the more it reminds me of XML.


Then I started the CLI batch file to punch in some commands. I wasn't expecting this. I was really looking forward to a simple Java client program and there isn't any. So, those HBase guys were not kidding when they said Cassandra does not have a simple client program. You have to use a Thrift client or some other third party client like Hector. I wasn't too eager to do that, so I just went the command line way.

It seemed easy enough. It takes a few minutes to understand which one is the key name, the column family name and the super column family name. The advantage is that it's like a hierarchy of SortedMaps, which means that records do not even have to have the same column names. Notice that there are some slight differences in the columns I've entered, like "Upgrade", "Leather seats" or "AWD", which are not there in the other records. So, there is some flexibility.
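The "hierarchy of SortedMaps" picture can be sketched with plain Java collections. This is only a mental model, not how Cassandra stores data or what its client API looks like. The row keys ("car-1", "car-2"), the super column names ("Features", "Specs") and the values are invented for the example; only the column names "AWD", "Leather seats" and "Upgrade" come from the text above.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class SortedMapHierarchySketch {
    // Adds one value under: row key -> super column -> column name.
    static void put(SortedMap<String, SortedMap<String, SortedMap<String, String>>> family,
                    String rowKey, String superColumn, String column, String value) {
        family.computeIfAbsent(rowKey, k -> new TreeMap<>())
              .computeIfAbsent(superColumn, k -> new TreeMap<>())
              .put(column, value);
    }

    // Builds a tiny, made-up "Car" super column family.
    static SortedMap<String, SortedMap<String, SortedMap<String, String>>> sampleCars() {
        SortedMap<String, SortedMap<String, SortedMap<String, String>>> car = new TreeMap<>();
        put(car, "car-1", "Features", "AWD", "yes");
        put(car, "car-1", "Features", "Leather seats", "yes");
        put(car, "car-2", "Features", "Upgrade", "sunroof"); // column absent from car-1
        put(car, "car-2", "Specs", "Doors", "4");            // super column absent from car-1
        return car;
    }

    public static void main(String[] args) {
        // Prints the nested maps; keys come out sorted at every level.
        System.out.println(sampleCars());
    }
}
```

Nothing forces two rows to share column names, which is the flexibility described above. The price is that every row carries its own column names.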


Some thought must be given to how efficient the storage is when you intend to store millions of records at the column, super column family or column family level. Search the Cassandra-User mailing list; there are lots of discussions there on which layout is better.

Thoughts:
1) The installation is easy but the lack of a proper client is bothersome
2) CLI looks good for key-value type of queries, but I was really interested in those queries on slices and ranges. I couldn't find anything ready made
3) HBase and Hive along with Cloudera's Beeswax UI for running SQL-like queries are very compelling. But, have a look at the HBase installation. It doesn't look easy. That's why I decided to try Cassandra.
4) This article here is the most succinct comparison of Cassandra and HBase

Until next time!

Tuesday, July 20, 2010

/dev/null on Windows

I was trying to run Apache ZooKeeper on Windows the other day. Getting it to run was super easy. I was more interested in running it without any file/snapshot logging.

I did ask around in the forums and I thought pointing the "dataDir" directory at "/dev/null" would solve the problem. But being a (ahem) Windows user, I couldn't quite get "/dev/null" to work. On Windows the equivalent is "nul", but it doesn't quite work when you try to use it from Java. Some operations work, but some don't.
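For the part that usually does work, here is a minimal sketch: pick the null device name from the OS name and open an output stream on it. The class and method names are hypothetical, and this is not the code from my gist. It also assumes that a plain FileOutputStream on "NUL" is accepted on Windows. It says nothing about whether a directory setting such as "dataDir" can point there, which is a separate question.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class NullDeviceSketch {
    // Returns the conventional null device name for the given os.name value.
    static String nullDevicePath(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows") ? "NUL" : "/dev/null";
    }

    // Writes the text to the device and returns how many bytes were handed over.
    // The device is expected to discard them.
    static int writeAndDiscard(String devicePath, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = new FileOutputStream(devicePath)) {
            out.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.length;
    }

    public static void main(String[] args) {
        String device = nullDevicePath(System.getProperty("os.name"));
        int written = writeAndDiscard(device, "snapshot");
        System.out.println(device + " accepted " + written + " bytes");
    }
}
```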

[Update 1:
Strangely, getAbsolutePath() prepends the current directory's path but the file does not get created.
Creating nul:\abc.log throws an exception, but nul:abc.log does not. Java says its absolute path is d:\dump\nul:abc.log, but the file is not there, which means that it is indeed writing to the "null" device. I wonder what I'm missing.]

As usual, the full code is here: http://gist.github.com/484116


Here's the output and it shows what works and what doesn't:

Sunday, July 18, 2010

Years before Inception (movie)

Here's a short list of TV episodes where similar concepts of dream manipulation and recursive realities were explored decades ago:

The Avengers 1967:
Death's Door (A series of diplomats are drugged and forced to participate in their own nightmares. On waking up, events unfold just like the nightmare, all the way up to the diplomatic event.)

Star Trek: The Next Generation: Season 6:
Ship in a Bottle (A sentient hologram of Dr. Moriarty tricks the crew into thinking they are back on the ship after visiting the Holodeck. In actuality they are in a hologram ship inside another hologram.)

Frame of Mind (An officer is trapped and taken prisoner while on a mission. He is then drugged and his dreams are tampered with. He starts thinking that his actual life on the Enterprise was a delusion and is convinced to find closure with his delusional characters by killing them in his mind among other things)

Star Trek: Voyager: Season 2
Projections (The ship's holographic doctor is convinced that he is not a hologram but the actual hologram designer, who has got lost in his own mental simulations on a Star base near Jupiter. He is asked to give up control of the ship to end the simulation.)

There are similar types of episodes in Mission: Impossible Season 1 (1966) and Season 2.

[Update 1: Also see DreamWithinADream]

Saturday, July 17, 2010

Having 2 chins is better than 1... Not!

Monday, July 12, 2010

It occurs to me that our mind is like a snow globe and the snow is our thoughts and ideas. It is most beautiful when the snow is swirling around the glass.

Sunday, July 11, 2010

A simple (Project) Voldemort test on Windows

Here's a simple Voldemort test program. It's basically the one that ships with the project but with a few small modifications to make it work on Windows/Cygwin.

First off, the server scripts are all Unix shell scripts. So, I had to install Cygwin on my Windows 7 laptop. Then the most essential bin/voldemort-server.sh script had to be modified a little to work with Cygwin because Java does not recognize the Cygwin mapped Path and Classpaths like /cygdrive/d/Dump/voldemort-0.81. You have to wrap paths with cygpath to map them back to the actual paths:  java -cp $(cygpath -w -p -l -a $CLASSPATH) ...

The simple Java client test program needs dist/voldemort-0.81.jar and dist/voldemort-test-0.81.jar. I had to find out about the second jar from their issues list.

That's it! Now, you can start the single node server and run the client Java program as many times as you like.

All the files are available here: http://gist.github.com/472163. Here are the snippets.

Client:

Server script:

Client output:

Monday, July 05, 2010

Bay Area at Sunrise

Waking up at 4.30 in the morning on a holiday to watch the sunrise is not something people even think of doing. Today, I did exactly that.

I was up before dawn and was on my way to Page Mill Road (off CA 280) to watch the sunrise. This being my first time (waking up this early, I mean), I was unsure of where to go. So, I chose my regular hiking spots. I drove up Page Mill before sunrise (5.50 AM) and was well on my way up the hills.

There was no one in sight. No bikers! Yay! I saw quite a few deer grazing next to the road and many hares/rabbits. It was peaceful and cool, and it smelled like wild flowers. I drove down Page Mill, stopping at a few places to take pictures. At this hour, all the parks are still closed. Your best bet is to drive towards the end of Page Mill, beyond Montebello and even across the CA 35 intersection. There is one area which leads to private property. You can pull over in front of the gate and take a few snaps quickly. It's a pity there aren't any vista points on Page Mill to park and enjoy the sunrise.



Sunrise. View from Page Mill Road, a few miles before reaching Montebello preserve


Woodside and areas west of CA 280. Just a few minutes after sunrise

Sun rising. Almost the end of Page Mill Road

Then, when I reached CA 35, I continued down Page Mill for a few miles, turned around and went north on CA 35. Just a mile or so from that intersection, there is a vista point overlooking the Bay Area. This is a nice spot. The sun had already risen by the time I got here.


Sun rise over the Bay Area

It's not the ocean, it's the Bay Area!! Low level clouds

Deja (rear) view? The future looks just like the past

Overall, it was worth it. I was back home by 8 AM! The Bay Area was still cloudy. It was a surprising discovery for me to see that the hills are actually clear of fog and clouds. It's only at sea level that the early mornings are gray and dull.

Back to CA 280. 7.30 AM and still cloudy at sea level

Sunday, July 04, 2010

Weekend at the Zoo(Keeper)

I've written about Apache ZooKeeper before, but I had never actually tried it. Only today did I get a chance to play with it.

The ZooKeeper recipes really piqued my curiosity. So after spending a few hours reading the docs, I decided to give it a try. My interest was purely the performance side of it. ZK makes it very clear in the docs that it excels under read-heavy workloads. And the more replicated servers you add, the better it gets. They were not kidding.

I have my test code here: http://gist.github.com/463930. Keep in mind that this is a simple test, perhaps even a micro benchmark. It does not even have the minimum 3 servers for a quorum. Remember, my tests were run on a new (2010) laptop with 4 hyper threads, with some simple Xms/Xmx JVM settings and everything else left at the out-of-the-box defaults. This is by no means a representative test. There are official numbers on the ZK wiki with tests run on a real server class machine. You should have a look at those too.

Well, what can I say - it is a little slow. Even writing messages of a few bytes takes a while. Granted, each write in a loop requires a network call. So, if I write 1,000 messages, it requires 1,000 remote/network calls. The CreateMode.PERSISTENT_SEQUENTIAL is very handy, like the RDBMS autogenerated-id column.

I would've liked a few more batch-oriented calls like getDataForChildren() and createIfAbsent(), instead of having to make 2 calls: first to find out the child names and then to get the actual data. But hey, I'm just trying to shoehorn it into the wrong use case.
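As a rough illustration of that wish (this is not my test code and not the ZooKeeper API), the sketch below wraps the two-step read into one helper. NodeStore and InMemoryStore are hypothetical stand-ins so the example runs without a server; with the real client the two steps would map onto its getChildren and getData calls. The call counter shows why this stays chatty: one call to list the children plus one call per child.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ChildDataSketch {
    // Hypothetical stand-in for a ZooKeeper-like client; not the real API.
    interface NodeStore {
        List<String> getChildren(String parentPath);
        byte[] getData(String nodePath);
    }

    // In-memory store that counts calls, standing in for network round trips.
    // Assumes a flat layout: every stored path is a direct child of its parent.
    static class InMemoryStore implements NodeStore {
        final Map<String, byte[]> nodes = new TreeMap<>();
        int calls = 0;

        public List<String> getChildren(String parentPath) {
            calls++;
            String prefix = parentPath + "/";
            List<String> names = new ArrayList<>();
            for (String path : nodes.keySet()) {
                if (path.startsWith(prefix)) {
                    names.add(path.substring(prefix.length()));
                }
            }
            return names;
        }

        public byte[] getData(String nodePath) {
            calls++;
            return nodes.get(nodePath);
        }
    }

    // The wished-for helper: list the children, then fetch each child's data.
    static Map<String, byte[]> getDataForChildren(NodeStore store, String parentPath) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (String child : store.getChildren(parentPath)) {
            result.put(child, store.getData(parentPath + "/" + child));
        }
        return result;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.nodes.put("/queue/msg-0000000000", "hello".getBytes());
        store.nodes.put("/queue/msg-0000000001", "world".getBytes());
        Map<String, byte[]> data = getDataForChildren(store, "/queue");
        System.out.println(data.keySet() + " fetched with " + store.calls + " calls");
    }
}
```

A helper like this only hides the loop on the client side. A real batch call would have to be answered by the server in a single round trip.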

This is the simple test and the sample console output is further below. You can always get the full code from my Gist repo:

Console output: