Sunday, January 19, 2014

Rsync in Java - a quick (and partial) hack

Over the years I've been (mildly) fascinated by how various version control tools and file backup utilities work. Especially the core algorithm that drives many of these file send/diff/backup/de-duplication programs.

Rsync being the most widely used tools and the basis for many extensions, I naturally tried to wrap my head around it's working. But I thought the details were somewhat hazy. Maybe it was just me but I was looking for a simpler, clearer implementation of the algorithm and not a fully functioning program.

Recently, I gave it another shot. I waded through some of the material available on the interwebs and bravely set out to implement it to see how much of I had understood.

So, here is the basic implementation in Java. It may not be a faithful implementation of the paper but the gist is:

  • Create a summary out of fixed blocks of input text (original)
  • Use these blocks as reference against another text (modified)
    • This modified text is slightly different from the original text
    • Hence the assumption that the original text can be transformed to the modified text without having to send the entire modified text back
  • The modified text can now be transformed into a combination of:
    • References to those original blocks where there were no changes
    • And any differences as simple text
The code is available here and the same is embedded at the bottom of this post.

Some notes on the implementation:
  • It only handles Java Strings
  • It uses a combination of Rabin-Karp rolling hash for quick, incremental hashing of blocks and CRC32 for hash conflict resolution. In reality a much more robust hash should be used instead of CRC32
  • It assumes that the list of generated "blocks" is available on the other side to generate the patch. In reality there has to be a more clearly defined mechanism/protocol to exchange these blocks
  • The overall algorithm to identify common/repeating hashes should be smarted than this
References:
Code:
Until next time!

0 comments: