Part-time Job at PACC (post 1)

Hello to everyone who might be reading my post now!

So, I started my part-time job on the PACC project at the beginning of February.

We started by setting up the computer lab on the 4th floor. We found out that several machines did not work properly or were configured incorrectly, so we had to check each computer to make sure everything was working.

Then, we installed Ubuntu on every machine, but this time it was decided to install Ubuntu alongside Windows. This is just a fresh installation, so there is no authentication system for students. Instead, they just work as the Guest or Quiz users (the latter was created on purpose so that students would have all the necessary software installed for them; you can read more about this in other students' blogs on this site). We definitely need an authentication system so that each student could have his/her own account and directories, but that will probably come later.

In the summer, I successfully got LDAP and NFS working together. However, the problem with NFS is that this file system is not distributed. A user can store his files on the machine he was working on, and the files are also stored on the server. But if the user wants to use a different machine, all his files on the server have to be copied to the new machine when he logs in. The files on the server and the client machines are kept synchronized, but if the user works on a different machine every time, copying the files from the server might slow down performance.
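To give an idea of what that setup looks like, here is roughly how home directories are usually shared over NFS (the host name, network and options below are only placeholders for illustration, not our actual configuration); LDAP then takes care of the accounts, so the same username and password work on every machine:

    # on the file server: /etc/exports
    /home    192.168.1.0/24(rw,sync,no_subtree_check)

    # on each lab client: /etc/fstab
    fileserver:/home    /home    nfs    defaults    0    0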

The solution is to use a distributed file system. I looked at different distributed file systems, and what I found is that GlusterFS and HDFS (Hadoop Distributed File System) seem to be the most promising. They are also free and open source, which is important! Here, let me introduce HDFS. Here it comes...

HDFS consists of connected clusters of nodes. A cluster typically has one NameNode, which manages the file system namespace, handles namespace operations, regulates clients' access to files, and maps data blocks to the DataNodes that store them. A cluster also has several DataNodes, which manage the storage on the nodes they run on. The node that the NameNode runs on is the master node; the nodes that the DataNodes run on are the slaves. If you want more information about the architecture of HDFS, I recommend reading the Hadoop documentation or the introduction to HDFS by developerWorks.
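To make this a bit more concrete, here is a small Java sketch of how a client program talks to such a cluster through the HDFS API. The NameNode address hdfs://master:54310 and the file path are just placeholders; use whatever fs.default.name is set to in your core-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the NameNode (placeholder address;
            // it must match fs.default.name in your core-site.xml).
            conf.set("fs.default.name", "hdfs://master:54310");

            // The client asks the NameNode where the blocks should go,
            // then streams the data to the DataNodes that store them.
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/user/student/hello.txt"));
            out.writeBytes("Hello from the lab!\n");
            out.close();
            fs.close();
        }
    }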

[HDFS architecture diagram] The picture is taken from http://www.ibm.com/developerworks/library/wa-introhdfs/.

To install a Hadoop cluster, I advise you to follow either the installation guide from the Hadoop documentation or this guide posted by Michael G. Noll.
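Whichever guide you follow, the core of a multi-node setup is telling every node where the NameNode lives and telling the master which machines run DataNodes. Roughly, it comes down to something like this (the host names and port are only example values, not what we actually used):

    <!-- conf/core-site.xml on every node: where the NameNode lives -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>

    # conf/slaves on the master: one DataNode host per line
    slave1
    slave2

    # then, on the master, format HDFS and start the daemons
    bin/hadoop namenode -format
    bin/start-dfs.sh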

There were several problems I encountered during the setup. The one that took me the most time was that, when I tried to set up a multi-node cluster, the DataNodes could not launch. As I figured out later, the problem was in the machine's "/etc/hosts" file: the line "127.0.1.1 your_machine's_host_name" came before "actual_network_address your_machine's_host_name", and that was the reason the host name was not resolved to the correct address. So, just put the line with the network address before the 127.0.1.1 line, or comment out the 127.0.1.1 line since you don't need it. Even though this one issue caused several more problems, the fix turned out to be quick and easy :)
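To make it concrete, here is what the relevant part of /etc/hosts should look like once fixed (the address and host name are placeholders):

    # /etc/hosts -- the real network address must come first
    192.168.1.10    node1
    # 127.0.1.1     node1    (commented out, or moved below the line above)
    127.0.0.1       localhost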
