Installing EC2 CLI tools

Setting Up the Amazon EC2 CLI Tools on RHEL, Ubuntu, or Mac OS X

You must complete the following setup tasks before you can use the Amazon EC2 CLI tools on your own computer.

Download and Install the CLI Tools

To download and install the CLI tools

  1. Download the tools. The CLI tools are available as a .zip file on this site: Amazon EC2 CLI Tools. You can also download them with the wget utility.
    wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
  2. (Optional) Verify that the CLI tools package has not been altered or corrupted after publication. For more information about authenticating the download before unzipping the file, see (Optional) Verify the Signature of the CLI Tools Download.
  3. Unzip the files into a suitable installation directory, such as /usr/local/ec2.
    sudo mkdir /usr/local/ec2
    sudo unzip ec2-api-tools.zip -d /usr/local/ec2
    Note: the .zip file contains a folder ec2-api-tools-x.x.x.x, where x.x.x.x is the version number of the tools (for example, ec2-api-tools-1.7.0.0).

Tell the Tools Where Java Lives

The Amazon EC2 CLI tools require Java. If you don’t have Java 1.7 or later installed, download and install Java. Either a JRE or JDK installation is acceptable. To view and download JREs for a range of platforms, see Java Downloads.

Important

Instances that you launch using the Amazon Linux AMI already include Java.

The Amazon EC2 CLI tools read the JAVA_HOME environment variable to locate the Java runtime. This environment variable should specify the full path of the directory that contains a subdirectory named bin, which in turn contains the Java executable you installed (java, or java.exe if you use Cygwin).

To set the JAVA_HOME environment variable on Linux/Unix and Mac OS X

  1. You can verify whether you have Java installed and where it is located using the following command:
    $ which java

    The following is example output.

    /usr/bin/java

    If the previous command does not return a location for the Java binary, you need to install Java. For help installing Java on your platform, see Java Downloads.

    To install Java on Ubuntu systems, execute the following command:

    ubuntu:~$ sudo apt-get install -y openjdk-7-jre
  2. Find the Java home directory on your system. The which java command executed earlier returns Java’s location in the $PATH environment variable, but in most cases this is a symbolic link to the actual program; symbolic links do not work for the JAVA_HOME environment variable, so you need to locate the actual binary.
    1. (Linux only) For Linux systems, you can recursively run the file command on the which java output until you find the binary.
      $ file $(which java)
      /usr/bin/java: symbolic link to `/etc/alternatives/java'

      The /usr/bin/java location is actually a link to /etc/alternatives/java, so you need to run the file command on that location to see whether that is the real binary.

      $ file /etc/alternatives/java
      /etc/alternatives/java: symbolic link to `/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java'

      This returns a new location, which is the actual binary. Verify this by running the file command on this location.

      $ file /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
      /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java: ELF 64-bit LSB executable...

      This location is the actual binary (notice that it is listed as an executable). The Java home directory is where bin/java lives; in this example, the Java home directory is /usr/lib/jvm/java-7-openjdk-amd64/jre.

    2. (Mac OS X only) For Mac OS X systems, the /usr/libexec/java_home command returns a path suitable for setting the JAVA_HOME variable.
      $ /usr/libexec/java_home
      /System/Library/Java/JavaVirtualMachines/1.7.0_55.jdk/Contents/Home
  3. Set JAVA_HOME to the full path of the Java home directory.
    1. (Linux only) For the Linux example above, set the JAVA_HOME variable to the directory where bin/java was located in Step 2.a.
      $ export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64/jre"

      Note

      If you are using Cygwin, JAVA_HOME should contain a Windows path.

    2. (Mac OS X only) For the Mac OS X example above, set the JAVA_HOME variable to $(/usr/libexec/java_home). The following command sets this variable to the output of the java_home command; the benefit of setting the variable this way is that it updates to the correct value if you change the location of your Java installation later.
      $ export JAVA_HOME=$(/usr/libexec/java_home)
  4. You can verify your JAVA_HOME setting using this command.
    $ $JAVA_HOME/bin/java -version

    If you’ve set the environment variable correctly, the output looks something like this.

    java version "1.7.0_55"
    OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu0.12.04.2)
    OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
  5. Add this environment variable definition to your shell startup script so that it is set every time you log in or spawn a new shell. The name of this startup file differs across platforms (on Mac OS X, this file is commonly called ~/.bash_profile, and on Linux, it is commonly called ~/.profile), but you can find it with the following command:
    $ ls -al ~ | grep profile

    If the file does not exist, you can create it. Use your favorite text editor to open the file listed by the previous command, or to create a new file with that name. Then edit it to add the variable definition you set in Step 3 (a minimal example is shown after this procedure).

  6. Verify that the variable is set properly for new shells by opening a new terminal window and testing that the variable is set with the following command.

    Note

    If the following command does not correctly display the Java version, try logging out, logging back in again, and then retrying the command.

    $ $JAVA_HOME/bin/java -version
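
For example, a minimal sketch of steps 5 and 6 on a Linux system, assuming ~/.profile is the startup file and the OpenJDK path found earlier:

    # append the definition from Step 3 to the shell startup file (~/.profile assumed here)
    echo 'export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64/jre"' >> ~/.profile
    # load it into the current shell and verify
    source ~/.profile
    $JAVA_HOME/bin/java -version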

Reference: http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html#setting_up_ec2_command_linux


Stages in Hive

A Hive job consists of one or more stages, with dependencies between them. As you might expect, more complex queries usually involve more stages, and more stages usually require more processing time to complete. A stage can be a MapReduce job, a sampling stage, a merge stage, a limit stage, or a stage for some other task Hive needs to do. By default, Hive executes these stages one at a time. However, a particular job may consist of stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly.

Setting hive.exec.parallel to true enables parallel execution. If a job runs more of its stages in parallel, it will also increase its cluster utilization:

<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
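
The same setting can also be enabled for a single session instead of globally in hive-site.xml; a minimal sketch from the shell (the query and table names are placeholders):

# enable parallel stage execution for just this invocation (hypothetical query)
hive -e "set hive.exec.parallel=true; SELECT dept, COUNT(*) FROM employees GROUP BY dept;"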
If a stage fails, Hive cleans up the process and reports the error; if a stage succeeds, Hive executes subsequent stages until the entire job is done. Multiple Hive statements can also be placed in a single HQL file, and Hive will execute each query in sequence until the file is completely processed.
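
The stage plan discussed below is the kind of output EXPLAIN produces. A minimal sketch of generating it from the shell, assuming a table named onecol with a single int column named number (as the plan itself suggests):

hive -e "EXPLAIN SELECT SUM(number) FROM onecol;"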

The STAGE PLAN section is verbose and complex. Stage-1 is the bulk of the processing for this job and happens via a MapReduce job. A TableScan takes the input of the table and produces a single output column, number. The Group By Operator applies sum(number) and produces an output column _col0 (a synthesized name for an anonymous result). All this happens on the map side of the job, under the Map Operator Tree:

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        onecol
          TableScan
            alias: onecol
            Select Operator
              expressions:
                    expr: number
                    type: int
              outputColumnNames: number
              Group By Operator
                aggregations:
                      expr: sum(number)
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order:
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint

On the reduce side, under the Reduce Operator Tree, we see the same Group By Operator, but this time it applies sum on _col0. Finally, in the reducer we see the File Output Operator, which shows that the output will be text, based on the string output format HiveIgnoreKeyTextOutputFormat:

      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Because this job has no LIMIT clause, Stage-0 is a no-op stage:
  Stage: Stage-0
    Fetch Operator
      limit: -1

Understanding the intricate details of how Hive parses and plans every query is not always useful. However, it is helpful when analyzing complex or poorly performing queries, especially as we try various tuning steps: we can observe what effect these changes have at the "logical" level, in tandem with performance measurements.
When you type a query through the CLI interface, the HiveQL statement is handled by the Driver component. The Driver connects a set of modules that transform the statement into MapReduce jobs to be run in Hadoop. It is important to note that the query is not translated into Java code in this process; it goes directly to MapReduce jobs. The modules involved are the Parser, Semantic Analyzer, Logical Plan Generator, Optimizer, Physical Plan Generator, and Executor.

Prior Support for MAPJOIN

Hive supports MAPJOINs, which are well suited for joining a large fact table against small dimension tables, at least for dimensions small enough to fit in memory. A MAPJOIN can be invoked either through an optimizer hint:

select /*+ MAPJOIN(time_dim) */ count(*) from
store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
or via auto join conversion:

set hive.auto.convert.join=true;
select count(*) from
store_sales join time_dim on (ss_sold_time_sk = t_time_sk)
MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through.

Local work:
  - read records via standard table scan (including filters and projections) from the source table on the local machine
  - build a hashtable in memory
  - write the hashtable to local disk
  - upload the hashtable to DFS
  - add the hashtable to the distributed cache
Map task:
  - read the hashtable from local disk (distributed cache) into memory
  - match the records' keys against the hashtable
  - combine matches and write to output
No reduce task
Limitations of Current Implementation

The current MAPJOIN implementation has the following limitations:

- The MAPJOIN operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
- Hints are cumbersome for users to apply correctly, and auto conversion doesn't have enough logic to consistently predict whether a MAPJOIN will fit into memory.
- A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery….). Auto conversion will never produce a single map-only job.
- The hashtable for the MAPJOIN operator has to be generated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.
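
As a practical note, auto conversion is usually paired with a size threshold that decides whether the small table can be loaded into memory; a minimal sketch, assuming the hive.mapjoin.smalltable.filesize parameter (the 50 MB value is only illustrative):

# enable auto join conversion and (illustratively) allow small tables up to ~50 MB
hive -e "set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;
select count(*) from store_sales join time_dim on (ss_sold_time_sk = t_time_sk);"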


Significant Parameters in Hive

hive.join.cache.size

Default Value: 25000
Added In:
How many rows in the joining tables (except the streaming table) should be cached in memory.
hive.map.aggr

Default Value: true
Added In:
Whether to use map-side aggregation in Hive Group By queries.

mapred.reduce.tasks

Default Value: -1
Added In: 0.1
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. With this property set to -1, Hive will automatically determine the appropriate number of reducers.

hive.exec.reducers.bytes.per.reducer

Default Value: 1000000000
Added In:
Size per reducer. The default is 1,000,000,000 bytes (roughly 1 GB); for example, if the total input size is 10 GB, Hive will use 10 reducers.

hive.exec.compress.output

Default Value: false
Added In:
This controls whether the final output of a query (to a local/HDFS file or a Hive table) is compressed. The compression codec and other options are determined from the Hadoop configuration variables mapred.output.compress*.

hive.exec.compress.intermediate

Default Value: false
Added In:
This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from the Hadoop configuration variables mapred.output.compress*.

hive.exec.parallel

Default Value: false
Added In:
Whether to execute jobs in parallel.
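
Most of these parameters can also be overridden per session or per query from the Hive CLI instead of in hive-site.xml; a minimal sketch (the query, table, and chosen values are only illustrative):

# override a few settings for a single invocation (hypothetical query and values)
hive -e "set hive.exec.parallel=true;
set hive.exec.compress.intermediate=true;
set hive.exec.reducers.bytes.per.reducer=500000000;
SELECT dept, COUNT(*) FROM employees GROUP BY dept;"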


kill hadoop job and child processes

hadoop job -kill <my_job_id>

Use pkill -f, which matches the pattern against any part of the command line:

pkill -f my_pattern

pkill -f Child
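
Putting the two together, a minimal sketch (run the pkill on the node where leftover task JVMs survive; <my_job_id> is a placeholder, as above):

# kill the job through the JobTracker
hadoop job -kill <my_job_id>
# task JVMs are launched with "Child" in their command line, so pkill -f matches them
pkill -f Child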


OpenNebula (IaaS) – Documentation

homepage: http://opennebula.org/start

blog: http://blog.opennebula.org/

Grid 5000 deploy and install:

An older page:

https://www.grid5000.fr/mediawiki/index.php/Deploying_and_Using_IaaS_Clouds_on_Grid’5000

A newer, updated page:

https://www.grid5000.fr/mediawiki/index.php/Deployment_Scripts_for_IaaS_Clouds_on_Grid%275000

for a beginner:
To test:

OpenNebula vs OpenStack

http://blog.opennebula.org/?p=4042

http://blog.opennebula.org/?p=4372

http://alax.me/post/22080094990/choosing-a-cloud-platform-my-impressions

http://www.linkedin.com/groups/OpenStack-vs-Eucalyptus-vs-OpenNebula-2685473.S.54382975

a master thesis (focus on Eucalyptus): https://docs.google.com/viewer?url=http://web.it.kth.se/~maguire/DEGREE-PROJECT-REPORTS/101118-Victor_Delgado-with-cover.pdf


Linux Tips

Set PATH

PATH=/usr/bin:/usr/local/bin:.

This is a very important environment variable. It sets the list of directories the shell searches when it has to execute any program. Remember that entries are separated by a ':' (colon). You can add any number of directories to this list; the directories shown above are just an example.

Note: the last entry in the PATH above is a '.' (period). This is an important addition you could make in case it is not already present on your system. The period indicates the current directory in Linux. Whenever you type a command, Linux searches for that program in all the directories in its PATH; since the period is in the PATH, Linux also looks in the current directory (the directory from which you execute the command). Thus, whenever you execute a program that is present in the current directory (perhaps a script you have written yourself), you don't have to type './programname'; you can simply type 'programname', since the current directory is already in your PATH.

Remember that the PATH variable is a very important variable. If you want to add a particular directory to your PATH and you type the following:

PATH=/newdirectory

this would replace the current PATH value with the new value. What you usually want is to append the new directory to the existing PATH value. For that to happen, you should type:

PATH=$PATH:/newdirectory

This adds the new directory to the existing PATH value. A $VARIABLE reference is always substituted with the current value of the variable.
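
To make such a change survive new shells, the same line can be appended to your shell startup file; a minimal sketch, assuming bash and ~/.bashrc (/newdirectory is a placeholder):

# persist the PATH change for future shells
echo 'export PATH=$PATH:/newdirectory' >> ~/.bashrc
# reload the file so the current shell picks up the new value
source ~/.bashrc
echo $PATH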


Some useful scripts

Include multiple jar files and run a java program

export JAR_HOME=/usr/local/hadoop-1.0.0
export JAR_LIB_HOME=/usr/local/hadoop-1.0.0/lib

# collect every jar directly under the Hadoop home directory
for f in $JAR_HOME/*.jar
do
  JAR_CLASSPATH=$JAR_CLASSPATH:$f
done

# collect every jar under the Hadoop lib directory
for g in $JAR_LIB_HOME/*.jar
do
  JAR_CLASSPATH=$JAR_CLASSPATH:$g
done

export JAR_CLASSPATH

# the next line prints the JAR_CLASSPATH to the shell
echo the classpath is $JAR_CLASSPATH

java -classpath $JAR_CLASSPATH org.apache.hadoop.tools.rumen.TraceBuilder ~/Desktop/test_rumen_output/job-trace.json ~/Desktop/test_rumen_output/topology.output ~/Desktop/test_rumen_data
