Running UNIX Commands with Hadoop Streaming
I gave it a try.
$ hadoop/bin/hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input '/input/attempt.tsv' \
    -output '/output' \
    -mapper "awk '{ num = split( $1, aryData, "," ); for ( i=1; i<=num; i++ ) { print aryData[i]"\t"$3 } }'" \
    -reducer '/usr/bin/uniq'
Got an error (・∀・)
java.lang.Exception: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
$ hadoop/bin/hadoop fs -rm -r /output
$ hadoop/bin/hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input '/input/attempt.tsv' \
    -output '/output' \
    -mapper /usr/bin/awk -F$'\t' '{ num = split( $1, aryData, "," ); for ( i=1; i<=num; i++ ) { print aryData[i]\"\t\"$3 } }' \ ← this line seems to be the problem, but I couldn't figure out a workaround and gave up
    -reducer '/usr/bin/uniq'
It works fine when I run it locally, though... I think the first attempt fails because the -mapper argument is wrapped in double quotes while the awk program also uses double quotes inside, so the quoting gets mangled (and in the second attempt the program isn't quoted at all, so presumably only /usr/bin/awk reaches -mapper and the rest spills into separate arguments).
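Streaming also seems to split the -mapper string on whitespace and run it without going through a shell, which would explain why inline quoting is so painful here. One workaround I haven't tested: put the awk program in its own file and ship it with -file, the same way as the Perl script further down (map.awk and its path are names I made up for this sketch):

# map.awk: split the comma-separated first field and emit each key
# paired with the third field; relies on awk's default whitespace
# field splitting, which should be fine as long as no field contains spaces
{
    num = split($1, aryData, ",")
    for (i = 1; i <= num; i++) {
        print aryData[i] "\t" $3
    }
}

$ hadoop/bin/hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input '/input/attempt.tsv' \
    -output '/output' \
    -mapper '/usr/bin/awk -f map.awk' \
    -reducer '/usr/bin/uniq' \
    -file 'hadoop/script/map.awk'

Anyway, here's the local run that works: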
$ cat hadoop/input/attempt.tsv | awk '{ num = split( $1, aryData, "," ); for ( i=1; i<=num; i++ ) { print aryData[i]"\t"$3 } }' | LC_ALL=C sort | uniq
a	ccc
aaa	ccc
aaa	ddd
b	ddd
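(attempt.tsv itself isn't shown here, but a tab-separated file along these lines would reproduce the output above — a reconstruction, with made-up placeholders in the unused second column:)

a,aaa	xxx	ccc
aaa,b	yyy	ddd
aaa	zzz	ccc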
$ hadoop/bin/hadoop fs -rm -r /output
$ hadoop/bin/hadoop jar hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -input '/input/attempt.tsv' \
    -output '/output' \
    -mapper '/usr/bin/perl awk.pl' \
    -reducer '/usr/bin/uniq' \
    -file 'hadoop/script/awk.pl'
$ hadoop/bin/hadoop fs -cat /output/part-00000
a	ccc
aaa	ccc
aaa	ddd
aaa	ccc
b	ddd
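Note the duplicate aaa ccc that the local run deduplicated but Hadoop didn't: presumably the shuffle sorts only on the key (the part before the first tab), so the values under aaa arrive in arbitrary order, and uniq only drops adjacent duplicates.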
Here's awk.pl, for what it's worth.
#!/usr/bin/perl
use strict;
use warnings;

# Same logic as the awk one-liner: split the comma-separated first
# field on "," and print each key paired with the third field.
while ( <> ) {
    chomp;
    my @data = split /\t/, $_;
    my @key  = split /,/, $data[0];
    for ( my $i = 0; $i < @key; $i++ ) {
        print $key[$i] . "\t" . $data[2] . "\n";
    }
}
(If anyone knows how to write the mapper part in awk, I'd be grateful if you could tell me m(_ _)m)