Hadoop Hacking on Yahoo! Ad Data

At the CMU Hackday we’re letting students play with an anonymized snapshot of our advertising data. (If you want access, email Jamie, sign something, and we’ll give you a key.)

Basically, we have a cluster of EC2 machines running Hadoop with the data already loaded. So, of course, I wanted to play.

Here is the README about the data:

(1) "ydata-ysm-keyphrase-bid-imp-click-v1_0.gz" contains the following fields:

    0 day
    1 anonymized account_id
    2 rank
    3 anonymized keyphrase (expressed as list of anonymized keywords)
    4 avg bid
    5 impressions
    6 clicks

Snippet:

1       08bade48-1081-488f-b459-6c75d75312ae    2       2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0     2.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    3       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    2       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
11      08bade48-1081-488f-b459-6c75d75312ae    1       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     2.0     0.0
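To make the field layout concrete, here is a small sketch of parsing one record. It is written for Python 3, while the scripts later in this post ran under the cluster's Python 2. The names `Record` and `parse_record` are made up for illustration, and I'm assuming the fields are tab-separated and the keywords inside a keyphrase are space-separated, as the snippet suggests.

```python
from collections import namedtuple

# Field names follow the README order. Record and parse_record are
# illustrative names, not part of the dataset or of Hadoop.
Record = namedtuple(
    "Record", "day account_id rank keywords avg_bid impressions clicks"
)


def parse_record(line):
    """Parse one tab-separated line, or return None if it is malformed."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        return None
    day, account_id, rank, keyphrase, bid, impressions, clicks = cols
    try:
        return Record(
            day=int(day),
            account_id=account_id,
            rank=int(rank),
            keywords=keyphrase.split(),  # assumed: keywords separated by spaces
            avg_bid=float(bid),
            impressions=float(impressions),
            clicks=float(clicks),
        )
    except ValueError:
        return None


# First line of the snippet above, rebuilt with tabs between the seven fields.
sample = "\t".join([
    "1",
    "08bade48-1081-488f-b459-6c75d75312ae",
    "2",
    "2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c "
    "fca90e7f73f3c8ef af26d27737af376a",
    "100.0",
    "2.0",
    "0.0",
])
record = parse_record(sample)
print(record.rank, len(record.keywords), record.avg_bid, record.clicks)
```

Returning None for bad lines mirrors what the mapper below does when it skips any line that doesn't have exactly seven columns.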

I like Python, and Hadoop streaming lets me use it for MapReduce. You basically write two scripts, map.py and reduce.py, which read from stdin and write to stdout. Here is me making those files:

mkdir money_made_rank
vim money_made_rank/map.py
<code it>
vim money_made_rank/reduce.py
<code it>
chmod +x money_made_rank/map.py money_made_rank/reduce.py

map.py:

#!/usr/bin/python
import sys

# Emit "rank<TAB>money" for every record that got at least one click.
for line in sys.stdin:
    cols = line.strip().split('\t')
    # Skip malformed lines that don't have all seven fields.
    if len(cols) != 7:
        continue
    day, account, rank, keyphrase, bid, impressions, clicks = cols
    clicks = float(clicks)
    if clicks == 0:
        continue
    # Money is approximated as clicks times the average bid.
    money = clicks * float(bid)
    print("%s\t%f" % (rank, money))

reduce.py:

#!/usr/bin/python
import sys
from operator import itemgetter

# Running total of money per rank.
result = {}

for line in sys.stdin:
    try:
        key, money = line.strip().split('\t', 1)
        result[key] = result.get(key, 0) + float(money)
    except ValueError:
        # Skip lines that aren't "key<TAB>number".
        continue

# Keys are strings, so this sorts them as text ("10" comes before "2").
for key, money in sorted(result.items(), key=itemgetter(0)):
    print("%s\t%f" % (key, money))
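The reducer above keeps every key in a dictionary, which is fine for a handful of ranks. Since the reducer's input arrives sorted by key (from `sort` locally, from Hadoop's shuffle on the cluster), you could also sum one key at a time. Here is a minimal sketch of that variant in Python 3 using `itertools.groupby`. The function names `parse` and `sum_by_key` and the sample lines are made up for illustration. This is an alternative, not the script that produced the numbers in this post.

```python
from itertools import groupby


def parse(lines):
    """Yield (key, value) pairs, skipping lines that aren't 'key<TAB>float'."""
    for line in lines:
        try:
            key, value = line.rstrip("\n").split("\t", 1)
            yield key, float(value)
        except ValueError:
            continue


def sum_by_key(lines):
    """Sum values per key, assuming equal keys are adjacent (sorted input)."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)


# Hypothetical mapper output, already sorted by key, with one junk line.
lines = ["1\t100.0\n", "1\t50.0\n", "2\t25.5\n", "garbage\n", "2\t4.5\n"]
for key, total in sum_by_key(lines):
    print("%s\t%f" % (key, total))
```

In a real streaming reducer you would pass `sys.stdin` instead of the sample list. The trade-off is that this version depends on the input being sorted, while the dictionary version doesn't care about order.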

Then you should test your stuff locally. For that, we left the .gz file on disk, and I just ran:

zcat /mnt/data/ydata-ysm-keyphrase-bid-imp-click-v1_0.gz | head -n 1000 | money_made_rank/map.py | sort | money_made_rank/reduce.py

And if it spits out

1       3540.000000
2       14604.489767
3       13516.602689
4       2668.682927
5       2250.000000
6       540.000000
7       540.000000

then you’re doing it right.

If you want to run on one machine, just take out the head -n 1000. That should take about 20 minutes to chew through all the data.

Let’s move to Hadoop

Once it works in piping mode, running it on the cluster is very simple. Don’t change any files, just type:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-14.cloudera.CH0_3-streaming.jar \
  -input /data/ydata/* \
  -output money_made_rank \
  -mapper money_made_rank/map.py \
  -reducer money_made_rank/reduce.py \
  -file money_made_rank/*

This will print out a whole bunch of stuff and at the end you should have a money_made_rank directory. Just print it and bask in the glory:

hadoop fs -cat 'money_made_rank/part-*' | sort -n

and it should print out:

1 14743915410.559452
2 5671857020.978109
3 3580521727.805751
4 1770068342.652887
5 1141008200.372228
6 531839794.947136
7 360624250.037246
8 266172741.734491
9 213458413.893067
10 189563018.302472
...

And then you can put it in a spreadsheet and make a pretty chart. Did you know half of our money comes from the #1 Search Ad?
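If you'd rather skip the spreadsheet, the share per rank is a quick calculation. This sketch in Python 3 uses only the ten totals printed above, so the percentages are shares of those ten ranks. The real denominator would also include the ranks hidden behind the `...`, which would pull every share down a little. Keep in mind that "money" here is clicks times average bid, as computed in map.py.

```python
# Totals for ranks 1-10, copied from the job output above.
totals = {
    1: 14743915410.559452,
    2: 5671857020.978109,
    3: 3580521727.805751,
    4: 1770068342.652887,
    5: 1141008200.372228,
    6: 531839794.947136,
    7: 360624250.037246,
    8: 266172741.734491,
    9: 213458413.893067,
    10: 189563018.302472,
}

# Denominator covers only the ten ranks shown, not the full output.
shown_total = sum(totals.values())
shares = {rank: money / shown_total for rank, money in totals.items()}

for rank in sorted(shares):
    print("%2d  %5.1f%%" % (rank, 100 * shares[rank]))
```

Rank 1 comes out at a bit over half of the ten ranks shown, which is where the "half of our money" claim comes from.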

We won hack day!

Friday was my favorite quarterly event at Yahoo, hackday! It’s the internal version of the hacku event that I help with. I want to chronicle my experience since it was my favorite hackday to date. I’m being overly vague about the project since it might actually get shipped.

This year, I had an idea mulling in my head, talked to a few people, shared it on the internal hacker list and got a ton of great buzz. I started the ball rolling with some editorial work and design. Then our SearchMonkey community manager, Evan Goer, leapt on the idea and did a ton more of the editorial work. His enthusiasm was contagious.

When hackday finally came around, Evan was definitely onboard, but it was still a daunting task. He wanted to help in any way he could, so I gave him a small coding task in PHP. Lo and behold, he pulled it off with flying colors. I think we should get huge bonus points for a project manager coding on the project 🙂 He sat next to me at the start of hackday, went out for dinner, and then CAME BACK to finish it up. What a trooper.

Right as I started to code, one of the SearchMonkey ops guys, Brett Proctor, IMed me offering his services. He told me he could barely code and had to leave that night at 9pm, so I gave him a “store and retrieve” webservice to work on. He pulled it off amazingly, and again, we should get more bonus points for an operations guy building the whole backend 🙂

Prior to hackday, in the email exchange, the SearchMonkey UI guy (notice a trend here?) Micah Alpern suggested it to one of his designers, Kara Mccain. She’s the one that did the SearchMonkey logo, so I was elated to have her onboard. She cooked up some sexy designs for the project and then passed out from the exhaustion that the designs caused 🙂 She stopped by the hack room, and then left before Evan got there, so as far as the rest of the team was concerned, she was “remote” much like Brett.

While Brett was doing his work, he showed the project to Reid Burke who then obviously wanted in. The only non-SearchMonkey guy, but we won’t hold it against him. Reid came to the hack room after Evan left, and hung out with me until the wee hours of the morning. Reid came into the project with a self-contained piece so it was easier to integrate. By the time I left at 5:00 am, Reid’s part still wasn’t working, but I guess my threats to take him off the team worked, because when I came back at noon the next day, Reid’s part magically worked. 🙂

I’d like to note that there was a handful of people in the hack room who contributed in one way or another to the hack. Matt Claypotch made up a lot of great titles, had some witty banter all night and made us a pretty picture too! Philip Tellis was Mr. Knowledgeable sitting in the corner. I would ask a question to the air, Matt would say something snarky, Reid would join in, and then when it simmered down, Philip would quietly answer the question wonderfully. Oh, and Eric Wu was hammering away on the designs for all the little things to make hackday work. Thanks guys, the hack room was great!

We were presenting as #48 out of 92. #46 was my other hack with Yury Lifshits, webnumbr.com. And then #47 was Reid’s other hack. So a big showing all in a row kind of worried me a bit. Right before I was going to present, Eric Wu called a break for pizza. So I went back, got my carb and sugar rush going, and then set up to be first after the break. It was kind of nice placement, since people were actually ready to pay attention. Not to mention I was wearing my wizard hacking hat, which helped the intrigue.

I started off pointing out Jerry Yang was on his 3rd beer, so that boded well for his coercibility. Ten seconds in, I realized I hadn’t cleared my cookies from all the testing I had just done, so nothing was working. I mentioned that no one should pay attention to the man behind the curtain as I quickly “⌘,” -> clear cookies. The presentation went well with lots of laughs and clapping. Lots of people in the audience smiled at me and gave me high fives. It was good. Brett couldn’t get the live streaming working, so he tail -f’ed the Apache logs to see what I was showing 🙂

When the judges came back we were just sitting around after Havi Hoffman gave away some old shirts. I had been to many hack days before this and hadn’t won, so I was already over the emotionally crushing experience of not winning and didn’t care anymore. When Prabhakar Raghavan announced that the most innovative entry award was going to webnumbr, Yury and I were elated! We skipped up and got our awesome bright orange t-shirts with pride. I was so excited that I didn’t pay attention when Ash Patel gave the prize for the most fun hack to us! Sadly Brett and Kara couldn’t be there, but Evan and Reid and I bounded up to get our shirts (I didn’t take a second). We got our picture taken as we were hugging and it was just a great experience.

I chatted for a while with everyone and then walked back with Yury and Evan, talking about the future of our stuff. I’m sure on Monday the joy will wear off and the reality of our real jobs will set in, but maybe, just maybe, we can start another small project like SearchMonkey and see where it goes.

So yes, hack day was awesome, and I hope this tradition spreads and thrives.