Rock Band Group Algorithm

I often play Rock Band with my friends, but we have a tough time deciding who will play what part and for how long. Here is what we’ve come up with as requirements:

  • Everyone must play every instrument an equal amount of time – To avoid hogging
  • Every time a person is on an instrument, they must play with a different combination of people – To play on other people’s strengths and avoid others’ weaknesses
  • Sitting out many songs in a row is minimized

Does anyone else have other constraints they play with?

Given these requirements, here is what we do:

  1. Lay the 4 instruments out and randomly put 4 people on them, and intersperse the remaining people in the gaps between instruments
  2. Cycle everyone through every instrument and every gap (clockwise), until we arrive back at the initial state
  3. Shuffle everyone like this:
    1. Group the people into 2 groups. Those on instruments (size 4) and those not (size n-4)
    2. Pick 4 people from the non-instruments and randomly put them on instruments
    3. If you have fewer than 4 people on non-instruments, select random people to stay on instruments, but change their instruments randomly
    4. Randomly permute the remaining people into the open spots
  4. Play another round and then re-cycle
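Here is a rough Python sketch of the steps above. It is not the script whose output appears further down; the names `make_slots`, `rotate` and `shuffle_round` are made up for illustration, and "change their instruments randomly" is approximated with a plain shuffle, so a person who stays on can occasionally land on the same instrument again.

```python
import random

INSTRUMENTS = ["Drums", "Vocals", "Guitar", "Bass"]


def make_slots(n_people):
    """Ring of slot labels: the 4 instruments with n-4 gaps spread between them."""
    if n_people < len(INSTRUMENTS):
        raise ValueError("need at least 4 people")
    base, extra = divmod(n_people - len(INSTRUMENTS), len(INSTRUMENTS))
    slots = []
    for i, instrument in enumerate(INSTRUMENTS):
        slots.append(instrument)
        # The last `extra` instruments each get one additional gap after them.
        n_after = base + (1 if i >= len(INSTRUMENTS) - extra else 0)
        slots.extend(["Gap"] * n_after)
    return slots


def rotate(people):
    """Step 2, one move: everyone shifts to the next slot around the ring."""
    return people[-1:] + people[:-1]


def shuffle_round(slots, people, rng=random):
    """Step 3: people in gaps get instruments first; everyone else fills in."""
    n = len(INSTRUMENTS)
    playing = [p for s, p in zip(slots, people) if s != "Gap"]
    sitting = [p for s, p in zip(slots, people) if s == "Gap"]
    rng.shuffle(playing)
    rng.shuffle(sitting)
    next_players = sitting[:n]
    # Fewer than 4 people sitting out: some current players stay on.
    stay = n - len(next_players)
    next_players += playing[:stay]
    next_sitting = sitting[n:] + playing[stay:]
    rng.shuffle(next_players)  # random instrument for each player
    rng.shuffle(next_sitting)  # random gap for everyone else
    players, sitters = iter(next_players), iter(next_sitting)
    return [next(sitters) if s == "Gap" else next(players) for s in slots]


slots = make_slots(7)
people = ["Paul", "Michelle", "Martin", "Melanie", "Surbhi", "Ziga", "Emily"]
random.shuffle(people)                 # step 1: random starting positions
for _ in range(len(people)):           # step 2: one full cycle, a song per move
    people = rotate(people)
people = shuffle_round(slots, people)  # step 3: regroup before the next cycle
```

With exactly 8 people the two groups of 4 simply trade places on every shuffle, so the same people keep ending up together.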


  • Designate one guitar as the Guitar, and one as the Bass. We choose the one with the solo fretboard as the Guitar.
  • The singer picks the song since it is the hardest to do if you don’t know the song.
  • If you fail a song when you are 50% of the way through it, you move on to the next instruments. If you fail again, you add your percentages together.
  • We don’t replay the same song that we already played unless we REALLY need to.

This can work for other games, like Guitar Hero or even non-music games. Just change the number 4 to however many slots you have and run it.

I’m interested in how other people do this. I like fairness, but I also like not having a complicated system that requires whiteboards, a master planner, and a dictator.

And what kind of geek would I be if I didn’t have some code to solve this problem?

$ python Paul Michelle Martin Melanie Surbhi Ziga Emily
Drums:  Martin
Vocals: Ziga
 Gap:   Melanie
Guitar: Surbhi
 Gap:   Paul
Bass:   Michelle
 Gap:   Emily
Press <Enter> once you finished a round: 

Drums:  Paul
Vocals: Ziga
 Gap:   Surbhi
Guitar: Emily
 Gap:   Michelle
Bass:   Melanie
 Gap:   Martin
Press <Enter> once you finished a round: 

Drums:  Surbhi
Vocals: Martin
 Gap:   Paul
Guitar: Michelle
 Gap:   Melanie
Bass:   Emily
 Gap:   Ziga
Press <Enter> once you finished a round: q

This code doesn’t work very well with 8 people as they will always be playing with the same group, but at least their instruments will change. All other sizes should work well.

File Extensions on the Internet

I had a simple question to which I couldn’t find an answer.

Which file extensions are used on the internet?

So I wrote a little program, and half a million calls to Google later, we have some interesting data.

First, the raw data:

Top 10:

  1. 6 700 000 000
  2. 5 980 000 000
  3. 1 690 000 000
  4. 1 510 000 000
  5. 1 380 000 000
  6. 565 000 000
  7. 385 000 000
  8. 298 000 000
  9. 242 000 000
  10. 199 000 000

Some interesting facts I saw:

  • There are 1305 unused 3 letter extensions out of the possible 17,576. That is, about 92.6% are already used for something. (There IS a lot of junk though, so don’t be TOO alarmed).
  • There are a lot of common extensions that I have NO idea what they are for. .e? .nhn?
  • 4x more pages are html instead of just htm.
  • PHP is beating ASP by about 2x.
  • Many servers serve HTML from image extensions, and jpg > png == gif > svg > jpeg > bmp > tiff
  • Naming is mostly not biased by first letter. The empty part is 3 letter extensions starting with y.

  • Only the top 5,000 extensions have more than 1000 pages.
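The 3 letter numbers in the first bullet are quick to check. This is just the arithmetic, assuming extensions are made of the 26 lowercase letters only:

```python
# All possible 3 letter extensions, assuming letters a-z only.
total = 26 ** 3
unused = 1305            # figure quoted in the list above
used = total - unused
percent_used = round(100.0 * used / total, 1)

print(total, used, percent_used)  # 17576 16271 92.6
```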

Some caveats

  • This was done in October 2009; things might change. I’ll rerun it if people leave comments.
  • I only looked for extensions up to 4 letters. No numbers or funky symbols.
  • I am assuming the counts on Google’s search results are ACTUALLY correct.
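For a sense of scale, here is a small sketch that enumerates every candidate extension of 1 to 4 lowercase letters. It is not the program I ran against Google (the search part is left out entirely), and the `extensions` helper is a made-up name; it only shows where the "half a million" comes from.

```python
from itertools import product
from string import ascii_lowercase


def extensions(max_len=4):
    """Yield every lowercase-letter extension of length 1..max_len."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


total = sum(1 for _ in extensions())
print(total)  # 475254, i.e. 26 + 26**2 + 26**3 + 26**4
```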

If anyone makes any interesting observations with this data, please let me know and I’ll post it here. Pretty graphs are welcome as well 🙂

Hadoop Hacking on Yahoo! Ad Data

At the CMU Hackday we’re letting students play with an anonymized snapshot of our advertising data. (If you want access, email Jamie, sign something, and we’ll give you a key).

Basically, we have a cluster of EC2 machines running hadoop with the data loaded to play with. So, of course, I wanted to play.

Here is the README about the data:

(1) "ydata-ysm-keyphrase-bid-imp-click-v1_0.gz" contains the following fields:

    0 day
    1 anonymized account_id
    2 rank
    3 anonymized keyphrase (expressed as list of anonymized keywords)
    4 avg bid
    5 impressions
    6 clicks


1       08bade48-1081-488f-b459-6c75d75312ae    2       2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0     2.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    3       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    2       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
11      08bade48-1081-488f-b459-6c75d75312ae    1       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     2.0     0.0

I like python, and hadoop streaming lets me use it for map reducing. You basically write two scripts, a mapper and a reducer, which read from stdin and write to stdout. Here is me making those files:

mkdir money_made_rank
vim money_made_rank/
<code it>
vim money_made_rank/
<code it>

import sys

# Mapper: emit "rank<TAB>money" for every row that got at least one click.
for line in sys.stdin:
    line = line.strip()
    cols = line.split('\t')
    if len(cols) != 7:
        continue  # skip malformed rows
    day, account, rank, keyphrase, bid, impressions, clicks = cols
    clicks = float(clicks)
    if clicks == 0:
        continue  # no clicks, no money
    bid = float(bid)
    money = clicks * bid
    print("%s\t%f" % (rank, money))

import sys
from operator import itemgetter

# Reducer: sum the money for each rank key coming from the mapper.
result = {}

for line in sys.stdin:
    line = line.strip()
    key, money = line.split('\t', 1)
    try:
        money = float(money)
        result[key] = result.get(key, 0) + money
    except ValueError:
        continue  # ignore values that are not numbers

sorted_result = sorted(result.items(), key=itemgetter(0))
for key, money in sorted_result:
    print("%s\t%f" % (key, money))

Then you should test your stuff locally. For that, we left the .gz file and I just ran:

zcat /mnt/data/ydata-ysm-keyphrase-bid-imp-click-v1_0.gz | head -n 1000 | money_made_rank/ | sort | money_made_rank/

And if it spits out

1       3540.000000
2       14604.489767
3       13516.602689
4       2668.682927
5       2250.000000
6       540.000000
7       540.000000

then you’re doing it right.
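If you want to check the logic without any pipes at all, the same map, sort, reduce flow fits in one small Python file. This is only a sketch: the function names are mine, and the rows are made-up values in the 7 column README format, not real data.

```python
from collections import defaultdict


def map_line(line):
    """Return (rank, money) for a clicked row, or None for rows to skip."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        return None
    day, account, rank, keyphrase, bid, impressions, clicks = cols
    clicks = float(clicks)
    if clicks == 0:
        return None
    return rank, clicks * float(bid)


def reduce_pairs(pairs):
    """Sum the money for each rank."""
    totals = defaultdict(float)
    for rank, money in pairs:
        totals[rank] += money
    return dict(totals)


def money_by_rank(lines):
    pairs = [p for p in map(map_line, lines) if p is not None]
    return reduce_pairs(sorted(pairs))  # sorted() plays the role of `sort`


# Hypothetical rows: day, account, rank, keyphrase, bid, impressions, clicks
rows = [
    "1\tacct-a\t1\tkw1 kw2\t100.0\t10.0\t2.0",
    "2\tacct-b\t1\tkw3\t50.0\t5.0\t1.0",
    "3\tacct-c\t2\tkw4\t30.0\t4.0\t0.0",  # zero clicks, skipped
    "garbage line",                        # wrong column count, skipped
]
print(money_by_rank(rows))  # {'1': 250.0}
```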

If you want to run on one machine then just take out the head -n 1000. That should take about 20 minutes to chew through all the data.

Let’s move to hadoop

Once it works in the piping mode, then it is very simple to just do it on the cluster. Don’t change any files, just type:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-14.cloudera.CH0_3-streaming.jar \
  -input /data/ydata/* \
  -output money_made_rank \
  -mapper money_made_rank/ \
  -reducer money_made_rank/ \
  -file money_made_rank/*

This will print out a whole bunch of stuff and at the end you should have a money_made_rank directory. Just print it and bask in the glory:

hadoop fs -cat 'money_made_rank/part-*' | sort -n

and it should print out:

1 14743915410.559452
2 5671857020.978109
3 3580521727.805751
4 1770068342.652887
5 1141008200.372228
6 531839794.947136
7 360624250.037246
8 266172741.734491
9 213458413.893067
10 189563018.302472

And then you can put it in a spreadsheet and make a pretty chart. Did you know half of our money comes from the #1 Search Ad?
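You don’t even need the spreadsheet to check the "half" claim. A quick sketch using only the ten ranks printed above (if there are more ranks in the full output, the real share is a little lower):

```python
# Money per rank, copied from the job output above (ranks 1-10 only).
money_by_rank = {
    1: 14743915410.559452,
    2: 5671857020.978109,
    3: 3580521727.805751,
    4: 1770068342.652887,
    5: 1141008200.372228,
    6: 531839794.947136,
    7: 360624250.037246,
    8: 266172741.734491,
    9: 213458413.893067,
    10: 189563018.302472,
}

total = sum(money_by_rank.values())
share = money_by_rank[1] / total
print("%.1f%%" % (100 * share))  # 51.8%
```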

Converting from MyISAM to InnoDB takes a long time

Wow, I didn’t think that with around 80 million rows, MySQL would take 7 hours to convert from MyISAM to InnoDB.

mysql> alter table metaward_achiever ENGINE=INNODB;
Query OK, 76756189 rows affected (6 hours 53 min 57.07 sec)
Records: 76756189  Duplicates: 0  Warnings: 0
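To put that in perspective, here is the conversion rate worked out from the numbers MySQL reported above (plain arithmetic, nothing measured separately):

```python
rows = 76756189
elapsed_seconds = 6 * 3600 + 53 * 60 + 57.07  # 6 hours 53 min 57.07 sec

rows_per_second = rows / elapsed_seconds
print(round(rows_per_second))  # 3090
```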

mysql> show create table metaward_achiever;
| metaward_achiever | CREATE TABLE `metaward_achiever` (
  `modified` datetime NOT NULL,
  `created` datetime NOT NULL,
  `award_id` int(11) NOT NULL,
  `alias_id` int(11) NOT NULL,
  `count` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `metaward_achiever_award_id` (`award_id`),
  KEY `metaward_achiever_alias_id` (`alias_id`)

Paul Is A Nerd

Good thing I named it, I’m no dork, dweeb, or nerd!

List of Guilded WoW players

I needed a list of as many World of Warcraft players as I could find. Sadly, Blizzard wasn’t giving this out on wowarmory and neither were any other sites I could find. Thankfully, I’m good at web scraping.

So, I went through each region, server, guild, and character on wowjutsu and pulled out 1.75 Million characters.

World of Warcraft claims over 10 Million players, wowjutsu claims 4 Million users, but I could only find 1.75 Million. Maybe the rest are unguilded or their guilds aren’t listed. Either way, 1.75 Million is good enough for me; hopefully it helps you out.


From the idea of progrium I did a fun little hack last night.

(mimetype or file extension) -> icon ==

In the vein of gravatar’s simple URLs, just add the file extension or mimetype onto the path, and you will get a good icon representing it. There are more options for choosing your icon set, size, and default, but head over to the root page to find out more.

It’s open source, so fork and fix. And if you have any ideas for good icon sets, please post a comment or email me and I’ll get them into the system.

We won hack day!

Friday was my favorite quarterly event at Yahoo, hackday! It’s the internal version of the hacku event that I help with. I want to chronicle my experience since it was my favorite hackday to date. I’m being overly vague about the project since it might actually get shipped.

This year, I had an idea mulling in my head, talked to a few people, shared it on the internal hacker list and got a ton of great buzz. I started the ball rolling with some editorial work and design. Then our SearchMonkey community manager, Evan Goer, leapt on the idea and did a ton more of the editorial work. His enthusiasm was contagious.

When hackday finally came around, Evan was definitely onboard, but it was still a daunting task. He wanted to help in any way he could so I gave him a small coding task in PHP. Lo and behold, he pulled it off with flying colors. I think we should get huge bonus points for a project manager coding on the project 🙂 He sat next to me at the start of hackday, and went out for dinner, and then CAME BACK to finish it up. What a trooper.

Right as I started to code, one of the SearchMonkey ops guys, Brett Proctor IMed me offering his services. He told me he can barely code and had to leave that night at 9pm, so I gave him some “store and retrieve” webservice to work on. He pulled it off amazingly, and again, we should get more bonus points for an operations guy building the whole backend 🙂

Prior to hackday, in the email exchange, the SearchMonkey UI guy (notice a trend here?) Micah Alpern suggested it to one of his designers, Kara Mccain. She’s the one that did the SearchMonkey logo, so I was elated to have her onboard. She cooked up some sexy designs for the project and then passed out from the exhaustion that the designs caused 🙂 She stopped by the hack room, and then left before Evan got there, so as far as the rest of the team was concerned, she was “remote” much like Brett.

While Brett was doing his work, he showed the project to Reid Burke who then obviously wanted in. The only non-SearchMonkey guy, but we won’t hold it against him. Reid came to the hack room after Evan left, and hung out with me until the wee hours of the morning. Reid came into the project with a self-contained piece so it was easier to integrate. By the time I left at 5:00 am, Reid’s part still wasn’t working, but I guess my threats to take him off the team worked, because when I came back at noon the next day, Reid’s part magically worked. 🙂

I’d like to note that there was a handful of people in the hack room that contributed in one way or another to the hack. Matt Claypotch made up a lot of great titles, had some witty banter all night and made us a pretty picture too! Philip Tellis was Mr. Knowledgeable sitting in the corner. I would ask a question to the air, Matt would say something snarky, Reid would join in, and then when it simmered down, Philip would quietly answer the question wonderfully. Oh, and Eric Wu was hammering away on the designs for all the little things to make hackday work. Thanks guys, the hack room was great!

We were presenting as #48 out of 92. #46 was my other hack with Yury Lifshits, and then #47 was Reid’s other hack. So a big showing all in a row kind of worried me a bit. Right before I was going to present, Eric Wu called a break for pizza. So I went back, got my carb and sugar rush going, and then set up to be first after the break. It was kind of nice placement, since people actually were ready to pay attention. Not to mention I was wearing my wizard hacking hat, which helped the intrigue.

I started off pointing out Jerry Yang was on his 3rd beer, so that boded well for his coercibility. 10 seconds in, I realized I didn’t clear my cookies from all the testing I just did, so nothing was working. I mentioned that no one should pay attention to the man behind the curtain as I quickly “⌘,” -> clear cookies. The presentation went well with lots of laughs and clapping. Lots of people in the audience smiled at me and gave me high fives. It was good. Brett couldn’t get the live streaming working, so he tail -f’ed the apache logs to see what I was showing 🙂

When the judges came back we were just sitting around after Havi Hoffman gave away some old shirts. I had been to many hack days before this and hadn’t won, so I was already over the emotionally crushing experience of not winning and didn’t care anymore. When Prabhakar Raghavan announced the most innovative entry going to webnumbr, Yury and I were elated! We skipped up and got our awesome bright orange t-shirts with pride. I was just so excited I didn’t pay attention to when Ash Patel gave away the prize for the most fun hack to us! Sadly Brett and Kara couldn’t be there, but Evan and Reid and I bounded up to get our shirts (I didn’t take a second). We got our picture taken as we were hugging and it was just a great experience.

I chatted for a while with everyone and then walked back with Yury and Evan, talking about the future of our stuff. I’m sure on Monday the joy will wear off and the reality of our real jobs will set in, but maybe, just maybe, we can start another small project like SearchMonkey and see where it goes.

So yes, hack day was awesome, and I hope this tradition spreads and thrives.

Online YAML parser

So, today I needed to verify some YAML was correct, and eyeball the JSON output. I couldn’t find a good tool online that did what I wanted, so I wrote my own in about an hour.

I give you the Online YAML Parser.

It takes in YAML and outputs JSON using PyYAML. Simple, but useful. Try some examples from the 1.2 spec, or paste in your own, and let me know of any bugs.
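The core of a tool like this is only a few lines. Here is a minimal sketch, not the site’s actual code: it assumes PyYAML is installed, the `yaml_to_json` name is made up, and using `safe_load` plus sorted, indented output is my choice for the example.

```python
import json

import yaml  # PyYAML, a third-party package


def yaml_to_json(text):
    """Parse one YAML document and return it as pretty-printed JSON."""
    data = yaml.safe_load(text)  # safe_load refuses arbitrary Python objects
    return json.dumps(data, indent=2, sort_keys=True)


print(yaml_to_json("name: parser\ntags: [yaml, json]\n"))
```

Invalid YAML raises `yaml.YAMLError`, which is what you would catch to show the user where their document went wrong.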

SearchMonkey Object Examples

So, a few weeks ago we released SearchMonkey Objects that let web developers mark up any of our 8 object formats for their pages. These pages included examples and explanations.

To help people understand it a bit more, I marked up a few real-life pages using our SearchMonkey objects.

(games and documents are identical to video).

So, I hope those help a bit. Try running them through the extraction tool. Let me know if you have any issues or have any more examples.