Rock Band Group Algorithm

I often play Rock Band with my friends, but we have a tough time deciding who will play what part and for how long. Here is what we’ve come up with as requirements:

  • Everyone must play every instrument an equal amount of time – to avoid hogging
  • Every time a person is on an instrument, they must play with a different combination of people – to play to other people’s strengths and avoid others’ weaknesses
  • Sitting out many songs in a row is minimized

Does anyone else have other constraints they play with?

Given these requirements, here is what we do:

  1. Lay the 4 instruments out, randomly put 4 people on them, and intersperse the remaining people in the gaps between instruments
  2. Cycle everyone through every instrument and every gap (clockwise), until we arrive back at the initial state
  3. Shuffle everyone like this:
    1. Group the people into 2 groups: those on instruments (size 4) and those not (size n-4)
    2. Pick 4 people at random from the non-instrument group and put them on instruments
    3. If you have fewer than 4 people off instruments, select random people to stay on instruments, but change their instruments randomly
    4. Randomly permute the remaining people into the open spots
  4. Play another round and then re-cycle
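The steps above can be sketched in a few lines of Python. This is a simplified sketch, not the actual rockband.py: the seating is flattened into one list, with positions 0–3 as the instruments and everything after as gaps.

```python
import random

INSTRUMENTS = ["Drums", "Vocals", "Guitar", "Bass"]

def make_lineup(players):
    """Step 1: shuffle everyone; the first 4 get instruments, the rest are gaps."""
    lineup = list(players)
    random.shuffle(lineup)
    return lineup

def rotate(lineup):
    """Step 2: cycle everyone one seat clockwise through instruments and gaps."""
    return lineup[-1:] + lineup[:-1]

def shuffle_round(lineup):
    """Step 3: everyone who sat out gets an instrument; if fewer than 4
    sat out, some players keep playing but swap instruments randomly."""
    on, off = lineup[:4], lineup[4:]
    new_on = random.sample(off, min(4, len(off)))
    if len(new_on) < 4:
        new_on += random.sample(on, 4 - len(new_on))
    random.shuffle(new_on)  # randomize which instrument each person gets
    leftovers = [p for p in lineup if p not in new_on]
    random.shuffle(leftovers)  # step 3.4: permute the remaining people
    return new_on + leftovers
```

Note that with exactly 8 players, `shuffle_round` always swaps the same two groups of 4, which is the weakness mentioned at the end of this post.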

Notes:

  • Designate one guitar as the Guitar, and one as the Bass. We choose the one with the solo fretboard as the Guitar.
  • The singer picks the song since it is the hardest to do if you don’t know the song.
  • If you fail a song but made it past 50%, you still move to the next instrument. If you fail again, you add your percentages together.
  • We don’t replay the same song that we already played unless we REALLY need to.

This can work for other games, like Guitar Hero or even non-music games. Just change the number 4 to however many slots you have and run it.

I’m interested in how other people do this. I like fairness, but I also like not needing a complicated system that requires whiteboards, a master planner, and a dictator.

And what kind of geek would I be if I didn’t have some code to solve this problem?

$ python rockband.py Paul Michelle Martin Melanie Surbhi Ziga Emily
Drums:  Martin
Vocals: Ziga
 Gap:   Melanie
Guitar: Surbhi
 Gap:   Paul
Bass:   Michelle
 Gap:   Emily
Press <Enter> once you finished a round: 

Drums:  Paul
Vocals: Ziga
 Gap:   Surbhi
Guitar: Emily
 Gap:   Michelle
Bass:   Melanie
 Gap:   Martin
Press <Enter> once you finished a round: 

Drums:  Surbhi
Vocals: Martin
 Gap:   Paul
Guitar: Michelle
 Gap:   Melanie
Bass:   Emily
 Gap:   Ziga
Press <Enter> once you finished a round: q

This code doesn’t work very well with 8 people as they will always be playing with the same group, but at least their instruments will change. All other sizes should work well.


File Extensions on the Internet


I had a simple question to which I couldn’t find an answer.

Which file extensions are used on the internet?

So I wrote a little program (webExtension.py) and a half million calls to Google later, we have some interesting data.
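webExtension.py itself isn’t shown, but the enumeration half of the job is simple. Here is a sketch; `count_pages` is a hypothetical stand-in for however the script actually asked Google for result counts:

```python
import itertools
import string

def candidate_extensions(max_len=4):
    """Yield every all-letter extension from 1 to max_len characters long."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(string.ascii_lowercase, repeat=n):
            yield "".join(combo)

def count_pages(ext):
    """Hypothetical stand-in: the real script presumably asked Google for
    the result count of a query like 'filetype:' + ext. The actual request
    code from webExtension.py is not shown in the post."""
    raise NotImplementedError

# 26**3 == 17,576 possible 3-letter extensions, matching the number below
```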

First, the raw data:

Top 10:

  html    6 700 000 000
  php     5 980 000 000
  htm     1 690 000 000
  asp     1 510 000 000
  aspx    1 380 000 000
  jsp       565 000 000
  cfm       385 000 000
  pdf       298 000 000
  do        242 000 000
  cgi       199 000 000

Some interesting facts I saw :

  • There are 1305 unused 3-letter extensions out of the possible 17,576. That is, 92.5% are already used for something. (There IS a lot of junk though, so don’t be TOO alarmed.)
  • There are a lot of common extensions that I have NO idea what they are for. .e? .nhn?
  • About 4x more pages are .html instead of just .htm.
  • PHP is beating ASP by about 2x.
  • Many servers serve HTML from image extensions, and jpg > png == gif > svg > jpeg > bmp > tiff
  • Naming is mostly not biased by first letter. The empty part is 3 letter extensions starting with y.

  • Only the top 5,000 extensions have more than 1000 pages.

Some caveats

  • This was done in October 2009; things might change. I’ll rerun it if people leave comments.
  • I only looked for extensions up to 4 letters. No numbers or funky symbols.
  • I am assuming the counts on Google’s search results are ACTUALLY correct.

If anyone makes any interesting observations with this data, please let me know and I’ll post it here. Pretty graphs are welcome as well 🙂

Hadoop Hacking on Yahoo! Ad Data

At the CMU Hackday we’re letting students play with an anonymized snapshot of our advertising data. (If you want access, email Jamie, sign something, and we’ll give you a key).

Basically, we have a cluster of EC2 machines running hadoop with the data loaded to play with. So, of course, I wanted to play.

Here is the README about the data

(1) "ydata-ysm-keyphrase-bid-imp-click-v1_0.gz" contains the following fields:

    0 day
    1 anonymized account_id
    2 rank
    3 anonymized keyphrase (expressed as list of anonymized keywords)
    4 avg bid
    5 impressions
    6 clicks

Snippet:

1       08bade48-1081-488f-b459-6c75d75312ae    2       2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0     2.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    3       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
29      08bade48-1081-488f-b459-6c75d75312ae    2       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     1.0     0.0
11      08bade48-1081-488f-b459-6c75d75312ae    1       769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0     2.0     0.0

I like python, and hadoop streaming lets me use it for map reducing. You basically write two scripts, map.py and reduce.py, which work on stdin and stdout. Here is me making those files:

mkdir money_made_rank
vim money_made_rank/map.py
<code it>
vim money_made_rank/reduce.py
<code it>
chmod +x money_made_rank/*.py

map.py:

#!/usr/bin/python
import sys

for line in sys.stdin:
    cols = line.strip().split('\t')
    if len(cols) != 7:  # skip malformed rows
        continue
    day, account, rank, keyphrase, bid, impressions, clicks = cols
    clicks = float(clicks)
    if clicks == 0:
        continue
    money = clicks * float(bid)
    print "%s\t%f" % (rank, money)

reduce.py:

#!/usr/bin/python
import sys
from operator import itemgetter

result = {}

for line in sys.stdin:
    key, money = line.strip().split('\t', 1)
    try:
        result[key] = result.get(key, 0) + float(money)
    except ValueError:
        continue

for key, money in sorted(result.items(), key=itemgetter(0)):
    print "%s\t%f" % (key, money)

Then you should test your stuff locally. For that, we left the .gz file and I just ran :

zcat /mnt/data/ydata-ysm-keyphrase-bid-imp-click-v1_0.gz | head -n 1000 | money_made_rank/map.py | sort | money_made_rank/reduce.py

And if it spits out

1       3540.000000
2       14604.489767
3       13516.602689
4       2668.682927
5       2250.000000
6       540.000000
7       540.000000

then you’re doing it right.

If you want to run on one machine, just take out the head -n 1000. That should take about 20 minutes to chew through all the data.

Let’s move to hadoop

Once it works in piping mode, it is very simple to run it on the cluster. Don’t change any files, just type:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-0.18.3-14.cloudera.CH0_3-streaming.jar \
  -input /data/ydata/* \
  -output money_made_rank \
  -mapper money_made_rank/map.py \
  -reducer money_made_rank/reduce.py \
  -file money_made_rank/*

This will print out a whole bunch of stuff, and at the end you should have a money_made_rank directory. Just print it and bask in the glory:

hadoop fs -cat 'money_made_rank/part-*' | sort -n

and it should print out :

1 14743915410.559452
2 5671857020.978109
3 3580521727.805751
4 1770068342.652887
5 1141008200.372228
6 531839794.947136
7 360624250.037246
8 266172741.734491
9 213458413.893067
10 189563018.302472
...

And then you can put it in a spreadsheet and make a pretty chart. Did you know half of our money comes from the #1 Search Ad?
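That “half of our money” claim roughly checks out against the top-10 numbers above (ignoring the “...” tail of ranks past 10, which is small):

```python
# Revenue-by-rank numbers copied from the hadoop output above
revenue_by_rank = {
    1: 14743915410.559452,
    2: 5671857020.978109,
    3: 3580521727.805751,
    4: 1770068342.652887,
    5: 1141008200.372228,
    6: 531839794.947136,
    7: 360624250.037246,
    8: 266172741.734491,
    9: 213458413.893067,
    10: 189563018.302472,
}
top1_share = revenue_by_rank[1] / sum(revenue_by_rank.values())
# roughly 0.52: about half of the top-10 money comes from rank 1
```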

Converting from MyISAM to InnoDB takes a long time

Wow, I didn’t think that with around 80 million rows, MySQL would take 7 hours to convert from MyISAM to InnoDB.

mysql> alter table metaward_achiever ENGINE=INNODB;
Query OK, 76756189 rows affected (6 hours 53 min 57.07 sec)
Records: 76756189  Duplicates: 0  Warnings: 0

mysql> show create table metaward_achiever;
| metaward_achiever | CREATE TABLE `metaward_achiever` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `modified` datetime NOT NULL,
  `created` datetime NOT NULL,
  `award_id` int(11) NOT NULL,
  `alias_id` int(11) NOT NULL,
  `count` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `metaward_achiever_award_id` (`award_id`),
  KEY `metaward_achiever_alias_id` (`alias_id`)
) ENGINE=InnoDB AUTO_INCREMENT=77166947 DEFAULT CHARSET=utf8 |

Paul Is A Nerd

Good thing I named it paulisageek.com; I’m no dork, dweeb, or nerd!

List of Guilded WoW players

I needed a list of as many World of Warcraft players as I could find. Sadly, Blizzard wasn’t giving this out on wowarmory, and neither was any other site I could find. Thankfully, I’m good at web scraping.

So, I went through each region, server, guild, and character on wowjutsu and pulled out 1.75 million characters.

World of Warcraft claims over 10 million players and wowjutsu claims 4 million users, but I could only find 1.75 million. Maybe the rest are unguilded or their guilds aren’t listed. Either way, 1.75 million is good enough for me; hopefully it helps you out.

stdicon

From an idea of progrium’s, I did a fun little hack last night.

(mimetype or file extension) -> icon == stdicon.com

In the vein of gravatar’s simple URLs, just add the file extension or mimetype onto the path and you will get a good icon representing it. There are more options for choosing your icon set, size, and default; head over to the root page to find out more.

It’s open source, so fork and fix. And if you have any ideas for good icon sets, please post a comment or email me and I’ll get them into the system.