File Extensions on the Internet


I had a simple question to which I couldn’t find an answer.

Which file extensions are used on the internet?

So I wrote a little program (webExtension.py), and half a million calls to Google later, we have some interesting data.

First, the raw data:

Top 10:

Extension           Pages
html        6 700 000 000
php         5 980 000 000
htm         1 690 000 000
asp         1 510 000 000
aspx        1 380 000 000
jsp           565 000 000
cfm           385 000 000
pdf           298 000 000
do            242 000 000
cgi           199 000 000

Some interesting facts I saw:

  • There are 1,305 unused 3-letter extensions out of the possible 17,576. That means about 92.6% are already used for something. (There IS a lot of junk though, so don’t be TOO alarmed.)
  • There are a lot of common extensions that I have NO idea what they are for. .e? .nhn?
  • About 4x as many pages use .html as .htm.
  • PHP is beating ASP by about 2x (counting .asp and .aspx together).
  • Many servers serve HTML from image extensions, and jpg > png == gif > svg > jpeg > bmp > tiff
  • Naming is mostly not biased by first letter. The one empty region is 3-letter extensions starting with y.
  • Only the top 5,000 extensions have more than 1,000 pages.
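
The ratios in the list above can be rechecked from the Top 10 figures. The snippet below is only a sanity check, not part of webExtension.py. The variable names are made up for illustration, and adding .asp and .aspx together for the PHP comparison is an assumption about how the "about 2x" figure was reached.

```python
# Result counts copied from the Top 10 list above.
counts = {
    "html": 6_700_000_000,
    "php": 5_980_000_000,
    "htm": 1_690_000_000,
    "asp": 1_510_000_000,
    "aspx": 1_380_000_000,
}

# Letter-only 3-character extensions: 26 choices per position.
possible_3_letter = 26 ** 3
unused_3_letter = 1305  # figure quoted in the list above
used_share = 1 - unused_3_letter / possible_3_letter

html_vs_htm = counts["html"] / counts["htm"]
php_vs_asp_only = counts["php"] / counts["asp"]
# Assumption: "ASP" means .asp and .aspx combined.
php_vs_asp_family = counts["php"] / (counts["asp"] + counts["aspx"])

print(f"possible 3-letter extensions: {possible_3_letter}")
print(f"share already in use:         {used_share:.1%}")
print(f"html / htm:                   {html_vs_htm:.2f}")
print(f"php / asp:                    {php_vs_asp_only:.2f}")
print(f"php / (asp + aspx):           {php_vs_asp_family:.2f}")
```

Only the combined .asp + .aspx reading lands near 2x. Against .asp alone, PHP comes out closer to 4x.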

Some caveats

  • This was done in October 2009, so things may have changed. I’ll rerun it if people leave comments.
  • I only looked for extensions up to 4 letters. No numbers or funky symbols.
  • I am assuming the counts on Google’s search results are ACTUALLY correct.
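
For a sense of scale, here is a rough sketch of how the candidate list could be enumerated. It is not the actual webExtension.py. It assumes lowercase letters only, lengths 1 through 4, and one search per candidate, and `candidate_extensions` is a hypothetical helper name.

```python
from itertools import product
from string import ascii_lowercase


def candidate_extensions(max_len=4):
    """Yield every lowercase, letter-only extension of length 1..max_len."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


# Count the candidates without holding them all in memory.
total = sum(1 for _ in candidate_extensions())
print(total)
```

Under those assumptions the total is 26 + 26² + 26³ + 26⁴ = 475,254 candidates, which lines up with the "half a million calls" figure.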

If anyone makes any interesting observations with this data, please let me know and I’ll post it here. Pretty graphs are welcome as well 🙂


6 Responses to “File Extensions on the Internet”

  1. Evan Says:

    Interesting stuff! Though of course it's hard to draw conclusions about PHP vs. ASP vs. JSP based only on file extensions…

  2. Paul Tarjan Says:

    Yup, but assuming that an equal number of developers for those platforms know how to use other extensions, then you can draw conclusions. You can use mod_rewrite with all of those languages (or whatever the tomcat equivalent is).

  3. K Says:

    Are you sure? JSP includes .jspx, .wss, and .do as well as .jsp, whereas I think .asp and .php don't have as many related extensions.

  4. Rasmus Says:

    K, actually there are quite a few missed PHP sites. .phtml, .php3 and .php4 would add about another 100 million pages.

  5. Paul Tarjan Says:

    I didn't mean for this to turn into a contest between languages. File extensions are a good indicator of languages, but definitely not conclusive. And Google's numbers are very approximate. If you know which extensions belong to which languages, please add them at http://stackoverflow.com/questions/1614520/common-file-extentions-for-web-programming-langauges so I can do a real analysis. Thanks!

  6. Blues Says:

    .asp just means Active Server Pages (WSAPI/Gateway). It does not imply a language. I've written ASP pages in VBScript, JScript and ProgressScript. Same with .cgi. It only refers to the protocol for data exchange between server and script. The actual script could be Perl, PHP, C, C++, bash, Python or anything else (those are the ones that I've used). I guess the best answer for both of these is "inconclusive".

