WordPress file statistics: binary and text file types

Big data processing is one of the most popular tools nowadays, and WordPress is one of the most popular content management systems. So what happens if we combine the two and try to extract some interesting statistics? How to use these statistics is up to you.

Let’s try and see what we get:

Any WordPress site can be divided into three logical parts: core, plugins and themes. So it is reasonable to look at these parts separately as well.
All processed data was taken from the following sources, respectively:
http://core.svn.wordpress.org/
http://plugins.svn.wordpress.org/
http://themes.svn.wordpress.org/
These sources are public SVN repositories, since WordPress is open source.

First of all: which file types are used in WordPress? Are they text or binary?

Which file is binary and which is text?

.txt files are usually text; that is the de facto standard. But if a special symbol occurs in one, is it still text or already binary? We adopted the following quick algorithm for deciding whether a file is text or binary:

  1. Read the first 4 KB of the file, or the whole file if it is shorter than 4 KB.
  2. Count the binary symbols, that is, those which are not: alphanumeric characters, spaces, tabs, line breaks, -, =, +, ~, !, @, #, $, %, ^, &, *, (, ), _, {, }, |, \, [, ], :, ;, ", ', <, >, ,, ., /, `
  3. If the binary symbols make up more than 0.3% of the total data, we count the file as binary; otherwise, as text.
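
The three steps above can be sketched in Python as follows. This is a minimal sketch, not the code actually used for the article. It assumes that "alphanumerical" means ASCII letters and digits, that line breaks are \n and \r, that the quote characters in the list are the plain ASCII " and ', that the 0.3% is taken relative to the bytes actually read, and that an empty file counts as text. The names is_binary_sample and is_binary_file are hypothetical. The allowed set is transcribed from step 2 as written, so a character such as ? is counted as binary here.

```python
import string

SAMPLE_SIZE = 4096        # step 1: look at the first 4 KB only
BINARY_THRESHOLD = 0.003  # step 3: more than 0.3% binary bytes means "binary"

# Step 2: bytes treated as textual (assumption: ASCII only).
PUNCTUATION = "-=+~!@#$%^&*()_{}|\\[]:;\"'<>,./`"
TEXT_BYTES = frozenset(
    (string.ascii_letters + string.digits + " \t\r\n" + PUNCTUATION).encode("ascii")
)


def is_binary_sample(sample: bytes) -> bool:
    """Classify an already-read sample of bytes."""
    if not sample:
        # Assumption: an empty file is counted as text.
        return False
    binary_count = sum(1 for byte in sample if byte not in TEXT_BYTES)
    return binary_count > BINARY_THRESHOLD * len(sample)


def is_binary_file(path: str) -> bool:
    """Read at most SAMPLE_SIZE bytes from the file and classify them."""
    with open(path, "rb") as handle:
        return is_binary_sample(handle.read(SAMPLE_SIZE))
```

Splitting the check into a sample-level function and a file-level wrapper keeps the rule itself testable without touching the file system, for example is_binary_sample(b"plain text\n") versus is_binary_sample(bytes(range(256))).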

WordPress core file types statistics

After some processing we got the following statistics for core:

Extension   Binary     Text   Binary / overall, %
.crt             0      165     0
.css             0    30128     0
.htm             0     1438     0
.html            0      458     0
.js              0    57982     0
.json            0       46     0
.md              0      203     0
.php             0   114539     0
.pot             0      876     0
.scss            0     1430     0
.svg             0     1691     0
.txt             0     3920     0
.xml             0      213     0
(none)           0      220     0
.swf           741       19    97.5
.eot           778        0   100
.gif         19062        0   100
.gz            201        0   100
.jpg          6360        0   100
.otf           244        0   100
.png         26731        0   100
.ttf           778        0   100
.woff          778        0   100
.xap           344        0   100
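
The per-extension counts and the percentage in the last column could be tallied roughly as shown here. This is an illustrative sketch only: the function name tally_by_extension and its input format (pairs of file name and a binary/text flag) are hypothetical, and lowercasing the extension is an assumption about how the files were grouped.

```python
import os
from collections import defaultdict


def tally_by_extension(classified):
    """Aggregate (filename, is_binary) pairs into per-extension rows.

    Returns a list of (extension, binary_count, text_count, binary_share_percent)
    tuples sorted by extension. Files without an extension are grouped under "".
    """
    counts = defaultdict(lambda: [0, 0])  # extension -> [binary, text]
    for name, is_binary in classified:
        extension = os.path.splitext(name)[1].lower()  # assumption: case-insensitive
        counts[extension][0 if is_binary else 1] += 1

    rows = []
    for extension in sorted(counts):
        binary, text = counts[extension]
        share = round(100.0 * binary / (binary + text), 1)
        rows.append((extension, binary, text, share))
    return rows
```

For the .swf row, 741 binary files out of 741 + 19 = 760 gives 741 / 760 = 97.5%, which matches the table.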

Most file types produce the expected result, except for .swf. That was caused by the file moxieplayer.swf in versions 3.5.2, 3.6 and 3.6.1, which actually contains the following HTML:

<html><body>You are being <a href="https://raw.github.com/moxiecode/moxieplayer/master/bin-release/moxieplayer.swf">redirected</a>.</body></html>

It looks like some kind of bug in the release process. Its binary contents were restored starting from version 3.7.

Another Flash file that skews the statistics is plupload.flash.swf. It contains the hex data “2E 2E 2E 0A”, which can also be interpreted as text.
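
To see why those bytes read as text, they can be decoded directly. This small sketch only decodes the four bytes quoted above; it makes no claim about anything else the file may contain.

```python
# Decode the hex bytes quoted in the text.
data = bytes.fromhex("2E 2E 2E 0A")
decoded = data.decode("ascii")

print(repr(decoded))  # prints '...\n': three dots followed by a line break
```

Both the dot and the line break are on the list of textual symbols, so none of these bytes counts as binary.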

Plugin information coming soon.

Theme information coming soon.