Big data processing is one of the most popular tools nowadays, and WordPress is one of the most popular content management systems. So what happens if we cross these two popular things and try to extract some interesting statistics? How to use these statistics is up to you.
Let’s try and see what we get:
Any WordPress site can be divided into 3 logical parts: core, plugins and themes. So it is reasonable to look at these parts separately as well.
All processed data was taken from the following sources, respectively:
http://core.svn.wordpress.org/
http://plugins.svn.wordpress.org/
http://themes.svn.wordpress.org/
These sources are public SVN repositories, since WordPress is open source.
First of all: which file types are used in WordPress? Are they text or binary?
Which file is binary and which is text?
.txt files are usually text; that is the de facto standard. But if a special symbol occurs in one, is it still text or already binary? So we adopted the following quick algorithm to decide whether a file is text or binary:
- Read the first 4 KB of the file, or the whole file if it is shorter than 4 KB
- Count the binary symbols, i.e. those which are not: alphanumeric, spaces, tabs, line breaks, -, =, +, ~, !, @, #, $, %, ^, &, *, (, ), _, {, }, |, \, [, ], :, ;, “, ‘, <, >, ,, ., /, `
- If the binary symbols make up more than 0.3% of the total data, we count the file as binary, otherwise as text.
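The steps above can be sketched in Python roughly as follows. This is only an illustration, not the code used to produce the numbers below: the names `TEXT_CHARS`, `is_binary_sample` and `is_binary_file` are made up for this example, the two quote characters in the list are assumed to be the plain ASCII `"` and `'`, "total data" is assumed to mean the bytes actually read, and an empty file is assumed to count as text.

```python
import string

# Bytes treated as text, following the list above.
# Assumption: the two quote characters in the list are plain " and '.
TEXT_CHARS = frozenset(
    (string.ascii_letters + string.digits + " \t\r\n"
     + "-=+~!@#$%^&*()_{}|\\[]:;\"'<>,./`").encode("ascii")
)

SAMPLE_SIZE = 4096         # read at most the first 4 KB
BINARY_THRESHOLD = 0.003   # 0.3% of the bytes read


def is_binary_sample(sample: bytes) -> bool:
    """Return True if the share of non-text bytes exceeds the threshold."""
    if not sample:
        return False  # assumption: an empty file counts as text
    binary_count = sum(1 for byte in sample if byte not in TEXT_CHARS)
    return binary_count / len(sample) > BINARY_THRESHOLD


def is_binary_file(path: str) -> bool:
    """Classify a file by its first 4 KB (or less, if the file is shorter)."""
    with open(path, "rb") as fh:
        return is_binary_sample(fh.read(SAMPLE_SIZE))
```

Note that in this sketch every byte outside the list counts as binary, including characters such as `?` that the list does not mention.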
WordPress core file types statistics
After some processing we got the following statistics for core:
| Extension | binary | text | %, binary / overall |
|---|---|---|---|
| .crt | 0 | 165 | 0 |
| .css | 0 | 30128 | 0 |
| .htm | 0 | 1438 | 0 |
| .html | 0 | 458 | 0 |
| .js | 0 | 57982 | 0 |
| .json | 0 | 46 | 0 |
| .md | 0 | 203 | 0 |
| .php | 0 | 114539 | 0 |
| .pot | 0 | 876 | 0 |
| .scss | 0 | 1430 | 0 |
| .svg | 0 | 1691 | 0 |
| .txt | 0 | 3920 | 0 |
| .xml | 0 | 213 | 0 |
| (no extension) | 0 | 220 | 0 |
| .swf | 741 | 19 | 97.5 |
| .eot | 778 | 0 | 100 |
| .gif | 19062 | 0 | 100 |
| .gz | 201 | 0 | 100 |
| .jpg | 6360 | 0 | 100 |
| .otf | 244 | 0 | 100 |
| .png | 26731 | 0 | 100 |
| .ttf | 778 | 0 | 100 |
| .woff | 778 | 0 | 100 |
| .xap | 344 | 0 | 100 |
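As a rough illustration of how such a table can be put together, the sketch below groups classified files by extension and computes the last column as binary / (binary + text). The helpers `tally` and `binary_share` are hypothetical names for this example, and the input is assumed to be a list of (file name, is binary) pairs produced by some classifier.

```python
import os


def tally(files):
    """Count [binary, text] files per lower-cased extension.

    `files` is an iterable of (file_name, is_binary) pairs.
    Files without an extension are grouped under the empty string.
    """
    counts = {}
    for name, is_binary in files:
        ext = os.path.splitext(name)[1].lower()
        row = counts.setdefault(ext, [0, 0])
        row[0 if is_binary else 1] += 1
    return counts


def binary_share(binary: int, text: int) -> float:
    """Percentage of binary files among all files of one extension."""
    total = binary + text
    return round(100 * binary / total, 1) if total else 0.0


# The .swf row of the table: 741 binary and 19 text files.
print(binary_share(741, 19))  # 97.5
```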
Most file types produce the expected result, except .swf. That was caused by the file moxieplayer.swf in versions 3.5.2, 3.6 and 3.6.1. It actually contains the following HTML:
<html><body>You are being <a href="https://raw.github.com/moxiecode/moxieplayer/master/bin-release/moxieplayer.swf">redirected</a>.</body></html>
This looks like some kind of bug in the release process. Its binary contents were restored starting from version 3.7.
Another Flash file which caused this statistical offset is plupload.flash.swf. It contains the hex data “2E 2E 2E 0A” (three dots followed by a line break), which can also be interpreted as text.
Plugins information coming soon.
Themes information coming soon.
