Censorship Monitoring Project API Block Rules Format

Document version: 0.2.3

This document is updated to match the running code. Feature proposals may be included here, and are identified as such.

JSON example

{
  "org-block-rules": "0.2.3",
  "version": 1390086314,
  "self-test": {
    "must-block": [ "http://www.example.com" ],
    "must-allow": [ "http://www.example.net" ]
  },
  "rules": [
    {
      "category": "$2.1",
      "product": "NetScreen",
      "match": [
        "ip4:10/8,192.168/16",
        "re:body:your connection has been blocked because ([a-z]+)"
      ]
    },
    {
      "isp": "Talk Talk",
      "match": [
        "re:url:^http://www\\.talktalk\\.example\\.com/blocked/error\\.asp"
      ],
      "blocktype": ["PARENTAL"],
      "category": "querystring:urlclassname:base64"
    },
    {
      "match": [
        "status:451"
      ]
    }
  ]
}

Description

The rules file is parsed as JSON.

org-block-rules is used to both verify that this is a rules file, and to indicate the document version applying to the file format as per semver.org. If the integer part before the first . is higher than expected then the client cannot continue and should report an error. When requesting a rules update, the client will indicate the document version it was programmed against.
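
The version check described above could look roughly like this in a Python client. This is a minimal sketch, not the project's actual code: the function name check_format_version and the supported_major parameter are hypothetical, and raising ValueError is just one way to "report an error".

```python
def check_format_version(rules: dict, supported_major: int = 0) -> str:
    """Return the format version string, or raise if the file is unusable.

    `rules` is the parsed JSON document. `supported_major` is the major
    version this (hypothetical) client was programmed against.
    """
    fmt = rules.get("org-block-rules")
    if not isinstance(fmt, str):
        # The key doubles as a marker that this really is a rules file.
        raise ValueError("not a rules file: org-block-rules is missing")
    major_text = fmt.split(".", 1)[0]  # integer part before the first "."
    if not major_text.isdigit():
        raise ValueError(f"malformed format version: {fmt!r}")
    if int(major_text) > supported_major:
        raise ValueError(f"unsupported format version: {fmt}")
    return fmt
```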

version is used to check whether you have the latest version of the file. It is simply an integer, where higher is newer. The easiest way to generate it is perhaps a Unix timestamp. If the rules file is being served raw by HTTP then this value should be used as the ETag, with :json appended.
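
As an illustration of the version and ETag conventions, the helpers below build the ETag value and compare two versions. The names etag_for and is_newer are assumptions made for this sketch. The functions return the bare value only; HTTP itself requires the value to be wrapped in double quotes when it is sent in an ETag header.

```python
def etag_for(version: int, serialisation: str = "json") -> str:
    """Build the ETag value: the version with ":json" (or ":xml") appended."""
    return f"{version}:{serialisation}"


def is_newer(candidate: int, current: int) -> bool:
    """Higher version numbers are newer; equal means already up to date."""
    return candidate > current
```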

rules have the following attributes:

match
lists the conditions that must apply for this rule to match. All conditions must apply simultaneously for the rule to match. The match types are listed below.
isp, category, product
are strings. Each can include $x.y, where x and y are integers, which means "insert regular expression match group y from match x". All are optional. isp indicates a rule that applies to a single ISP; category is for "porn/alcohol/violence/blogs/etc"; product is for rules that match specific blocking products that might be used by anyone, not just a specific ISP (including corporate LANs etc).
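
The $x.y substitution could be sketched as follows. Two things here are assumptions rather than part of the format: x is treated as a 1-based index into the rule's match list (inferred from the JSON example, where "$2.1" refers to the second match entry), and the function name expand_placeholders and the shape of match_results are hypothetical.

```python
import re

# Matches placeholders such as "$2.1" (match 2, group 1).
PLACEHOLDER = re.compile(r"\$(\d+)\.(\d+)")


def expand_placeholders(template: str, match_results: list) -> str:
    """Replace each $x.y with group y of the x-th match result.

    match_results[i] holds the re.Match object for match entry i + 1,
    or None when that entry is not a regular expression (e.g. ip4).
    """
    def replace(placeholder: re.Match) -> str:
        x, y = int(placeholder.group(1)), int(placeholder.group(2))
        if not 1 <= x <= len(match_results) or match_results[x - 1] is None:
            raise ValueError(f"${x}.{y} does not refer to a regexp match")
        return match_results[x - 1].group(y)

    return PLACEHOLDER.sub(replace, template)
```

With the NetScreen rule from the JSON example, a body containing "blocked because gambling" would give the category "gambling".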

Match strings are all type:value, where type can be:

ip4
a comma-separated list of either full IP addresses or partial addresses with a prefix length, e.g. 1.2.3.4 or 1.2/8. Any of the addresses matching counts as a match.
ip6
whatever the equivalent is for IPv6 ;-)
re
is of the format where:regexp, where where is an HTTP header name, or body to search the body for the match. The examples in this document also use url as the where value, which appears to match against the URL rather than a header. The regexp can match anywhere in the relevant content, i.e. use ^ to match only at the start. Regular expressions should be as per ECMA-262 3rd edition (i.e. JavaScript regexps).
status
is either an integer to match a specific HTTP status, or a string to do a simple match on the text after the HTTP status, e.g. 451 or Unavailable for Legal Reasons.
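
As an illustration of the ip4 match type above, the sketch below expands partial addresses such as 10/8 by padding the missing octets with zeros, then checks membership. The function names are hypothetical, and the format does not say which address the client compares against, so that is left to the caller.

```python
import ipaddress


def parse_ip4_entry(entry: str) -> ipaddress.IPv4Network:
    """Turn "1.2.3.4", "10/8" or "192.168/16" into a network object."""
    address, _, prefix = entry.strip().partition("/")
    octets = address.split(".")
    octets += ["0"] * (4 - len(octets))  # pad partial addresses with zeros
    dotted = ".".join(octets)
    # A full address without a prefix length matches only itself (/32).
    return ipaddress.IPv4Network(f"{dotted}/{prefix or '32'}", strict=False)


def ip4_matches(value: str, address: str) -> bool:
    """True if `address` falls inside any entry of the comma-separated list."""
    ip = ipaddress.IPv4Address(address)
    return any(ip in parse_ip4_entry(entry) for entry in value.split(","))
```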

If the client does not recognise a match type then it should discard that rule entirely but can continue to operate using the other rules that it does understand.
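
A client might filter its rule list along these lines. This is a sketch under assumptions: the names split_match and usable_rules are hypothetical, and treating a match string with no colon as unusable is a choice made here, not something the format specifies.

```python
KNOWN_MATCH_TYPES = {"ip4", "ip6", "re", "status"}


def split_match(match: str) -> tuple:
    """Split "type:value" at the first colon only, so values may contain colons."""
    match_type, separator, value = match.partition(":")
    if not separator:
        raise ValueError(f"not a type:value match string: {match!r}")
    return match_type, value


def usable_rules(rules: list) -> list:
    """Keep only the rules whose match types are all recognised."""
    kept = []
    for rule in rules:
        try:
            types = [split_match(m)[0] for m in rule.get("match", [])]
        except ValueError:
            continue  # assumption: malformed match strings also discard the rule
        if all(t in KNOWN_MATCH_TYPES for t in types):
            kept.append(rule)
    return kept
```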

Within an ISP block, a category rule may be defined. This is of the form:

matchtype:parameter:<optional modifier>

The only matchtype supported at the moment is "querystring", which extracts the value of the querystring parameter identified by the second segment of the category rule (parameter). An optional third segment of the rule (modifier) indicates that an additional transformation is to be performed on the value. The only currently supported modifier is "base64", which indicates that the value should be base64 decoded before the result is sent off.
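
The category rule could be evaluated roughly as below. This is a sketch with assumptions: the function name extract_category is hypothetical, the URL passed in is assumed to be the URL of the block page, the decoded value is assumed to be UTF-8 text, and returning None for anything unsupported is a choice made for the example.

```python
import base64
from urllib.parse import parse_qs, urlsplit


def extract_category(rule_spec: str, url: str):
    """Evaluate "querystring:parameter[:modifier]" against a URL."""
    parts = rule_spec.split(":")
    if len(parts) not in (2, 3) or parts[0] != "querystring":
        return None  # unsupported matchtype
    parameter = parts[1]
    modifier = parts[2] if len(parts) == 3 else None

    values = parse_qs(urlsplit(url).query).get(parameter)
    if not values:
        return None  # parameter absent from the query string
    value = values[0]

    if modifier == "base64":
        value = base64.b64decode(value).decode("utf-8", errors="replace")
    elif modifier is not None:
        return None  # unsupported modifier
    return value
```

Note that parse_qs turns a literal + in the query string into a space, so base64 values containing + need to arrive percent-encoded for this sketch to decode them correctly.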

All regular expression and other string matching is done case-insensitively.

Blocktype gives a value that is positionally associated with one of the match rules for a given ISP. When a rule is matched, the blocktype that goes with that match rule will be returned in the result. This allows a probe to record what the motivation for a site block was (where this can be determined programmatically). Currently supported values are "PARENTAL", meaning an ISP's parental control system, and "COPYRIGHT", where an ISP is blocking access to a file-sharing site. A future version of the config file layout will associate these values more directly (perhaps by combining them into a tuple or dictionary).

If an HTTP fetch results in a redirect, the matches are applied to both the redirect response itself and the results of the fetch of that redirect, and if that is a redirect, the results of the fetch of that, etc.
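
To illustrate applying a match to every response in a redirect chain, the sketch below uses the status match type. The chain is modelled as a list of (status code, reason text) pairs, which is an assumption of this example, as are the function names. The "simple match" on the status text is interpreted here as a case-insensitive comparison of the whole text; the format does not say whether a substring match is intended instead.

```python
def status_matches(value: str, status_code: int, reason: str) -> bool:
    """Evaluate the value of a "status:" match against one response."""
    value = value.strip()
    if value.isdigit():
        return status_code == int(value)
    # Assumption: compare the whole reason text, ignoring case.
    return value.lower() == reason.strip().lower()


def first_match_in_chain(value: str, chain: list):
    """Return the index of the first response in the chain that matches.

    `chain` lists the responses in fetch order: the original response,
    then the response to each redirect that was followed.
    """
    for index, (status_code, reason) in enumerate(chain):
        if status_matches(value, status_code, reason):
            return index
    return None
```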

The self-test section describes a set of URLs which a probe should fetch when starting up. The URLs in the must-allow section should all be retrievable without any sign of blocking, and the ones in must-block should trigger the probe's blocking rule processor. This allows the probe to check that the line it is operating against is functional and does have some kind of blocking filter running on it. If the probe is unable to retrieve the must-allow URLs (or successfully retrieves the must-block URLs), it should exit logging an error.
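
The start-up self-test could be sketched like this. The is_blocked callable is hypothetical: it stands in for whatever the probe uses to fetch a URL and run its blocking rule processor, and it is assumed to return True both when a rule matches and when the URL cannot be retrieved at all.

```python
def run_self_test(self_test: dict, is_blocked) -> list:
    """Return a list of failure messages; an empty list means the test passed."""
    failures = []
    for url in self_test.get("must-allow", []):
        if is_blocked(url):
            failures.append(f"must-allow URL could not be retrieved: {url}")
    for url in self_test.get("must-block", []):
        if not is_blocked(url):
            failures.append(f"must-block URL was retrieved: {url}")
    return failures
```

A probe would log the returned messages and exit if the list is not empty.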

XML syntax

The same data with the same semantics could also be transmitted as XML. The ETag should be the version parameter of the org-block-rules tag, with :xml appended.

XML example

<?xml version="1.0"?>
<org-block-rules format="0.2.1" version="1390086314">
  <self-test>
    <must-block><url>http://www.example.com</url></must-block>
    <must-allow><url>http://www.example.net</url></must-allow>
  </self-test>
  <rule category="$2.1" product="NetScreen">
    <match>ip4:10/8,192.168/16</match>
    <match>re:body:your connection has been blocked because ([a-z]+)</match>
  </rule>
  <rule isp="Talk Talk">
    <match>re:url:^http://www\.talktalk\.example\.com/blocked/error\.asp</match>
  </rule>
  <rule>
    <match>status:451</match>
  </rule>
</org-block-rules>
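
Since the XML form carries the same data, a client could convert it into the same structure as the JSON form. The sketch below does this with Python's standard library. The function name rules_from_xml is hypothetical, and the mapping of the format attribute onto the org-block-rules key is inferred from the two examples rather than stated by the format.

```python
import xml.etree.ElementTree as ET


def rules_from_xml(text: str) -> dict:
    """Convert the XML form into the same shape as the JSON form."""
    root = ET.fromstring(text)
    if root.tag != "org-block-rules":
        raise ValueError("not a rules file")
    doc = {
        "org-block-rules": root.get("format"),
        "version": int(root.get("version")),
        "rules": [],
    }
    self_test = root.find("self-test")
    if self_test is not None:
        doc["self-test"] = {
            name: [url.text for url in self_test.findall(f"{name}/url")]
            for name in ("must-block", "must-allow")
        }
    for rule in root.findall("rule"):
        entry = dict(rule.attrib)  # isp, category, product, if present
        entry["match"] = [m.text for m in rule.findall("match")]
        doc["rules"].append(entry)
    return doc
```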

XML schema

TODO: update when JSON version has settled down a bit

<?xml version='1.0'?>
<schema xmlns='http://www.w3.org/2001/XMLSchema'>
  <element name='org-block-rules'>
    <complexType>
      <sequence minOccurs='1' maxOccurs='unbounded'>
        <element name='rule'>
          <complexType>
            <sequence minOccurs='1' maxOccurs='unbounded'>
              <element name='match'>
                <complexType mixed='true'>
                  <anyAttribute/>
                </complexType>
              </element>
              <any minOccurs='0' namespace="##other"/>
            </sequence>
            <attribute name='isp' type='string' use='optional'/>
            <attribute name='category' type='string' use='optional'/>
            <attribute name='product' type='string' use='optional'/>
            <anyAttribute/>
          </complexType>
        </element>
        <any minOccurs='0' namespace="##other"/>
      </sequence>
      <attribute name='format' type='string' use='required'/>
      <attribute name='version' type='string' use='required'/>
      <anyAttribute/>
    </complexType>
  </element>
</schema>