Censorship Monitoring Project Android app

Source code for a proof of concept is available here and can be installed here.

Key Concepts

Automatic ISP Detection

To reduce overhead for non-technical users, and to reduce the chances of malicious users influencing ISP demographics, certain identifying elements can be extracted directly from the phone and the SIM via the TelephonyManager class.

The simple example code below shows how the current network name (e.g. EE / Orange) and the name of the SIM's operator can be extracted automatically:


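// Obtain the system telephony service from the hosting Activity
TelephonyManager telephonyManager = (TelephonyManager) getActivity().getSystemService(Context.TELEPHONY_SERVICE);

// Name of the network the device is currently registered on (e.g. EE)
String mobileNet = telephonyManager.getNetworkOperatorName();

// Name of the operator that issued the SIM
String simNet = telephonyManager.getSimOperatorName();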


Similarly one can detect if the phone / tablet is currently using a WiFi network via the use of the WifiManager class.

If a connection to a wireless network is identified, the user can be prompted to identify their ISP. Alternatively, a simple GET request can be issued to a remote service with a JSON endpoint (e.g. http://wtfismyip.com/json) or to an ORG-managed service, which can return additional OONI information such as the AS number.

Identifying Optional Filtering Level

Certain mobile phone providers offer varying levels of optional filtering, so two devices on the same network may experience different filtering behaviour.

To ensure accuracy and prevent misleading false positives, some effort should be made to periodically ascertain the level of filtering present on the device before new, unknown URLs are tested.

To achieve this, a canary list of known blocked URLs for given filter levels should be maintained. When tested upon device setup (and periodically thereafter), the results should give a clear indication of the level of filtering to expect on the device.

The app could either request the full canary list (with legal and bandwidth ramifications) or a cut-down version that contains only the generic and ISP-specific URLs.

Example Canary List

{

   "generic": {
       "copyright": [
           "piratebay.se",
           "xyz.com"
       ],
       "gambling": [
           "bobsgamblingemporium.com",
           "paydaygamblingloans.xxx"
       ]
   },
   "ISP WITH FILTER TYPES": {
       "FILTER NAME A": [
           "FILTER_A_TYPE_URL_1",
           "FILTER_A_TYPE_URL_2"
       ],
       "FILTER NAME B": [
           "FILTER_B_TYPE_URL_1",
           "FILTER_B_TYPE_URL_2"
       ]
   },
   "ISP WITH CATEGORIES": {
       "pornography": [
           "xxx.xxx",
           "sex.com"
       ],
       "gambling": [
           "somegamblingplace.com",
           "anothergamblingplace.com"
       ],
       "esoteric": [
           "hipsters.com"
       ]
   }

}
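As a rough sketch of how canary results might be turned into a filter level, the Java below takes one ISP section of the list (filter or category name mapped to its canary URLs) plus the set of canaries the probe found blocked. The class name, method name and the rule that every canary in a category must be blocked are assumptions for illustration, not part of the proof of concept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: infers which filters appear to be active on a device.
final class CanaryEvaluator {

    /**
     * canaries: filter or category name -> canary URLs (one section of the list)
     * blocked:  canary URLs that the probe observed as blocked
     * Returns the names whose canaries were ALL blocked (assumed rule).
     */
    static List<String> activeFilters(Map<String, List<String>> canaries,
                                      Set<String> blocked) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : canaries.entrySet()) {
            List<String> urls = entry.getValue();
            // An empty category proves nothing, so it is never reported.
            if (!urls.isEmpty() && blocked.containsAll(urls)) {
                active.add(entry.getKey());
            }
        }
        return active;
    }
}
```

Requiring every canary to be blocked is deliberately strict; a threshold (for example, more than half) may suit lists where some canaries go stale.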

Uniquely Identifying Devices

An MD5 hash of the Android ANDROID_ID can be used to uniquely identify each device.

The Android ID is a 64-bit number (as a hex string) that is randomly generated when the user first sets up the device and should remain constant for the lifetime of the user's device. The value may change if a factory reset is performed on the device, or it may be changed by malicious users[1].

To prevent abuse the ANDROID_ID could be used as a seed / salt for generating a UUID that can be tied to a user account during registration.
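The two ideas above could be sketched in plain Java as follows. On a device the input string would come from Settings.Secure.ANDROID_ID; here it is simply passed in. The class and method names, and the choice to combine the ID with an account name via a name-based UUID, are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical helper for deriving device identifiers.
final class DeviceIdentity {

    /** Hex-encoded MD5 digest of the given ANDROID_ID string. */
    static String md5Hex(String androidId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(androidId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required MessageDigest algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Name-based (version 3) UUID seeded with the ANDROID_ID and an account name. */
    static UUID accountUuid(String androidId, String accountName) {
        String seed = androidId + ":" + accountName;
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the UUID is derived deterministically, the same device and account always map to the same value, while the raw ANDROID_ID never has to leave the device.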

Spoofing User Agents

Android HTTP requests (GET, POST, HEAD, etc.) can carry custom HTTP headers.

User Agent strings can be passed to devices along with other probe payload elements.

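// checkURL is the URL under test, taken from the probe payload
HttpHead headRequest = new HttpHead(checkURL);

// Replace the default User-Agent with the one supplied for this probe
headRequest.setHeader("User-Agent", "OONI Probe");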

URL Payload Delivery

Polling is expensive for the battery, especially when a poll yields no work. With the URL traffic generated by bit.ly alone running to hundreds of millions per day[2], we'll never be short of work.


GCM Full Payload

Google Cloud Messaging (GCM) enables correctly authenticated servers to push payload messages to the app.

A URL, its MD5 hash, expected header information (content type, length, etc.) and response code, an urgency flag and an optional HMAC salt easily fit within the permissible payload limit (4 KB).
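To make the size argument concrete, the sketch below serialises a job to JSON by hand and checks it against the 4 KB limit quoted above. The field names, the JSON layout and the class name are assumptions; the real payload format has not been fixed, and the optional HMAC salt is left out for brevity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical job payload; field names are illustrative only.
final class ProbePayload {
    static final int GCM_LIMIT_BYTES = 4096; // the 4 KB limit mentioned above

    final String url;
    final String md5;
    final int expectedStatus;
    final String expectedContentType;
    final boolean urgent;

    ProbePayload(String url, String md5, int expectedStatus,
                 String expectedContentType, boolean urgent) {
        this.url = url;
        this.md5 = md5;
        this.expectedStatus = expectedStatus;
        this.expectedContentType = expectedContentType;
        this.urgent = urgent;
    }

    /** Minimal hand-rolled JSON; only backslashes and quotes are escaped. */
    String toJson() {
        return "{"
            + "\"url\":\"" + escape(url) + "\","
            + "\"md5\":\"" + escape(md5) + "\","
            + "\"status\":" + expectedStatus + ","
            + "\"contentType\":\"" + escape(expectedContentType) + "\","
            + "\"urgent\":" + urgent
            + "}";
    }

    /** True when the UTF-8 encoded payload is within the limit. */
    boolean fitsGcmLimit() {
        return toJson().getBytes(StandardCharsets.UTF_8).length <= GCM_LIMIT_BYTES;
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A job that does not fit could fall back to the notification-only approach, with the device fetching the details from an ORG server.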

GCM Notification Payload

Because it is possible that Google could suppress delivery of flagged payloads, an alternative would be a payload consisting of a simple notification asking the device to query an ORG server for new URLs when possible.

This would be useful as it still saves some battery, is unlikely to be interfered with and can pass urgent requests to devices as needed (e.g. if the job distribution servers urgently want full coverage of all ISPs for a particular URL).

Polling

To ensure that no interference takes place (assuming that the ORG servers can be reached), devices can be set to poll the ORG servers directly at set intervals using the AlarmManager class.

This is the most battery hungry option but ensures the least amount of interference / points of failure.

Notifying Users

Android has a very rich notification system that allows a wide variety of information to be displayed.

For example, the Bowdlerize app will show a ticker (scrolling text) on the notification bar when a new URL is received.

Expanding the notification area will show an indeterminate progress bar and some information (e.g. the MD5 hash, last poll time, etc.).

Once the URL has finished being resolved, a ticker is sent again if the URL was possibly censored.

The main notification changes to reflect the last poll time and last result set.

Reducing Bandwidth / Battery Utilisation & Legal Liability (?)

HTTP HEAD

The HTTP HEAD method is defined in section 9.4 of RFC2616:

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.

The benefit is that correctly behaving servers will return the exact same headers but the phone / tablet will not have to download the entire payload (saving bandwidth) and won't actually ever receive any bytes of the (potentially) illegal material.

This has been tested against known blocked URLs (e.g. piratebay), and they are correctly interpreted as censored.

T-Mobile allows the use of HEAD, but not GET, on blocked URLs. This may apply to additional ISPs. --Korikisulda (talk) 15:39, 15 February 2014 (GMT)
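A HEAD probe could be sketched as below. It uses the JDK's HttpURLConnection rather than the Apache HttpHead class shown earlier, purely so that the example is self-contained; the class name, the timeout values and the decision not to follow redirects are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical HEAD probe built on the JDK's own HTTP client.
final class HeadProbe {

    /** Prepares (but does not send) a HEAD request with the given User-Agent. */
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setRequestProperty("User-Agent", userAgent);
            // Assumption: redirects are reported to the caller, not followed.
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10_000); // illustrative timeouts
            conn.setReadTimeout(10_000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the status code; no body is downloaded. */
    static int status(HttpURLConnection conn) {
        try {
            return conn.getResponseCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            conn.disconnect();
        }
    }
}
```

Given the T-Mobile observation above, a probe may want to repeat the check with GET when HEAD succeeds on a URL that is expected to be blocked.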

Frequency

The Android radio will stay awake for up to 30 seconds after any wakeup, so constantly querying URLs will drain a lot of battery even if each transfer is only a few hundred bytes.

Storing a minimum permissible interval between jobs on the task distribution backend will ensure that jobs are only dispatched to phones / tablets on a timetable that their users are happy with.

By using the BatteryManager class, different intervals can be allowed when the device is charging.
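One possible way to combine these points is sketched below. The charging state would come from BatteryManager on the device; here it is passed in as a boolean. The class name, the chargingDivisor knob and the use of the 30-second radio tail as a lower bound are all assumptions for illustration.

```java
// Hypothetical scheduling helper; the policy shown is an assumption.
final class PollInterval {
    // Radio stays awake for up to 30 seconds after a wakeup (see above).
    static final long RADIO_TAIL_MS = 30_000L;

    /**
     * userMinimumMs:   minimum delay between jobs chosen by the user
     * charging:        whether the device is currently on external power
     * chargingDivisor: how many times more often to run while charging
     */
    static long nextDelayMs(long userMinimumMs, boolean charging, int chargingDivisor) {
        if (chargingDivisor < 1) {
            throw new IllegalArgumentException("chargingDivisor must be >= 1");
        }
        if (!charging) {
            return userMinimumMs;
        }
        // While charging, shorten the interval but never below the radio tail.
        return Math.max(RADIO_TAIL_MS, userMinimumMs / chargingDivisor);
    }
}
```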

Discovering Censorship

Current Implementation

Currently the Bowdlerize app identifies censorship (with varying degrees of confidence) if any of the following are true:

  • An HTTP 403 or 404 response code is received
  • A header contains "forbidden" or "blocked"
  • Any of the following exceptions is unexpectedly thrown
    • ConnectTimeoutException
    • NoHttpResponseException
    • IOException
    • IllegalStateException
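The checks listed above could be expressed roughly as follows. This is not the Bowdlerize source: the class and method names are hypothetical, header matching is assumed to be case-insensitive on header values, and the two Apache exceptions are covered by the IOException check on the understanding that both are IOException subclasses.

```java
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical restatement of the current heuristics.
final class CensorshipHeuristics {

    static boolean statusSuspicious(int status) {
        return status == 403 || status == 404;
    }

    /** True if any header value mentions "forbidden" or "blocked". */
    static boolean headersSuspicious(Map<String, List<String>> headers) {
        for (List<String> values : headers.values()) {
            for (String value : values) {
                if (value == null) {
                    continue;
                }
                String lower = value.toLowerCase(Locale.ROOT);
                if (lower.contains("forbidden") || lower.contains("blocked")) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Covers the listed exception types (IOException and its subclasses). */
    static boolean errorSuspicious(Throwable error) {
        return error instanceof IOException || error instanceof IllegalStateException;
    }

    /** error is null when the request completed and a response was received. */
    static boolean possiblyCensored(int status, Map<String, List<String>> headers,
                                    Throwable error) {
        if (error != null) {
            return errorSuspicious(error);
        }
        return statusSuspicious(status) || headersSuspicious(headers);
    }
}
```

Note that a genuine 404 also trips the first rule, which is one reason the result is only ever "possibly censored".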

Bringing in line with OONI Specification

During the URL gathering stage headers and response codes are gathered.

These should be sent downstream to the device for comparison to further evaluate the extent of any tampering.

Process Flow

URL Aggregation Servers

Servers in data centres (especially those where ORG or ORG volunteers control BGP) are unlikely to experience filtering.

When URLs are submitted by users or harvested via third-party aggregation methods, they are checked by the servers to establish a baseline for headers, content type, length and HTTP response code.

Once checked (preferably from several locations with differing User Agent strings), the URLs can be added to a database for distribution to probes.

URL Profiling

During initial URL probing it would be useful to gather additional meta data about the URL.

  • AS Number / AS Path
  • Traceroute
  • SSL meta data (Present / key length / CA)
  • Net block
  • Registrar
  • URL arguments / fragments
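Some of this profile can be derived from the URL string alone. The sketch below pulls out the arguments, the fragment and whether TLS is in use; the class and method names are assumptions, and values are returned raw (not percent-decoded). AS lookups, traceroutes and registrar data need external services and are not shown.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper covering the URL-only parts of the profile.
final class UrlProfile {

    /** Query arguments in order of appearance; later duplicates win. */
    static Map<String, String> queryArguments(String url) {
        Map<String, String> args = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return args;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            args.put(key, value);
        }
        return args;
    }

    /** Fragment after '#', or null when the URL has none. */
    static String fragment(String url) {
        return URI.create(url).getFragment();
    }

    /** True for https URLs, i.e. where SSL meta data is worth collecting. */
    static boolean usesTls(String url) {
        return "https".equalsIgnoreCase(URI.create(url).getScheme());
    }
}
```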

Job Distribution

It is important to ensure that no given device is overused, that no ISP is over-represented and that user preferences are honoured.

Possible user preferences:

  • Total (Daily / Hourly) bandwidth limits
  • Interval between requests
  • Only use WiFi
  • Only query when the phone is in active use (e.g. don't wake the device up to query)
  • URL categories (Difficult to know ahead of time)
  • Word blacklists
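A few of these preferences could be enforced with a check along the following lines, run before a job is accepted. The class, its fields and the way the byte estimate is supplied are assumptions; only the bandwidth limit, WiFi-only and word blacklist preferences are covered.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical preference check; covers three of the preferences above.
final class UserPreferences {
    final long dailyByteLimit;
    final boolean wifiOnly;
    final List<String> wordBlacklist;

    UserPreferences(long dailyByteLimit, boolean wifiOnly, List<String> wordBlacklist) {
        this.dailyByteLimit = dailyByteLimit;
        this.wifiOnly = wifiOnly;
        this.wordBlacklist = wordBlacklist;
    }

    /**
     * url:            the URL the job wants probed
     * bytesUsedToday: bandwidth already spent on probing today
     * estimatedBytes: expected cost of this probe
     * onWifi:         whether the device is currently on WiFi
     */
    boolean allows(String url, long bytesUsedToday, long estimatedBytes, boolean onWifi) {
        if (wifiOnly && !onWifi) {
            return false;
        }
        if (bytesUsedToday + estimatedBytes > dailyByteLimit) {
            return false;
        }
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : wordBlacklist) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

Matching blacklisted words against the URL alone is crude, but it needs no knowledge of the page's category, which is difficult to know ahead of time.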


Distribution Frequency

To ensure effective job distribution a database is maintained that contains the following meta data.

| Meta Data | Rationale |
|---|---|
| Last Update Time | The last time the device checked in |
| Last Polled Time | The last time a URL was dispatched to the device |
| Delay Time | The minimum delay between dispatching URLs |
| Probe Count | Number of successful probes returned by device |
| ISP | Name of the ISP used by this probe |

Basic Pseudo Algorithm:

Candidate Devices = ((Current Time - Last Polled Time) > Delay Time)

Better Pseudo Algorithm:

Candidate Devices = (((Current Time - Last Polled Time) > Delay Time) AND distinct(ISP)) ORDER BY Probe Count ASC
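The better pseudo algorithm could be sketched in Java as below. The Device class mirrors the meta data listed above, with times in milliseconds; the class and field names are assumptions, and "distinct(ISP)" is interpreted as keeping the least-used eligible device per ISP.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory version of the candidate selection.
final class JobDistributor {

    static final class Device {
        final String id;
        final String isp;
        final long lastPolledMs;
        final long delayMs;
        final int probeCount;

        Device(String id, String isp, long lastPolledMs, long delayMs, int probeCount) {
            this.id = id;
            this.isp = isp;
            this.lastPolledMs = lastPolledMs;
            this.delayMs = delayMs;
            this.probeCount = probeCount;
        }
    }

    /** One device per ISP, least-used first, among devices whose delay has elapsed. */
    static List<Device> candidates(List<Device> devices, long nowMs) {
        List<Device> eligible = new ArrayList<>();
        for (Device d : devices) {
            // (Current Time - Last Polled Time) > Delay Time
            if (nowMs - d.lastPolledMs > d.delayMs) {
                eligible.add(d);
            }
        }
        // ORDER BY Probe Count ASC
        eligible.sort(Comparator.comparingInt((Device d) -> d.probeCount));

        // distinct(ISP): keep the first (least-used) device seen for each ISP
        Set<String> seenIsps = new HashSet<>();
        List<Device> chosen = new ArrayList<>();
        for (Device d : eligible) {
            if (seenIsps.add(d.isp)) {
                chosen.add(d);
            }
        }
        return chosen;
    }
}
```

In practice this would more likely be a database query; the in-memory version is only meant to pin down the intended ordering and de-duplication.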

URL Categorisation

To enable some of the user stories and provide a deeper understanding of the issues, gathering additional meta data about the filter / block would be useful.

Censored State

At a bare minimum the censor state could just be a boolean; however, it would be better if a set of predefined values could be used:

  • Explicit "Blocked" message
  • Non 200 response when a 200 response is expected
  • Erroneous Timeout
  • OK
  • Differing content type / content length
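These values map naturally onto an enum. The sketch below also shows one way a probe might choose a value by comparing its observation with the baseline gathered by the aggregation servers; the names, the order of the checks and the use of -1 for an unknown length are assumptions.

```java
// Hypothetical classification of a single probe result.
final class CensorCheck {

    enum State { EXPLICIT_BLOCK, UNEXPECTED_STATUS, TIMEOUT, CONTENT_MISMATCH, OK }

    /**
     * expected* values come from the baseline; actual* values from the probe.
     * A null expectedType or a negative expectedLength means "no baseline".
     */
    static State classify(boolean timedOut, boolean blockMessageSeen,
                          int expectedStatus, int actualStatus,
                          String expectedType, String actualType,
                          long expectedLength, long actualLength) {
        if (timedOut) {
            return State.TIMEOUT;
        }
        if (blockMessageSeen) {
            return State.EXPLICIT_BLOCK;
        }
        if (actualStatus != expectedStatus) {
            return State.UNEXPECTED_STATUS;
        }
        boolean sameType = expectedType == null
            || expectedType.equalsIgnoreCase(actualType);
        boolean sameLength = expectedLength < 0 || expectedLength == actualLength;
        return (sameType && sameLength) ? State.OK : State.CONTENT_MISMATCH;
    }
}
```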

Confidence

In the event of a timeout or a non-200 response, the probe cannot be confident that it is indeed being filtered.

Should the probe receive a payload with "Blocked" and "By Court Order", then the confidence is quite high.
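The two cases above might be reduced to a coarse confidence level as in the sketch below. The three levels, the class name and the case-insensitive phrase matching are assumptions; a real scheme would probably be finer grained.

```java
import java.util.Locale;

// Hypothetical confidence estimate for a single probe result.
final class BlockConfidence {

    enum Level { NONE, LOW, HIGH }

    /**
     * timedOut: the request timed out
     * status:   HTTP status code received (ignored when timedOut is true)
     * body:     any payload text received, or null
     */
    static Level assess(boolean timedOut, int status, String body) {
        String text = body == null ? "" : body.toLowerCase(Locale.ROOT);
        // An explicit block notice is strong evidence of filtering.
        if (text.contains("blocked") && text.contains("by court order")) {
            return Level.HIGH;
        }
        // Timeouts and non-200 responses are only weak evidence.
        if (timedOut || status != 200) {
            return Level.LOW;
        }
        return Level.NONE;
    }
}
```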