Censorship Monitoring Project Android app
Source code for a proof of concept is available here and can be installed here
Key Concepts
Automatic ISP Detection
To reduce overhead for non-technical users, and to reduce the chances of malicious users influencing ISP demographics, certain identifying elements can be extracted directly from the phone and the SIM via the TelephonyManager class.
The simple example code below shows how the current network name (e.g. EE / Orange) and the SIM operator name can be extracted automatically:
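import android.content.Context;
import android.telephony.TelephonyManager;
TelephonyManager telephonyManager = (TelephonyManager) getActivity().getSystemService(Context.TELEPHONY_SERVICE);
String mobileNet = telephonyManager.getNetworkOperatorName(); // network the device is currently registered on
String simNet = telephonyManager.getSimOperatorName(); // operator that issued the SIM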
Similarly, the WifiManager class can be used to detect whether the phone / tablet is currently using a WiFi network.
If a connection to a wireless network is identified, the user can be prompted to identify their ISP. Alternatively, a simple GET request can be issued to a remote service with a JSON endpoint (e.g. http://wtfismyip.com/json) or to an ORG managed service, which can return additional OONI information such as the AS number.
Identifying Optional Filtering Level
Certain mobile phone providers offer varying levels of optional filtering, so two devices on the same network may experience different filtering behaviour.
To ensure accuracy and prevent misleading false positives, some effort should be made to periodically ascertain the level of filtering present on the device before new, unknown URLs are tested.
To achieve this, a canary list of known blocked URLs for given filter levels should be maintained. When these URLs are tested upon device setup (and periodically thereafter), the results should give a clear indication of the level of filtering to expect on the device.
The app could either request the full canary list (with legal and bandwidth ramifications) or a cut-down version that contains only the generic and ISP-specific URLs.
Example Canary List
{
  "generic": {
    "copyright": [
      "piratebay.se",
      "xyz.com"
    ],
    "gambling": [
      "bobsgamblingemporium.com",
      "paydaygamblingloans.xxx"
    ]
  },
  "ISP WITH FILTER TYPES": {
    "FILTER NAME A": [
      "FILTER_A_TYPE_URL_1",
      "FILTER_A_TYPE_URL_2"
    ],
    "FILTER NAME B": [
      "FILTER_B_TYPE_URL_1",
      "FILTER_B_TYPE_URL_2"
    ]
  },
  "ISP WITH CATEGORIES": {
    "pornography": [
      "xxx.xxx",
      "sex.com"
    ],
    "gambling": [
      "somegamblingplace.com",
      "anothergamblingplace.com"
    ],
    "esoteric": [
      "hipsters.com"
    ]
  }
}
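The canary check described above could be evaluated along the following lines. This is only a sketch: the class name CanaryEvaluator, the method activeFilters and the 50% threshold are assumptions made for illustration, not part of the existing app.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: infers which filters appear active from canary results. */
final class CanaryEvaluator {

    /** Assumed threshold: a filter counts as active when at least half of its canaries are blocked. */
    static final double ACTIVE_THRESHOLD = 0.5;

    /**
     * @param canaries filter or category name mapped to its canary URLs
     * @param blocked  canary URLs that the device found to be blocked
     * @return names of the filters that appear to be active, in map iteration order
     */
    static List<String> activeFilters(Map<String, List<String>> canaries, Set<String> blocked) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : canaries.entrySet()) {
            List<String> urls = entry.getValue();
            if (urls.isEmpty()) {
                continue; // nothing to test for this filter
            }
            int hits = 0;
            for (String url : urls) {
                if (blocked.contains(url)) {
                    hits++;
                }
            }
            if ((double) hits / urls.size() >= ACTIVE_THRESHOLD) {
                active.add(entry.getKey());
            }
        }
        return active;
    }
}
```

Passing one section of the canary list (for example the categories of a single ISP) together with the set of canary URLs that failed gives the filter names to record against the device before unknown URLs are tested.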
Uniquely Identifying Devices
An MD5 hash of the Android ANDROID_ID can be used to uniquely identify each device.
The Android ID is a 64-bit number (as a hex string) that is randomly generated when the user first sets up the device and should remain constant for the lifetime of the user's device. The value may change if a factory reset is performed on the device, or it may be altered by malicious users[1].
To prevent abuse, the ANDROID_ID could be used as a seed / salt for generating a UUID that can be tied to a user account during registration.
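A minimal sketch of both ideas in plain Java follows. The ANDROID_ID is passed in as a string (on a device it would come from Settings.Secure); the class name DeviceIdentity, the method names and the way the salt is combined with the ID are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

/** Hypothetical helper for deriving device identifiers from ANDROID_ID. */
final class DeviceIdentity {

    /** Returns the lowercase hex MD5 digest of the given ANDROID_ID string. */
    static String md5Hex(String androidId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(androidId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Derives a stable, name-based UUID from the ANDROID_ID and an
     * account-specific salt (the "salt:id" layout is an assumption).
     */
    static UUID accountUuid(String androidId, String accountSalt) {
        byte[] name = (accountSalt + ":" + androidId).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(name);
    }
}
```

UUID.nameUUIDFromBytes is deterministic, so the same device and salt always map to the same UUID, while a different salt yields an unrelated value.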
Spoofing User Agents
Android HTTP requests (GET, POST, HEAD, etc.) can carry custom HTTP headers.
User Agent strings can be passed to devices along with other probe payload elements.
// HttpHead comes from the Apache HttpClient (org.apache.http.client.methods.HttpHead)
HttpHead headRequest = new HttpHead(checkURL);
headRequest.setHeader("User-Agent", "OONI Probe");
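For comparison, the same request can be prepared with java.net.HttpURLConnection instead of the Apache client. This is a sketch rather than the app's code: the class name ProbeRequest, the method prepareHead and the 10 second timeouts are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper that prepares probe requests with a custom User-Agent. */
final class ProbeRequest {

    /** Prepares (but does not send) a HEAD request carrying the given User-Agent. */
    static HttpURLConnection prepareHead(String checkUrl, String userAgent) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(checkUrl).toURL().openConnection();
            connection.setRequestMethod("HEAD");
            connection.setRequestProperty("User-Agent", userAgent);
            // Do not follow redirects automatically: a redirect may itself be a block page.
            connection.setInstanceFollowRedirects(false);
            connection.setConnectTimeout(10_000); // assumed timeout, in milliseconds
            connection.setReadTimeout(10_000);    // assumed timeout, in milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing is sent until a method such as getResponseCode() is called on the returned connection, so the User-Agent delivered with a probe payload can be applied per request.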
URL Payload Delivery
Polling is expensive for the battery, especially when a poll results in no work. With bit.ly alone generating URL traffic in the realm of hundreds of millions per day[2], we'll never be short of work.
GCM Full Payload
Google Cloud Messaging (GCM) enables correctly authenticated servers to push payload messages to apps.
A URL, the MD5 hash, expected header information (content type, length, etc.) and response code, an urgency flag and an optional HMAC salt easily fit within the permissible payload limit (4 KB).
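A server-side check that a job fits the limit might look like the sketch below. The class name PayloadBudget is hypothetical, and counting the UTF-8 bytes of keys and values is only an approximation of how the limit is applied, so it should be read as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical helper that estimates whether a push payload fits the 4 KB limit. */
final class PayloadBudget {

    /** Payload limit quoted in the text (4 KB). */
    static final int GCM_LIMIT_BYTES = 4096;

    /** Approximates the payload size as the UTF-8 length of all keys and values. */
    static int sizeInBytes(Map<String, String> data) {
        int total = 0;
        for (Map.Entry<String, String> entry : data.entrySet()) {
            total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += entry.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Returns true when the estimated size is within the limit. */
    static boolean fits(Map<String, String> data) {
        return sizeInBytes(data) <= GCM_LIMIT_BYTES;
    }
}
```

A job that does not fit could fall back to a notification-only message that asks the device to fetch the details itself.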
GCM Notification Payload
Because Google could suppress delivery of messages with flagged payloads, an alternative would be a payload consisting of a simple notification that it would be good (if possible) to query an ORG server for new URLs.
This would be useful as it still saves some battery, is unlikely to be interfered with and can pass urgent requests to devices as needed (e.g. if the job distribution servers urgently want full coverage of all ISPs for a particular URL).
Polling
To rule out interference with message delivery (assuming that the ORG servers can be reached), devices can be set to poll the ORG servers directly at set intervals using the AlarmManager class.
This is the most battery-hungry option, but it has the fewest points of interference or failure.
Notifying Users
Android has a very rich notification system that allows a wide variety of information to be displayed.
For example, the Bowdlerize app will show a ticker (scrolling text) on the notification bar when a new URL is received.
Expanding the notification area will show an indeterminate progress bar and some information (e.g. the MD5 hash, last poll time, etc.).
Once the URL has finished being resolved, a ticker is sent again if the URL was possibly censored.
The main notification changes to reflect the last poll time and last result set.
Reducing Bandwidth / Battery Utilisation & Legal Liability (?)
HTTP HEAD
The HTTP HEAD method is defined in section 9.4 of RFC 2616:
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
The benefit is that correctly behaving servers will return exactly the same headers, but the phone / tablet will not have to download the entire payload (saving bandwidth) and will never receive any bytes of the (potentially) illegal material.
This has been tested against known blocked URLs (e.g. piratebay), and they are correctly interpreted as censored.
T-mobile allows the use of HEAD, but not GET on blocked URLs. This may apply to additional ISPs --Korikisulda (talk) 15:39, 15 February 2014 (GMT)
Frequency
The Android radio will stay awake for up to 30 seconds after any wakeup, so constantly querying URLs will drain a lot of battery even if each transfer is only a few hundred bytes.
Storing a minimum permissible interval on the task distribution backend will ensure that jobs are only dispatched to phones / tablets on a schedule that their users are happy with.
By using the BatteryManager class, different intervals can be allowed when the device is charging.
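The interval choice can be sketched as a small pure function, with the charging state supplied by the caller (on a device it would come from BatteryManager). The class name PollSchedule and the decision to never go below the 30 second radio tail are assumptions for illustration.

```java
/** Hypothetical helper that picks the delay before the next probe. */
final class PollSchedule {

    /** Radio tail time from the text: the radio may stay awake for up to 30 seconds. */
    static final long RADIO_TAIL_MILLIS = 30_000L;

    /**
     * @param charging        whether the device is currently charging
     * @param onBatteryMillis interval the user accepts while on battery
     * @param chargingMillis  interval the user accepts while charging
     * @return the interval to use, never shorter than the radio tail time (assumed floor)
     */
    static long nextIntervalMillis(boolean charging, long onBatteryMillis, long chargingMillis) {
        long chosen = charging ? chargingMillis : onBatteryMillis;
        return Math.max(chosen, RADIO_TAIL_MILLIS);
    }
}
```

The resulting value is what the device would report to the task distribution backend as its minimum interval.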
Discovering Censorship
Current Implementation
Currently the Bowdlerize app identifies censorship (with varying degrees of confidence) if any of the following are true:
- An HTTP 403 or 404 response code is received
- A header contains "forbidden" or "blocked"
- Any of the following exceptions is unexpectedly thrown:
  - ConnectTimeoutException
  - NoHttpResponseException
  - IOException
  - IllegalStateException
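The checks listed above can be sketched in plain Java as follows. This is not the Bowdlerize source: the class and method names are hypothetical, header names and values are both searched (an assumption about what "a header contains" means), and the two Apache exceptions are covered through their IOException superclass.

```java
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical re-statement of the current detection rules. */
final class CensorshipHeuristic {

    /**
     * @param statusCode HTTP response code, ignored when an error is given
     * @param headers    response headers (names mapped to values)
     * @param error      exception thrown by the request, or null if none
     */
    static boolean looksCensored(int statusCode, Map<String, List<String>> headers, Exception error) {
        // ConnectTimeoutException and NoHttpResponseException are IOException subclasses.
        if (error instanceof IOException || error instanceof IllegalStateException) {
            return true;
        }
        if (statusCode == 403 || statusCode == 404) {
            return true;
        }
        for (Map.Entry<String, List<String>> header : headers.entrySet()) {
            if (mentionsBlock(header.getKey())) {
                return true;
            }
            for (String value : header.getValue()) {
                if (mentionsBlock(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    private static boolean mentionsBlock(String text) {
        if (text == null) {
            return false; // some HTTP clients report the status line under a null key
        }
        String lower = text.toLowerCase(Locale.ROOT);
        return lower.contains("forbidden") || lower.contains("blocked");
    }
}
```

Because a 404 or a timeout can also occur on an unfiltered connection, a positive result here is best treated as "possibly censored" rather than as proof.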
Bringing in line with OONI Specification
During the URL gathering stage headers and response codes are gathered.
These should be sent downstream to the device for comparison to further evaluate the extent of any tampering.
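One way the device-side comparison could work is sketched below. The Observation class and its three fields are assumptions chosen to mirror the baseline values mentioned in this document (response code, content type, length); they are not taken from the OONI specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical record of what was observed for a URL, on the server or on a probe. */
final class Observation {
    final int statusCode;
    final String contentType;
    final long contentLength; // -1 when no Content-Length header was sent

    Observation(int statusCode, String contentType, long contentLength) {
        this.statusCode = statusCode;
        this.contentType = contentType;
        this.contentLength = contentLength;
    }

    /** Lists the fields in which the probe's observation departs from the baseline. */
    static List<String> differences(Observation baseline, Observation probe) {
        List<String> diffs = new ArrayList<>();
        if (baseline.statusCode != probe.statusCode) {
            diffs.add("status");
        }
        if (!Objects.equals(baseline.contentType, probe.contentType)) {
            diffs.add("content-type");
        }
        if (baseline.contentLength != probe.contentLength) {
            diffs.add("content-length");
        }
        return diffs;
    }
}
```

An empty list means the probe saw what the aggregation servers saw; the more fields differ, the stronger the hint of tampering.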
Process Flow
URL Aggregation Servers
Servers in data centres (especially those where ORG or ORG volunteers control BGP) are unlikely to experience filtering.
When URLs are submitted by users or harvested via 3rd party aggregation methods, they are checked by the servers to ascertain a baseline for headers, content type, length and HTTP response code.
Once checked (preferably from several locations with differing User Agent strings), the URLs can be added to a database for distribution to probes.
URL Profiling
During initial URL probing it would be useful to gather additional meta data about the URL.
- AS Number / AS Path
- Traceroute
- SSL meta data (Present / key length / CA)
- Net block
- Registrar
- URL arguments / fragments
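The items that can be read from the URL string itself (scheme, host, arguments, fragment) can be pulled out with java.net.URI, as in this sketch. The class name UrlProfile is hypothetical; AS data, traceroutes, certificate details and registrar lookups need network queries and are not covered here.

```java
import java.net.URI;

/** Hypothetical holder for the parts of a profile that come from the URL string alone. */
final class UrlProfile {
    final String host;
    final String query;    // URL arguments, or null if there are none
    final String fragment; // fragment after '#', or null if there is none
    final boolean https;   // whether the URL uses TLS according to its scheme

    private UrlProfile(String host, String query, String fragment, boolean https) {
        this.host = host;
        this.query = query;
        this.fragment = fragment;
        this.https = https;
    }

    /** Parses the URL; throws IllegalArgumentException if it is not a valid URI. */
    static UrlProfile of(String url) {
        URI uri = URI.create(url);
        return new UrlProfile(
                uri.getHost(),
                uri.getQuery(),
                uri.getFragment(),
                "https".equalsIgnoreCase(uri.getScheme()));
    }
}
```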
Job Distribution
It is important to ensure that no given device is overused, that no ISP is over-represented and that user preferences are honoured.
Possible user preferences:
- Total (Daily / Hourly) bandwidth limits
- Interval between requests
- Only use WiFi
- Only query when the phone is in active use (e.g. don't wake the device up to query)
- URL categories (Difficult to know ahead of time)
- Word blacklists
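A few of these preferences can be checked before a job is accepted, as sketched below. The class name ProbePreferences, its fields and the substring matching used for the word blacklist are all assumptions; only the WiFi-only, daily bandwidth and blacklist preferences are modelled.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical subset of the user preferences listed above. */
final class ProbePreferences {
    final boolean wifiOnly;
    final long dailyByteLimit;
    final List<String> wordBlacklist;

    ProbePreferences(boolean wifiOnly, long dailyByteLimit, List<String> wordBlacklist) {
        this.wifiOnly = wifiOnly;
        this.dailyByteLimit = dailyByteLimit;
        this.wordBlacklist = wordBlacklist;
    }

    /** Returns true if probing the URL now would respect these preferences. */
    boolean permits(String url, boolean onWifi, long bytesUsedToday, long estimatedBytes) {
        if (wifiOnly && !onWifi) {
            return false;
        }
        if (bytesUsedToday + estimatedBytes > dailyByteLimit) {
            return false;
        }
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String word : wordBlacklist) {
            if (lowerUrl.contains(word.toLowerCase(Locale.ROOT))) {
                return false; // URL contains a word the user does not want to probe
            }
        }
        return true;
    }
}
```

The same check could run on the distribution backend, so that jobs a device would refuse are never dispatched in the first place.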
Distribution Frequency
To ensure effective job distribution, a database is maintained that contains the following meta data.
| Meta Data | Description |
|---|---|
| Last Update Time | The last time the device checked in |
| Last Polled Time | The last time a URL was dispatched to the device |
| Delay Time | The minimum delay between dispatching URLs |
| Probe Count | Number of successful probes returned by the device |
| ISP | Name of the ISP used by this probe |
Basic Pseudo Algorithm:
Candidate Devices = ((Current Time - Last Polled Time) > Delay Time)
Better Pseudo Algorithm:
Candidate Devices = (((Current Time - Last Polled Time) > Delay Time) AND distinct(ISP)) ORDER BY Probe Count ASC
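The better algorithm could be implemented along these lines. The sketch reads distinct(ISP) as "keep one device per ISP, preferring the one with the lowest probe count", which is an interpretation rather than something the pseudo code states; the Device and JobDistributor names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row of the device meta data table. */
final class Device {
    final String id;
    final String isp;
    final long lastPolledMillis;
    final long delayMillis;
    final int probeCount;

    Device(String id, String isp, long lastPolledMillis, long delayMillis, int probeCount) {
        this.id = id;
        this.isp = isp;
        this.lastPolledMillis = lastPolledMillis;
        this.delayMillis = delayMillis;
        this.probeCount = probeCount;
    }
}

final class JobDistributor {

    /** Returns one eligible device per ISP, ordered by ascending probe count. */
    static List<Device> candidates(List<Device> devices, long nowMillis) {
        Map<String, Device> bestPerIsp = new LinkedHashMap<>();
        for (Device device : devices) {
            if (nowMillis - device.lastPolledMillis <= device.delayMillis) {
                continue; // still inside the device's minimum delay
            }
            Device current = bestPerIsp.get(device.isp);
            if (current == null || device.probeCount < current.probeCount) {
                bestPerIsp.put(device.isp, device);
            }
        }
        List<Device> result = new ArrayList<>(bestPerIsp.values());
        result.sort((a, b) -> Integer.compare(a.probeCount, b.probeCount));
        return result;
    }
}
```

Preferring the least-used device on each ISP spreads work across devices while still giving every ISP one probe per dispatch round.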
URL Categorisation
To enable some of the user stories and provide a deeper understanding of the issues, it would be useful to gather additional meta data about the filter / block.
Censored State
At a bare minimum the censor state could just be a boolean; however, it would be better if a fixed set of values could be used:
- Explicit "Blocked" message
- Non 200 response when a 200 response is expected
- Erroneous Timeout
- OK
- Differing content type / content length
Confidence
In the event of a timeout or a non-200 response, the probe cannot be confident that it is indeed being filtered.
Should the probe receive a payload with "Blocked" and "By Court Order", then the confidence is quite high.
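The censored state and the confidence could be modelled as two enums, as in this sketch. The enum names, the four confidence levels and the MEDIUM level for a block message without a court order are assumptions; only the low confidence for timeouts and non-200 responses and the high confidence for "Blocked" plus "By Court Order" come from the text.

```java
import java.util.Locale;

/** Hypothetical encoding of the censored states listed in this section. */
enum CensorState { OK, EXPLICIT_BLOCK_MESSAGE, UNEXPECTED_STATUS, TIMEOUT, CONTENT_MISMATCH }

/** Hypothetical confidence scale. */
enum Confidence { NONE, LOW, MEDIUM, HIGH }

final class ConfidenceScorer {

    /**
     * @param state the state reported by the probe
     * @param body  any payload text the probe received, or null
     */
    static Confidence score(CensorState state, String body) {
        if (state == CensorState.OK) {
            return Confidence.NONE;
        }
        String text = body == null ? "" : body.toLowerCase(Locale.ROOT);
        if (text.contains("blocked") && text.contains("by court order")) {
            return Confidence.HIGH; // explicit wording, as described in the text
        }
        if (state == CensorState.EXPLICIT_BLOCK_MESSAGE) {
            return Confidence.MEDIUM; // assumed level for a block message without a court order
        }
        return Confidence.LOW; // timeouts, unexpected statuses and mismatches are ambiguous
    }
}
```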