ORG-tech-vols IRC meeting log 2014 04 24

(20:37:54) vasilis: Hi everyone
(20:38:23) dantheta: Hi!
(20:41:08) graphiclunarkid: Evening - gosh, I think I forgot all about this, sorry!
(20:41:26) graphiclunarkid: Should have been in my calendar but isn't for some reason.
(20:41:45) dantheta: hehe - 'tis all good. How's things?
(20:42:15) vasilis: Nice I finally found this nasty bug on the Pi probe.
(20:42:17) graphiclunarkid: Not too bad thanks. Pretty busy actually. Feels like I've been doing loads and getting nowhere though - one of those days :-\
(20:42:52) graphiclunarkid: vasilis: Squashing bugs is good :)
(20:42:59) dantheta: vasilis: I've written the new image to a memory card and will be firing up the new image after the meeting!
(20:43:23) _guy: evening all
(20:44:06) _guy: i managed to brute-force 3100 bt blocked domains this afternoon.. will be writing something up
(20:44:13) graphiclunarkid: Hey _guy - how's things?
(20:44:15) vasilis: In the meantime I got some preliminary results about BT Sky broadband connection.
(20:45:15) vasilis: _guy: which list are you using?
(20:46:17) _guy: just as an experiment, i used http://winhelp2002.mvps.org/hosts.txt with an additional list of social media sites, tested at various levels of block
(20:47:02) dantheta: _guy: Very cool!
(20:47:22) graphiclunarkid: BTW _guy, thanks for fixing us up with the details of how BT connections react when you request a blocked site. And thanks dantheta for closing the relevant github issues :)
(20:48:20) vasilis: _guy: Nice!
(20:48:50) dantheta: I've re-written the golang probe (from previous weeks) in Python to use the API config file and submit results. Needs a little fine-tuning, but it's pretty much ready to run on the A&A lines
(20:49:36) dantheta: I also had a couple more thoughts on wrapping ooni (to get URLs from ORG) and getting results sent to ooni and ORG DB at the same time.
(20:50:01) graphiclunarkid: dantheta: That's cool! I might take a look at that - I can probably python enough to read it at least ;-)
(20:50:02) _guy: are we aiming for a comprehensive list of blocked sites?
(20:50:40) dantheta: graphiclunarkid: Sure, I'll pop it on Github in the near future
(20:50:41) graphiclunarkid: _guy: In our next iteration we're just aiming to be able to test the URLs people submit to us via blocked.org.uk.
(20:51:19) dantheta: We're aiming to get a fast answer (blocked/not blocked) on as many ISPs for a supplied URL.
(20:51:25) graphiclunarkid: _guy: Longer term, though, I'd love to be able to reverse-engineer and publish the lists of blocked sites per category for each provider ;-)
(20:51:56) _guy: heh, yes. that's kinda what i had in mind today
(20:52:16) vasilis: _guy: another idea would be to write up a tech report (or several). Are you using any scripts for data "polishing"?
(20:52:35) _guy: i was wondering if i could get the webapp to reveal which category a given site is blocked under, but that's a bit harder
(20:53:15) dantheta: Yeah, T-Mobile and Vodafone don't give up that information either. They use the same blocked status page whatever the site, whatever the reason
(20:53:28) _guy: today's experiment was done with a simple shell script i wrote on the fly, but I recorded all my results
(20:53:48) _guy: bt block page includes the url in the function arguments
(20:54:12) _guy: but I guess that's for onward processing, as you're permitted to add it to a whitelist
(20:54:34) _guy: manipulation of GET params is out of scope ;)
(20:54:37) dantheta: I think it also indicates which list (light/moderate/strict), but not which blocking category, is that right?
(20:54:44) _guy: yep, that's it
(20:55:14) graphiclunarkid: _guy: Your wikipedia submissions show which ISPs share back-end filtering solutions. Maybe if one such ISP categorises we can infer category for another using the same supplier? Won't work in every case but perhaps it will for some.
(20:55:32) _guy: i have incremental tests though, so i can approximate which domains are in which categories to a degree
(20:56:26) _guy: I imagine they'll use categorisation and lists from the same third party, or certainly one of a few
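The incremental approach _guy describes (testing the same domains at each filter level) could be summarised along these lines. This is only a sketch: the level names come from the light/moderate/strict lists mentioned above, while the function name and the example domains are hypothetical, and it yields the least strict level that blocks a domain, not the ISP's actual category.

```python
def first_blocking_level(blocked_by_level, levels=("light", "moderate", "strict")):
    """Map each domain to the least strict level at which it was seen blocked."""
    first = {}
    for level in levels:  # walk from least to most strict
        for domain in blocked_by_level.get(level, ()):
            # setdefault keeps the earliest (least strict) level recorded
            first.setdefault(domain, level)
    return first


# Hypothetical results: each set holds the domains blocked at that level.
results = {
    "light": {"a.example"},
    "moderate": {"a.example", "b.example"},
    "strict": {"a.example", "b.example", "c.example"},
}
levels_found = first_blocking_level(results)
print(levels_found)
```

Domains that only appear at the strict level are candidates for the categories that the stricter list adds, which is about as far as this kind of inference can go.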
(20:56:30) graphiclunarkid: The only other way would be to write a system that changes the filtering options on the line as part of its tests - but my worry is that would be (a) hard and (b) a red flag to the ISP that something non-standard was happening.
(20:57:03) graphiclunarkid: BTW is our probe's User Agent string still "Claire Perry"?!
(20:57:15) _guy: that shouldn't be difficult, but i'd be inclined to agree
(20:57:17) vasilis: ??
(20:57:19) dantheta: Could be pretty time/bandwidth consuming. I seem to remember >10 categories on the BT list configuration
(20:57:43) graphiclunarkid: vasilis: Claire Perry is the British politician in charge of pushing for default-on filtering "for the children"
(20:57:55) vasilis: :)
(20:58:19) dantheta: graphiclunarkid: Not on this version - At the moment it's something pythonish (from the requests library). Should set a proper useragent.
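Setting an explicit User-Agent, as dantheta suggests, might look like the sketch below. It uses the standard library's urllib rather than the requests library the probe uses, purely to stay self-contained; with requests the same header would go in the headers argument of requests.get. The agent string is a made-up placeholder, not the project's actual choice.

```python
import urllib.request

# Hypothetical identifier; the log does not say what the probe should call itself.
PROBE_USER_AGENT = "OrgProbe/0.1 (+http://blocked.org.uk)"


def build_request(url, user_agent=PROBE_USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/")
# urllib stores header names capitalised as "User-agent".
print(req.get_header("User-agent"))
```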
(20:58:29) _guy: graphiclunarkid: who did i have to ask nicely for a Pi image?
(20:58:29) graphiclunarkid: dantheta: Yeah. I think our initial version should just be a yes/no answer with lines in their default configurations.
(20:58:48) graphiclunarkid: _guy that would be vasilis :)
(20:59:00) _guy: aha, cheers :)
(20:59:10) vasilis: Related to our discussion, I saw that BT sky uses a different way of blocking.
(20:59:11) _guy: vasilis: plz may I have a Pi image? :)
(20:59:28) graphiclunarkid: vasilis: BT and Sky are two different ISPs. Did you mean Sky Broadband?
(21:00:19) vasilis: _guy: Definitely!
(21:01:05) vasilis: _guy: Mail me with your pub key.
(21:02:11) vasilis: graphiclunarkid: British Sky
(21:02:18) vasilis: I see..
(21:02:31) _guy: vasilis: http://www.gtv8.org/guy.asc
(21:03:32) dantheta: (be back in a mo)
(21:03:37) vasilis: _guy: thx, hopefully I'll upload the image build script to ORG's server today.
(21:04:18) graphiclunarkid: vasilis: Ah ok. Yeah, I think the parent company was/is called British Sky Broadcasting Corporation.
(21:05:01) ***graphiclunarkid yoinks _guy's public key
(21:05:32) vasilis: graphiclunarkid: any news about the A&A lines?
(21:05:37) graphiclunarkid: My key is here if anyone wants it: https://richardskingdom.net/publickey.asc
(21:05:55) graphiclunarkid: vasilis: Nothing this week. Plett, you about this evening?
(21:06:50) graphiclunarkid: vasilis: Possibly A&A staff are or have been on holidays over Easter.
(21:07:18) vasilis: _guy: I need an SSH pub key..
(21:08:35) dantheta: Ah, I was going to ask about A&A. Hopefully we'll hear back soon.
(21:08:45) graphiclunarkid: vasilis: So do you have access to a Sky Broadband line now then?
(21:09:00) _guy: oh, haha, ok..
(21:09:30) vasilis: graphiclunarkid: Yep
(21:10:09) graphiclunarkid: vasilis: Cool :)
(21:10:41) vasilis: I have already done some tests with the alexa top10k list
(21:10:52) vasilis: and I'll go for the 1m
(21:11:05) vasilis: but I
(21:11:26) dantheta: Outstanding! Do you have details of how their blocking solution works?
(21:11:28) graphiclunarkid: vasilis: Could you possibly let dantheta know the details of how they react when one requests a blocked page so he can write a probe config file? https://github.com/openrightsgroup/Blocking-Middleware/issues/8
(21:11:44) vasilis: I would also like to work with the ooni-backend, do some tests and check if it's working well.
(21:11:54) graphiclunarkid: Haha - great minds, dantheta!
(21:12:19) dantheta: Would never want to hesitate in getting an open issue closed!
(21:12:31) graphiclunarkid: lol!
(21:12:47) dantheta: ORG is currently hosting an ooni-backend, is that right?
(21:12:51) vasilis: Yep, I'll update ASAP. I've already checked the ML topics about the probe config files, very good work guys!
(21:13:32) vasilis: dantheta: Yes, but I haven't updated the code and ooni has had some significant changes.
(21:14:01) dantheta: I think the Ooni-backend is on a different server from the blocked.org.uk api, is that right?
(21:14:10) vasilis: I have started making some tests to check how the ooni backend performs locally.
(21:14:22) vasilis: dantheta: It runs on a different server.
(21:14:32) dantheta: OK, that's what I thought. Thanks!
(21:17:19) dantheta: So, I was running the golang version of the probe against t-mobile and vodafone last week, using the live API. It was running supervised for a few hours, and processed 6381 URLs, and gave us 214 blocked URLs.
(21:17:45) dantheta: I've also worked out where the URL list that already existed in the database came from.
(21:19:10) graphiclunarkid: dantheta: Interesting! Where?
(21:19:24) dantheta: I think it would be worth loading the alexa 10k into the DB at a higher priority than the social URLs that are already there. The database schema supports prioritization (though this hasn't been written into the API fetch routine yet), so we can prefer URLs that are sent from the frontend over alexa and choose alexa over social URLs.
(21:19:39) vasilis: dantheta: Which URL list are you using?
(21:20:51) dantheta: We inherited a URL list in the database that NetworkString set up. I think this URL list came from a twitter bot or firehose of some sort. It definitely has a "social" feel to it. There are a lot of facebook and meme image links in there.
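The ordering dantheta proposes (frontend submissions first, then alexa, then the inherited social URLs) can be illustrated with a small in-memory queue. This is a sketch only: the real prioritization lives in the database schema and the API fetch routine, and the class name, numeric priorities and example URLs here are assumptions.

```python
import heapq
import itertools

# Lower number = fetched first. The numeric values are assumptions; the
# discussion only gives the ordering frontend > alexa > social.
PRIORITY = {"frontend": 0, "alexa": 1, "social": 2}


class UrlQueue:
    def __init__(self):
        self._heap = []
        # The counter breaks ties so URLs from one source stay in FIFO order.
        self._counter = itertools.count()

    def add(self, url, source):
        heapq.heappush(self._heap, (PRIORITY[source], next(self._counter), url))

    def next_url(self):
        """Pop the highest-priority URL, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = UrlQueue()
queue.add("http://social.example/meme.jpg", "social")
queue.add("http://alexa.example/", "alexa")
queue.add("http://submitted.example/", "frontend")
order = [queue.next_url() for _ in range(3)]
print(order)
```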
(21:20:51) graphiclunarkid: In other news, ORG's video funding campaign is now 100% funded, so we have a hard deadline for the Version 2.0 milestone! https://www.indiegogo.com/projects/stop-uk-internet-censorship
(21:21:27) vasilis: Yey!
(21:21:43) vasilis: wget https://s3.amazonaws.com/alexa-static/top-1m.csv.zip && gunzip -c top-1m.csv.zip | head -n 10000 | cut -d ',' -f 2 | sed 's,^,http://,g' > alexa-top10k.$(date +%Y%m%d)
(21:22:15) vasilis: dantheta: a one-liner for the latest version of the alexa top10k list
(21:22:54) dantheta: OK - that's a definite. I'm also a lot more comfortable having bots chew through a list like Alexa than a load of unverified deep links to random places :P
(21:23:20) graphiclunarkid: vasilis: No TLS sites in the top 10k? Or did I read that sed command wrong? Or do they usually 301 redirect from http to https?
(21:24:25) _guy: sed inserts a static http:// in front of each domain
(21:24:33) vasilis: graphiclunarkid: The list has actually no http(s) only the rank and the domain, ex: 1,google.com
(21:24:35) _guy: s/^/that/
(21:25:01) graphiclunarkid: Ah, I see. Thanks.
(21:25:34) _guy: seeing as the list is only domains, adding http:// to each may not create valid urls...
(21:25:51) _guy: useful list though... i'll feed it to BT nameservers
(21:26:13) vasilis: Actually it would be a nice idea to try the same urls for https as well.
(21:26:16) graphiclunarkid: vasilis: Did you find any blocked sites in the top 10k on Sky?
(21:26:40) vasilis: _guy: why not? most of them are websites, or am I missing anything here?
(21:26:55) vasilis: graphiclunarkid: Yes but I need to process the data.
(21:27:06) dantheta: _guy: I was wondering if it would be possible to get the IP addresses of the nameservers that BT are piping the blocked requests to. When I previously had access to a BT line, all of the details were wrapped up behind the home-hub.
(21:28:19) _guy: the list appears to be domains, not hosts.. so if there's no redirect for domain.com to www.domain.com (or other listening httpd), the url will be broken
(21:29:21) _guy: dantheta: i'd wondered the same.. am having a few issues getting into my router tho /o
(21:30:00) _guy: when filtering is enabled, some modification to the router is made.. my guess is that it's the upstream nameservers
(21:30:24) vasilis: _guy: AFAIK all of them are websites, see: https://alexa.zendesk.com/hc/en-us/articles/200449744-How-are-Alexa-s-traffic-rankings-determined-
(21:30:47) _guy: ah, fairynuff
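A Python equivalent of the one-liner above, extended with vasilis's idea of testing https as well, could look like this sketch. It assumes the "rank,domain" row format described in the discussion (for example 1,google.com); the function name and parameters are hypothetical, and it does not address _guy's point that a bare domain may not answer without a www prefix.

```python
import csv
import io


def alexa_urls(csv_text, limit=10000, schemes=("http", "https")):
    """Turn 'rank,domain' rows into URLs, one per scheme, for the first `limit` rows."""
    urls = []
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if i >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        domain = row[1].strip()
        for scheme in schemes:
            urls.append(f"{scheme}://{domain}")
    return urls


sample = "1,google.com\n2,facebook.com\n"
print(alexa_urls(sample, limit=2))
```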
(21:31:48) graphiclunarkid: _guy: That would be interesting, since AFAIK BT intercepts DNS requests and redirects them all to its own nameservers to prevent filter circumvention of the 8.8.8.8 variety.
(21:32:13) _guy: no, you can still change your nameservers to arbitrary ones
(21:32:43) graphiclunarkid: _guy: It's probably not quite as simple as blocking / non-blocking nameservers though? If blocking is turned on do they intercept all DNS traffic even if you've specified your own name servers?
(21:32:49) _guy: so evasion is still as easy as just typing "host filth.com 8.8.8.8"
(21:32:50) dantheta: When I tried it a while back, it even intercepted DNS requests destined for servers that weren't even running a DNS service.
(21:33:12) _guy: hm
(21:33:17) _guy: ok, i'll test that out
(21:33:32) _guy: querying another NS worked fine for me yesterday
(21:33:39) graphiclunarkid: Probably reusing those expensive Websense DPI boxen they foolishly bought from Phorm ;-)
(21:33:46) dantheta: It might have changed - it has been a few months.
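One way to settle the question above would be to resolve a known-blocked name through both the ISP's resolver and a third-party one, then compare the answers. The sketch below assumes the answers have already been collected (for instance from the host command _guy mentions) and that the block page addresses are known; the function name and labels are hypothetical, and all addresses are documentation placeholders, not real BT values.

```python
def classify_dns(isp_answers, public_answers, block_addresses):
    """Classify one hostname from the addresses two resolvers returned for it."""
    isp, public, block = set(isp_answers), set(public_answers), set(block_addresses)
    if public & block:
        # The third-party query still came back with a block address,
        # which suggests DNS traffic is being intercepted in transit.
        return "intercepted"
    if isp & block:
        # Only the ISP's own resolver rewrites the answer.
        return "resolver-only"
    return "not-blocked-by-dns"


block = {"192.0.2.1"}  # placeholder block page address
print(classify_dns({"192.0.2.1"}, {"198.51.100.7"}, block))
```

A "not-blocked-by-dns" result says nothing about filtering done by other means, so it would need to be paired with an HTTP-level check.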
(21:34:16) dantheta: hehe!
(21:36:38) _guy: phorm's kit only tapped tho.. not capable of packet injection
(21:37:24) dantheta: I'm afraid I'm going to need to head off shortly. I think I saw a couple of weeks ago that the ModX form submission thing was fixed, so this weekend I'll look at grabbing URLs from there to add to the API queue. I'll also put the python probe onto github.
(21:38:52) dantheta: vasilis: There will also be a raspi running this evening :)
(21:39:09) graphiclunarkid: dantheta: Yeah, I've fixed that and submitted an upstream patch, though the project looks pretty dead.
(21:40:04) graphiclunarkid: dantheta: I need to look into what's on the ModX instance. I think Alexxx is going to be away for the duration now so it'll be up to the rest of us to get that all finished.
(21:41:06) vasilis: dantheta: Nice!
(21:41:30) graphiclunarkid: Just another couple of quick notes: waffle.io have just launched multi-repository support so I might have a go at merging our issues lists sometime next week.
(21:41:45) dantheta: I can see if I can rustle up a frontend dev, if that's any help? I have to admit, I'd never heard of ModX before ORG ...
(21:42:24) graphiclunarkid: Also still welcoming comments on the mailing list on moving infrastructure away from Github and to our own self-hosted services.
(21:43:00) graphiclunarkid: (There may be an added dimension to this now, with the recent controversy to do with github and sexism, which I intend to raise on the mailing list shortly).
(21:44:23) graphiclunarkid: Other bits and pieces can go out on the mailing list I think - so unless anyone has anything else they want to get into the meeting logs I'll take a cut here.