Sunday, 8 February 2009

Announcing PorPop

Well, I have had this project on the go for around 10 months, and have finally decided to make it available to everyone. After working as a C#/C++ AP at a hedge fund for the last 8 years on credit derivatives, the recent demise of work in this area has given me some time to develop PorPop.com.

The idea is to apply some of the algorithmic principles used in search engines (I knew my math degree would come in useful again) to adult/porn sites. With the advent of YouTube and its porn equivalent YouPorn, video sites are the most logical place to start. I have built 2 crawlers/web spiders. The first is a customized, parameter-driven crawler for known sites, which uses prepared regexes to pull specific details from target pages. The second is a generic crawler which reads robots.txt files, scans HTML links, works out backlinks and calculates a weighted score based on relevance. All the data is written back into a DB for indexing.
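To make the two crawler styles a bit more concrete, here is a minimal Python sketch of the steps described above. It is not the PorPop code: the regex pattern, the function names and the scoring rule (a target's score is the sum of the relevance of the pages linking to it) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical pattern for one known site; real patterns depend on each site's markup.
TITLE_RE = re.compile(r'<h1 class="title">(?P<title>.*?)</h1>', re.S)


def extract_title(html):
    """Site-specific crawler step: pull one field out with a prepared regex."""
    match = TITLE_RE.search(html)
    return match.group("title").strip() if match else None


def allowed(robots_txt, url, agent="*"):
    """Generic crawler step: honour robots.txt before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class LinkParser(HTMLParser):
    """Collects absolute href targets from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def backlink_scores(pages, relevance):
    """Assumed scoring rule: sum the relevance of every page linking to a target."""
    scores = defaultdict(float)
    for source, html in pages.items():
        parser = LinkParser(source)
        parser.feed(html)
        # Count each source once per target and ignore self-links.
        for target in set(parser.links):
            if target != source:
                scores[target] += relevance.get(source, 0.0)
    return dict(scores)
```

Here `pages` maps a page URL to its HTML and `relevance` maps a page URL to a weight between 0 and 1. How the relevance weights themselves are worked out is left open in this sketch.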

The indexing component is written in C++. It creates a binary representation of the text found in the pages and builds a fast index. I have tested this on 100Gb of compressed binary pages (around 60K web sites) and it will return a set of matches in less than 100/sec. Being completely scalable through round-robin parallel processing, this will be able to cope with infinite volumes, should PorPop become popular.
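For anyone wondering what round-robin scaling looks like for an index, here is a small Python sketch (the real indexer is C++ and uses a binary format that is not shown here). It assumes a plain in-memory inverted index per shard, with documents dealt out to shards in turn and queries fanned out to every shard. The class names `Shard` and `ShardedIndex` are made up for this example.

```python
import re
from collections import defaultdict
from itertools import cycle

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


class Shard:
    """One inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


class ShardedIndex:
    """Deals documents out to shards in round-robin order."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]
        self._next_shard = cycle(self.shards)

    def add(self, doc_id, text):
        next(self._next_shard).add(doc_id, text)

    def search(self, query):
        # Each shard could run on its own machine; here they are queried in a loop.
        hits = set()
        for shard in self.shards:
            hits |= shard.search(query)
        return hits
```

Because each document lives in exactly one shard, adding shards spreads both the storage and the per-query work, and the final step is only a merge of the per-shard result sets.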

Anyway, I started indexing on Thursday 5th Feb and, as I am writing this, have indexed 170,000 videos from around 7 sites. It's been a bit slow to begin with since I am babysitting the process. However, once I can productionize the code, I aim to have around 100 threads working and an indexing capacity of 20,000 items/hour.
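As a quick sanity check on those target numbers, the arithmetic can be written out in a few lines of Python. The function names are just for illustration, and the calculation assumes the work is spread evenly across threads.

```python
def seconds_per_item(threads, items_per_hour):
    """Average time one thread may spend on an item and still hit the target rate."""
    return 3600 * threads / items_per_hour


def hours_to_index(items, items_per_hour):
    """Time needed to index a given number of items at the target rate."""
    return items / items_per_hour


print(seconds_per_item(threads=100, items_per_hour=20_000))  # 18.0
print(hours_to_index(items=170_000, items_per_hour=20_000))  # 8.5
```

So at the target rate each thread has about 18 seconds per item, and the 170,000 videos indexed so far would take roughly 8.5 hours.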

Any comments or suggestions?

4 comments: