We have now indexed 380,000 videos. It took a bit of time to get to 380K, since we moved our database onto Oracle RAC and also modified our indexing to enable "phrase matching": a query in quotes, e.g. "sex with milf", will find that phrase anywhere in the title, video page text or file URL. We also now support stemming, so words like 'fucks' also match 'fuck'.
We have also implemented a bag feature, which lets users add search result items and come back to them later. To use it, just click on the icon next to the title and the item will be placed into your local porn bag. To view your items, just click on the link in the top right-hand corner, mybag(5).
We have sped up our indexing, so we should hit the 1/2 million mark by Friday 20th Feb 09.
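For anyone curious how phrase matching and stemming fit together, here is a minimal Python sketch. It is not our production indexer (that one is C++); the function names, the sample documents and the crude suffix-stripping rule are all assumptions made up for illustration. The idea is a positional index: each stemmed word maps to the documents and word positions where it occurs, and a phrase matches when its words appear at consecutive positions.

```python
import re
from collections import defaultdict


def stem(word):
    """Toy stemmer (assumption): strips a trailing 'ing' or 's' only."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and stem every token."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Map each stemmed term to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase terms consecutively."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample data: title, page text and URL would be concatenated.
docs = {
    1: "Beach holiday video clips",
    2: "Holiday clips from the beach",
    3: "Video of a beach",
}
index = build_index(docs)
```

With this sample data, `phrase_search(index, "beach holiday")` matches only document 1, while `phrase_search(index, "holiday clip")` matches document 2, because "clips" is stemmed to "clip" before indexing.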
Monday, 16 February 2009
Thursday, 12 February 2009
Another site added
After making a few changes to the search indexing, we are in the process of loading up harporn. It's not a great site, as a large number of the videos are only a few minutes long; however, since you can sort on duration, this should let you skip these if required.
We have changed the indexing to provide the most relevant results based on title, URL, description and any additional tags and text found on the video's page, so you can search using terms and phrases, e.g. "big boobed milf" will return the most relevant matches first, then any results containing only some of the words. Like Google, once you get past a few pages, the relevancy of your results will diminish.
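To give a rough feel for this kind of ranking, here is a small Python sketch. The field weights, the `rank` function and the sample records are assumptions for illustration only, not our actual scoring. Results that match more of the query words come first, and ties are broken by a weighted count of where the words were found.

```python
import re

# Assumed weights: a hit in the title counts more than a hit in the URL.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "description": 1.5, "url": 1.0}


def tokens(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(docs, query):
    """Order doc ids by (distinct query terms matched, weighted hit count)."""
    terms = tokens(query)
    results = []
    for doc in docs:
        matched = set()
        weighted = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            words = tokens(doc.get(field, ""))
            for term in terms:
                count = words.count(term)
                if count:
                    matched.add(term)
                    weighted += weight * count
        if matched:  # documents with no matching term are left out
            results.append((len(matched), weighted, doc["id"]))
    # Most terms matched first, then highest weighted score, then id.
    results.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [doc_id for _, _, doc_id in results]


# Hypothetical records standing in for indexed videos.
videos = [
    {"id": "a", "title": "Red sports car review",
     "url": "example.com/red-sports-car",
     "description": "A review of a red car", "tags": "cars"},
    {"id": "b", "title": "Sports news", "url": "example.com/news",
     "description": "Daily sports", "tags": "news"},
    {"id": "c", "title": "Cooking pasta", "url": "example.com/pasta",
     "description": "", "tags": "food"},
]
```

Here `rank(videos, "red sports car")` puts record "a" (all three words) ahead of record "b" (one word) and drops record "c" entirely.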
Next on the list is freudbox
Once we have passed the 1/2 million video mark, we aim to work on our gallery and text crawler so we can start indexing all types of porn. If you have any suggestions or comments, please add 'em here.
Tuesday, 10 February 2009
Sunday, 8 February 2009
Announcing PorPop
Well, I have kind of had this project on the go for around 10 months, and finally decided to make it available for everyone. After working as a C#/C++ AP at a hedge fund for the last 8 years on credit derivatives, the recent demise of work in this area has given me some time to develop PorPop.com.
The idea is to apply some of the algorithmic principles used in search engines (knew my math degree would come in useful again) to adult/porn sites. With the advent of YouTube and its porn equivalent YouPorn, video sites are the most logical place to start. I have built 2 crawlers/web spiders: the first is a customized, parameter-driven crawler for known sites, which uses prepared regexes to pull specific details from target pages; the second is a generic crawler, which reads robots.txt files, scans HTML links, works out backlinks and calculates a weighted score based on relevance. All the data is written back into a DB for indexing.
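As a rough illustration of the generic crawler's job, here is a Python sketch using only the standard library. It checks each page against robots.txt rules, pulls out the links, and counts how many allowed pages link to each target. The `LinkParser` class, the `count_backlinks` function, the "example-bot" agent name and the sample pages are all hypothetical, the pages are held in memory rather than fetched, and a plain count stands in for the weighted relevance score, which is not described here.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def count_backlinks(pages, robots_txt, agent="example-bot"):
    """Count inbound links per URL, skipping pages robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    backlinks = Counter()
    for url, html in pages.items():
        if not robots.can_fetch(agent, url):
            continue  # respect the site's crawl rules
        parser = LinkParser(url)
        parser.feed(html)
        # Count each linking page once per target and ignore self-links.
        for target in set(parser.links):
            if target != url:
                backlinks[target] += 1
    return backlinks


# Hypothetical in-memory site; a real crawler would fetch these over HTTP.
robots_txt = "User-agent: *\nDisallow: /private/"
pages = {
    "http://site.example/a.html": '<a href="/c.html">c</a> <a href="b.html">b</a>',
    "http://site.example/b.html": '<a href="/c.html">c</a>',
    "http://site.example/private/x.html": '<a href="/c.html">c</a>',
}
backlinks = count_backlinks(pages, robots_txt)
```

In this sample, c.html ends up with two backlinks (from a.html and b.html), because the page under /private/ is skipped.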
The indexing component is written in C++; it creates a binary representation of the text found in the pages and builds a fast index. I have tested this on 100 GB of compressed binary pages (around 60K web sites), and it returns a set of matches in less than 100/sec. Being scalable through round-robin parallel processing, it should be able to cope with far larger volumes, should PorPop become popular.
Anyway, I started indexing on Thursday 5th Feb and, as I am writing this, have indexed 170,000 videos from around 7 sites. It's been a bit slow to begin with since I am babysitting the process; however, once I can productionize the code, I aim to have around 100 threads working and an indexing capacity of 20,000 items/hour.
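One way to read "round-robin parallel processing" is sketched below in Python. This is an interpretation, not the actual C++ design. Documents are dealt out to shards in turn, each shard is searched by its own worker, and the partial results are merged. The function names and the sample documents are assumptions, and the per-shard search is a naive word scan rather than a real index lookup.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(docs, n_shards):
    """Deal documents out to shards in round-robin order."""
    shards = [[] for _ in range(n_shards)]
    for i, doc in enumerate(docs):
        shards[i % n_shards].append(doc)
    return shards


def search_shard(shard, term):
    """Naive per-shard search: ids of documents containing the word."""
    return [doc_id for doc_id, text in shard if term in text.lower().split()]


def parallel_search(shards, term):
    """Query every shard concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = list(pool.map(search_shard, shards, [term] * len(shards)))
    return sorted(hit for part in parts for hit in part)


# Hypothetical documents as (id, text) pairs.
docs = [
    (1, "alpha beta"),
    (2, "beta gamma"),
    (3, "gamma delta"),
    (4, "beta delta"),
    (5, "alpha"),
]
shards = make_shards(docs, 3)
```

With three shards, documents 1 and 4 land in the first shard, 2 and 5 in the second and 3 in the third. Adding capacity then means adding shards, since each worker only ever scans its own slice.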
Any comments or suggestions?