Algorithm to extract capitalised phrases from texts in MySQL and ASP/PHP
$100-500 USD
Paid on delivery
## Deliverables
Hi there,
I have a website which displays a collection of text articles of around 400 words in length uploaded by users. I require a specific algorithm to analyse these texts and extract important keywords/phrases to be displayed in a tag cloud or list of keywords.
The specific algorithm I wish to use is the 'Capitalized Phrases' algorithm used by [url removed, login to view] described at [url removed, login to view]:
*Capitalized Phrases, or "CAPs", are people, places, events, or important topics mentioned frequently in a book. Along with our Statistically Improbable Phrases, Capitalized Phrases give you a quick glimpse into a book's contents.*
*Click on a Capitalized Phrase to view a list of books in which the phrase occurs. You can also view a list of references to the Capitalized Phrase in each book.*
*For example, if you're looking at a Sherlock Holmes mystery, you can click on "Professor Moriarty" to see a list of books that feature or mention Holmes's nemesis. You can then browse a few pages from the books or click on the [url removed, login to view] search link to read more about him.*
E.g., in the sample text ([url removed, login to view]) there is a sentence:
*On a vacation to the Krak des Chevaliers and Palmyra in the Syrian Desert, I witnessed the rich culture of the Middle Eastern people.*
The algorithm should extract the phrases *Krak des Chevaliers*, *Syrian Desert* and *Middle Eastern*, and ignore the remaining capitalised words, e.g. *Palmyra*, *On* and *I*.
The algorithm should apply common-sense English rules when searching for matches. For example, it should match phrases that contain non-capitalised words in the middle, e.g. *Krak des Chevaliers*, but it should not merge two separate capitalised phrases into one, e.g. *Syrian Desert* and *Middle Eastern* are two phrases, not one. It should also not match phrases across commas or full stops (this is not an exhaustive list). I appreciate it may not be possible to make the algorithm perfect, so when bidding please explain any potential problems you see here.
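To make these rules concrete, here is a minimal sketch of one possible reading of them, written in Python for brevity (the logic would need porting to PHP or ASP for this project). It is not the original site's algorithm. The assumptions are mine: a phrase needs at least two capitalised words, at most one lowercase connector word may sit between two capitalised words, the `CONNECTORS` list is a guess, and any punctuation ends a phrase. The names `extract_caps`, `CONNECTORS` and `TOKEN` are hypothetical.

```python
import re

# Assumed list of lowercase words allowed inside a phrase (a guess, not exhaustive).
CONNECTORS = {"a", "de", "des", "du", "in", "of"}

# A token is either a word (ASCII letters, apostrophes, hyphens) or one other
# non-space character; punctuation and digits therefore become their own tokens.
TOKEN = re.compile(r"[A-Za-z][A-Za-z'’-]*|[^\sA-Za-z]")


def extract_caps(text):
    """Return capitalised phrases found in text, in order of appearance."""
    phrases = []
    run = []  # tokens of the phrase currently being built

    def flush():
        # Drop a trailing connector, then keep the run only if it
        # contains at least two capitalised words.
        while run and not run[-1][0].isupper():
            run.pop()
        if sum(1 for t in run if t[0].isupper()) >= 2:
            phrases.append(" ".join(run))
        run.clear()

    for tok in TOKEN.findall(text):
        if tok[0].isupper():
            run.append(tok)
        elif run and tok in CONNECTORS and run[-1][0].isupper():
            # A single connector directly after a capitalised word.
            run.append(tok)
        else:
            # Any other word, or any punctuation, ends the phrase.
            flush()
    flush()
    return phrases


sample = ("On a vacation to the Krak des Chevaliers and Palmyra in the "
          "Syrian Desert, I witnessed the rich culture of the Middle "
          "Eastern people.")
print(extract_caps(sample))
# ['Krak des Chevaliers', 'Syrian Desert', 'Middle Eastern']
```

Known gaps in this sketch: a sentence-initial word followed by another capitalised word would be merged into the phrase, accented letters are treated as punctuation, and phrases with two connectors in a row (e.g. "of the") are split.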
I have a set of 900 texts which I will provide to selected bidders, and to the winning bidder, so that you can test any algorithms you develop. I will judge the success of the algorithm against these texts, but it must also allow for new texts being added.
The texts are stored in a MySQL database which your algorithm will need to integrate with. The texts are stored in one table, and as your algorithm is run it will need to populate another table with the phrases it finds. There will then be a third table which links both tables via their primary keys to see which texts match which phrases.
The code which runs the algorithm may be in ASP, ASP.NET or PHP and must run under IIS on Windows shared hosting (so no other libraries may be installed on the server).
The code should consist of two complete scripts (plus any include files required by these scripts): one which populates the two database tables with the phrases and the lookups to their associated texts for all texts, and one which does this for a single text in the database.
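The three-table layout and the two entry points described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL, and every table, column and function name (`texts`, `phrases`, `text_phrases`, `index_text`, `index_all`) is hypothetical. The `extract` argument is any function that takes a text and returns a list of phrases. Note that `INSERT OR IGNORE` is SQLite syntax; MySQL spells it `INSERT IGNORE`.

```python
import sqlite3

# Hypothetical schema: texts, phrases, and a link table joining their primary keys.
SCHEMA = """
CREATE TABLE IF NOT EXISTS texts (
    id INTEGER PRIMARY KEY,
    body TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS phrases (
    id INTEGER PRIMARY KEY,
    phrase TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS text_phrases (
    text_id INTEGER NOT NULL REFERENCES texts(id),
    phrase_id INTEGER NOT NULL REFERENCES phrases(id),
    PRIMARY KEY (text_id, phrase_id)
);
"""


def index_text(conn, text_id, extract):
    """Index one text; returns the number of distinct phrases linked to it."""
    row = conn.execute(
        "SELECT body FROM texts WHERE id = ?", (text_id,)
    ).fetchone()
    if row is None:
        return 0
    # Remove old links first so the script can be re-run safely.
    conn.execute("DELETE FROM text_phrases WHERE text_id = ?", (text_id,))
    found = set(extract(row[0]))
    for phrase in found:
        conn.execute(
            "INSERT OR IGNORE INTO phrases (phrase) VALUES (?)", (phrase,)
        )
        phrase_id = conn.execute(
            "SELECT id FROM phrases WHERE phrase = ?", (phrase,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO text_phrases (text_id, phrase_id) "
            "VALUES (?, ?)",
            (text_id, phrase_id),
        )
    conn.commit()
    return len(found)


def index_all(conn, extract):
    """Index every text; returns the total number of text-phrase links."""
    ids = [r[0] for r in conn.execute("SELECT id FROM texts")]
    return sum(index_text(conn, text_id, extract) for text_id in ids)
```

Having the bulk script call the single-text routine in a loop keeps the two scripts consistent, and deleting a text's old links before re-inserting means fine-tuning the algorithm later does not leave stale matches behind.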
Interested bidders should submit a basic outline of how their algorithm will work in pseudocode (or one of the languages specified above) and must have demonstrable previous experience of text processing algorithms.
The winning bidder should also expect to spend some time fine-tuning the algorithm based on my feedback.
**Additional:**
After some discussions with a bidder, I have seen the need for some sort of quantitative analysis of the algorithms produced, so that the winning bidder's algorithm can be considered complete by a third party if required and the coder can estimate how much feedback may be required.
The quantitative criterion the algorithm must meet is to be at least 95% accurate on a number of texts (up to 20) that I select. Accuracy will be determined by how closely the capitalised phrases returned by the algorithm for each text match those selected by myself.
E.g. for the text provided in [url removed, login to view] I note 19 capitalised phrases:
Krak des Chevaliers
Syrian Desert
Middle Eastern
Political Science
Caicos Islands
Northern Caribbean
Magistrate's Court
Court Magistrate
Immigration Law
Labour Tribunal
Business Studies
English Literature
Fostering a Green Culture
Minister of Natural Resources
Deputy Head Boy
House Captain
Senior Editor
Current Events Editor
Top Student in Year
If your algorithm misses one of these phrases, or only partially matches it, this will be considered a non-match. If a phrase not listed here is matched, this will also be considered a non-match. Out of 19 capitalised phrases, 1 non-match would be allowed.
Please note I may be happy to discount a non-match if it makes sense, e.g. "Top Student in Year 12" instead of "Top Student in Year".
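The acceptance check above can be sketched as a small scoring function. This is my own interpretation, not a definition supplied with the project: a non-match is a missed phrase or an unlisted phrase, and a partial match (one phrase containing the other) is counted once rather than twice. The name `score` and the shape of its result are hypothetical. Note that 18 out of 19 is about 94.7%, so the stated allowance of one non-match appears to round the 95% target; the sketch therefore reports the counts and leaves the pass/fail decision to the caller.

```python
def score(expected, returned):
    """Compare returned phrases with the hand-picked list for one text."""
    expected_set, returned_set = set(expected), set(returned)
    missed = expected_set - returned_set   # listed but not found
    extra = returned_set - expected_set    # found but not listed
    # Pair each missed phrase with an extra phrase that overlaps it,
    # so a partial match counts as one non-match instead of two.
    partial = {m: e for m in missed for e in extra if m in e or e in m}
    unpaired_extra = [e for e in extra if e not in partial.values()]
    non_matches = len(missed) + len(unpaired_extra)
    if expected_set:
        accuracy = max(0.0, 1 - non_matches / len(expected_set))
    else:
        accuracy = 0.0
    return {
        "missed": sorted(missed),
        "extra": sorted(extra),
        "partial": partial,
        "non_matches": non_matches,
        "accuracy": accuracy,
    }


expected = ["Syrian Desert", "Middle Eastern", "Top Student in Year"]
returned = ["Syrian Desert", "Middle Eastern", "Top Student in Year 12"]
result = score(expected, returned)
print(result["non_matches"])  # 1
```

Listing the partial matches separately makes it easy to review the cases that may be discounted by hand, such as "Top Student in Year 12".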
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
* * *
This broadcast message was sent to all bidders on Sunday Jan 3, 2010 7:20:17 AM:
Dear bidders, I have added additional clarifications to the project description. Please review these and update your bids accordingly. Kind regards, Tom
## Platform
MySQL and ASP Classic/[url removed, login to view]
Project ID: #3056647