Algorithm to extract capitalised phrases from texts in MySQL and ASP/PHP

Closed. Posted Jan 1, 2010. Paid on delivery.


## Deliverables

Hi there,

I have a website which displays a collection of text articles of around 400 words in length uploaded by users. I require a specific algorithm to analyse these texts and extract important keywords/phrases to be displayed in a tag cloud or list of keywords.

The specific algorithm I wish to use is the 'Capitalized Phrases' algorithm used by [url removed, login to view] described at [url removed, login to view]:

*Capitalized Phrases, or "CAPs", are people, places, events, or important topics mentioned frequently in a book. Along with our Statistically Improbable Phrases, Capitalized Phrases give you a quick glimpse into a book's contents.*

*Click on a Capitalized Phrase to view a list of books in which the phrase occurs. You can also view a list of references to the Capitalized Phrase in each book.*

*For example, if you're looking at a Sherlock Holmes mystery, you can click on "Professor Moriarty" to see a list of books that feature or mention Holmes's nemesis. You can then browse a few pages from the books or click on the [url removed, login to view] search link to read more about him.*

E.g. in the sample text ([url removed, login to view]) there is a sentence:

*On a vacation to the Krak des Chevaliers and Palmyra in the Syrian Desert, I witnessed the rich culture of the Middle Eastern people.*

The algorithm should extract the phrases *Krak des Chevaliers*, *Syrian Desert* and *Middle Eastern*, and ignore the rest of the capitalised words, e.g. *Palmyra*, *On*, *I*.

The algorithm should apply common-sense English rules when searching for matches. For example, it should match phrases which contain non-capitalised words in the middle of them, e.g. *Krak des Chevaliers*, but it should not join two separate capitalised phrases together, e.g. *Syrian Desert* and *Middle Eastern* are two phrases, not one. It should also not match phrases across commas or full stops (this is not an exhaustive list). I appreciate it may not be possible to make the algorithm perfect, so when bidding please explain any potential problems you see here.

I have a set of 900 texts, which I will provide to selected bidders and to the winning bidder so that you can test any algorithms you develop. I will judge the success of the algorithm against these texts, but it will also need to handle new texts as they are added.

The texts are stored in a MySQL database which your algorithm will need to integrate with. The texts are stored in one table, and as your algorithm is run it will need to populate another table with the phrases it finds. There will then be a third table which links both tables via their primary keys to see which texts match which phrases.
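The three-table layout described above might look roughly like the sketch below. It uses Python with an in-memory SQLite database as a stand-in for MySQL so that it runs without a server. All table names, column names and function names are hypothetical, and `INSERT OR IGNORE` is SQLite syntax (MySQL's closest equivalent is `INSERT IGNORE`).

```python
import sqlite3

# Hypothetical schema: one table of texts, one of phrases, one linking the two.
SCHEMA = """
CREATE TABLE texts (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT NOT NULL UNIQUE);
CREATE TABLE text_phrases (
    text_id   INTEGER NOT NULL REFERENCES texts(id),
    phrase_id INTEGER NOT NULL REFERENCES phrases(id),
    PRIMARY KEY (text_id, phrase_id)
);
"""

def store_phrases(conn, text_id, phrases):
    """Record each phrase once and link it to the given text."""
    for phrase in phrases:
        conn.execute("INSERT OR IGNORE INTO phrases (phrase) VALUES (?)", (phrase,))
        (phrase_id,) = conn.execute(
            "SELECT id FROM phrases WHERE phrase = ?", (phrase,)).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO text_phrases (text_id, phrase_id) VALUES (?, ?)",
            (text_id, phrase_id))
    conn.commit()

def texts_for_phrase(conn, phrase):
    """Return the ids of all texts linked to a phrase."""
    rows = conn.execute(
        "SELECT tp.text_id FROM text_phrases tp "
        "JOIN phrases p ON p.id = tp.phrase_id "
        "WHERE p.phrase = ? ORDER BY tp.text_id", (phrase,)).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO texts (id, body) VALUES (1, 'first sample text')")
conn.execute("INSERT INTO texts (id, body) VALUES (2, 'second sample text')")
store_phrases(conn, 1, ["Syrian Desert", "Middle Eastern"])
store_phrases(conn, 2, ["Syrian Desert"])
print(texts_for_phrase(conn, "Syrian Desert"))
```

Keeping each phrase unique in its own table means a tag cloud can be built by counting rows in the link table, and re-running the process for a single text does not create duplicate phrases.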

The code which runs the algorithm may be in ASP, ASP.NET or PHP and must run under IIS on Windows shared hosting (so no other libraries may be installed on the server).

The code should consist of two complete scripts (plus any include files required by these scripts): one which can be run to populate the two database tables with the phrases and the lookups to the associated texts, and one which will do this for a single text in the database.

Interested bidders should submit a basic idea of how their algorithm will work in pseudocode (or one of the languages specified above) and must have demonstrable previous experience of text processing algorithms.

The winning bidder should also expect some time to be required for fine-tuning the algorithm based on my feedback.

**Additional:**

After some discussions with a bidder, I have seen the need for some sort of quantitative analysis of the algorithms produced, so that the winning bidder's algorithm can be considered complete by a third party if required and the coder can estimate how much feedback may be required.

The quantitative criterion the algorithm must meet is to be at least 95% accurate on a number of texts (up to 20) that I select. Accuracy will be measured by how closely the capitalised phrases the algorithm returns for each text match those I have selected myself.

E.g. for the text provided in [url removed, login to view] I note 19 capitalised phrases:

Krak des Chevaliers

Syrian Desert

Middle Eastern

Political Science

Caicos Islands

Northern Caribbean

Magistrate's Court

Court Magistrate

Immigration Law

Labour Tribunal

Business Studies

English Literature

Fostering a Green Culture

Minister of Natural Resources

Deputy Head Boy

House Captain

Senior Editor

Current Events Editor

Top Student in Year

If your algorithm misses one of these phrases or only partially matches it, that will be considered a non-match. If a phrase not listed here is matched, that will also be considered a non-match. Out of 19 capitalised phrases, 1 non-match would be allowed.

Please note I may be happy to discount a non-match if it makes sense, e.g. "Top Student in Year 12" instead of "Top Student in Year".
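As an illustration of how the accuracy check above could be automated, here is a minimal Python sketch. The function name and the exact formula are assumptions: it counts every missed phrase and every unlisted phrase as one non-match each, which is only one possible reading of the rule.

```python
# The 19 phrases listed above for the sample text.
EXPECTED = [
    "Krak des Chevaliers", "Syrian Desert", "Middle Eastern", "Political Science",
    "Caicos Islands", "Northern Caribbean", "Magistrate's Court", "Court Magistrate",
    "Immigration Law", "Labour Tribunal", "Business Studies", "English Literature",
    "Fostering a Green Culture", "Minister of Natural Resources", "Deputy Head Boy",
    "House Captain", "Senior Editor", "Current Events Editor", "Top Student in Year",
]

def score(expected, found):
    """Return (accuracy, missed, extra) for one text.

    Assumed formula: accuracy = (expected - non_matches) / expected, where a
    non-match is any expected phrase not found or any found phrase not expected.
    """
    expected_set, found_set = set(expected), set(found)
    missed = sorted(expected_set - found_set)
    extra = sorted(found_set - expected_set)
    non_matches = len(missed) + len(extra)
    accuracy = max(0.0, (len(expected_set) - non_matches) / len(expected_set))
    return accuracy, missed, extra

# Example: the algorithm finds everything except the last phrase.
found = [p for p in EXPECTED if p != "Top Student in Year"]
accuracy, missed, extra = score(EXPECTED, found)
print(f"{accuracy:.1%}", missed, extra)
```

Note that one non-match out of 19 works out to 18/19, or about 94.7%, which is marginally below 95%, so the explicit "1 non-match allowed" rule presumably governs for this text. Under this reading a partial match such as "Top Student in Year 12" would count twice (once as missed, once as extra) unless it is discounted by hand.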

1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.

2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):

a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.

b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.

3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).

* * *

This broadcast message was sent to all bidders on Sunday Jan 3, 2010 7:20:17 AM:

Dear bidders, I have added additional clarifications to the project description. Please review these and update your bids accordingly. Kind regards, Tom

## Platform

MySQL and ASP Classic/[url removed, login to view]

.NET, ASP, Engineering, MySQL, PHP, Software Design, Software Testing, Web Hosting, Website Management, Website Testing

Project ID: #3056647

About the project

13 proposals. Online project. Active Jan 23, 2010.

13 freelancers bid an average of $175 for this job

Every proposal reads "See private message."

| Bidder | Bid (USD) | Delivery | Reviews | Rating |
| --- | --- | --- | --- | --- |
| newgroup4u | $212.5 | 15 days | 129 | 7.3 |
| gisterpages | $170 | 15 days | 57 | 6.1 |
| Eliteprog | $85 | 15 days | 27 | 5.4 |
| getitrightvw | $637.5 | 15 days | 31 | 3.9 |
| teodorstv | $85 | 15 days | 15 | 3.8 |
| quickright | $106.25 | 15 days | 12 | 3.6 |
| shahramjaved0075 | $187 | 15 days | 11 | 3.5 |
| sajitvw | $110.5 | 15 days | 7 | 3.5 |
| programmerben | $170 | 15 days | 8 | 3.1 |
| govega | $85 | 15 days | 3 | 1.9 |
| itbtechs | $212.5 | 15 days | 1 | 2.3 |
| twandreih | $127.5 | 15 days | 0 | 0.0 |
| softtotime | $85 | 15 days | 0 | 0.0 |