Application for extracting data from [login to view URL]
The goal is to generate a list that is similar to this one:
<[login to view URL]>
The list needs to be deeper and contain more information than is available at xedant.com. The list should be delimited and look like this:
Link to Product | Product Name | Publisher | Publisher Link | Date Added | License Cost | Trial | Downloads | Requirements | category
The list must contain only software that is available for Windows and costs money. No freeware. This means the license field must read either: $[login to view URL] to buy or: Free to try; $29.95 to buy. Trial software is to be included.
The requirements must include Windows.
Explanation of fields
1. Link to product. The link of the page you are on for the product.
2. Product Name. Pretty straightforward, in this case “Spyware Doctor 4”
3. Publisher: Publisher, in this case “PC Tools”
4. Publisher Link: The URL that the Publisher Link takes you to.
5. Date Added: The value of the date added field.
6. License Cost: The dollar value extracted from the license field. In this case the value is $29.95, not “$29.95 to buy”. Dollar values only.
7. Trial: If the license includes the words “Free to Try”, set this field to 1. If it doesn’t have a trial, set it to 0.
8. Downloads: The number of downloads.
9. Requirements: The whole value of the requirements field.
10. category: The [login to view URL] category from the left-hand navigation
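
As an illustration of fields 6, 7 and 9, here is a minimal Python sketch of how the license and requirements text might be turned into field values. The function names (`parse_license`, `is_windows`) are hypothetical, and the sketch assumes the license text looks like the two examples given above; the real page markup may need different handling.

```python
import re
from typing import Optional, Tuple

# Matches a dollar amount such as "$29.95" or "$1,299".
_PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{1,2})?)")


def parse_license(license_text: str) -> Optional[Tuple[str, int]]:
    """Return (license_cost, trial) for paid software, or None for freeware.

    Assumption: a title with no dollar value in its license text is
    freeware and should be skipped.
    """
    match = _PRICE_RE.search(license_text)
    if match is None:
        return None
    cost = "$" + match.group(1)  # dollar value only, without "to buy"
    trial = 1 if "free to try" in license_text.lower() else 0
    return cost, trial


def is_windows(requirements: str) -> bool:
    """True if the requirements text mentions Windows (case-insensitive)."""
    return "windows" in requirements.lower()
```

With these, `parse_license("Free to try; $29.95 to buy")` gives `("$29.95", 1)`, and a record would be kept only when `parse_license` returns a value and `is_windows` is true for its requirements field.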
## Deliverables
The goal is to extract every Windows title from download.com. I am open to suggestions on how best to do this, but I would use the taxonomy on the left-hand side. This is also how the category will be derived. I want every title in every category.
For example: The top category is “Windows Software”. The second is a link called “Audio & Video Software.” When you then click on “Audio Production” you will get a list of available files for download. Right now there are a total of 438 files available for download. All files in all categories under Windows that have a cost to fully license them must be added. Remember that files that are free to try must be added as well.
You would then iterate through every topic and sub-topic in the left-hand navigation, gathering all the files under each.
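
The iteration described above could be sketched roughly as follows in Python. This is only an outline under stated assumptions: `get_children` is a hypothetical callable that returns the sub-categories of a category (in a real spider it would fetch and parse the navigation), and `walk_taxonomy` is an illustrative name.

```python
from typing import Callable, Iterator, List, Tuple


def walk_taxonomy(
    root: str,
    get_children: Callable[[str], List[str]],
) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every leaf category, depth first.

    A leaf is a category for which get_children returns an empty list;
    those are the pages that list the downloadable files.
    """
    stack: List[Tuple[str, ...]] = [(root,)]
    while stack:
        path = stack.pop()
        children = get_children(path[-1])
        if not children:
            yield path
            continue
        # Push in reverse so children are visited in their original order.
        for child in reversed(children):
            stack.append(path + (child,))


# Stand-in data for illustration; "Business Software" is a made-up sibling.
EXAMPLE_TREE = {
    "Windows Software": ["Audio & Video Software", "Business Software"],
    "Audio & Video Software": ["Audio Production"],
}

for category_path in walk_taxonomy(
    "Windows Software", lambda name: EXAMPLE_TREE.get(name, [])
):
    print(" > ".join(category_path))
```

The last element of each path can serve as the category field. A real crawler would probably key on category URLs rather than names, since two branches may contain sub-categories with the same name.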
The information needs to be stored in a plain text file that is pipe delimited and has each record on a separate row.
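
A minimal Python sketch of writing such a file is shown here. The column order follows the field list in this request. Two points are assumptions, not requirements: the separator is a bare `|` without surrounding spaces, and any pipe or line break inside a value is replaced so that it cannot break the row layout. `format_record` and `append_records` are hypothetical names.

```python
from typing import Dict, Iterable

FIELDS = [
    "Link to Product", "Product Name", "Publisher", "Publisher Link",
    "Date Added", "License Cost", "Trial", "Downloads", "Requirements",
    "category",
]


def format_record(record: Dict[str, object]) -> str:
    """Render one product as a single pipe-delimited line (no newline)."""
    cells = []
    for name in FIELDS:
        value = str(record.get(name, ""))
        # Collapse line breaks and runs of whitespace into single spaces,
        # then replace stray pipes (assumption: "/" is an acceptable stand-in).
        value = " ".join(value.split()).replace("|", "/")
        cells.append(value)
    return "|".join(cells)


def append_records(path: str, records: Iterable[Dict[str, object]]) -> None:
    """Append one row per record to the output text file."""
    with open(path, "a", encoding="utf-8", newline="\n") as handle:
        for record in records:
            handle.write(format_record(record) + "\n")
```

Opening the file in append mode lets a resumed run add rows to the file a previous run started.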
The application must be delivered as a standalone application that will simply create a text file with this data in it.
I want to be able to adjust the number of worker processes / threads in the GUI before I start the app. It should run multiple threads and process results in a timely fashion.
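
One possible shape for the adjustable worker count, sketched in Python with the standard library's `concurrent.futures`. Here `fetch` is a hypothetical function that downloads and parses one page, and `workers` stands for the value the GUI would pass in; the GUI itself is not shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def crawl_all(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    workers: int = 4,
) -> Tuple[Dict[str, object], Dict[str, Exception]]:
    """Fetch every URL on a pool of `workers` threads.

    Returns (results, failures), both keyed by URL, so that one bad page
    does not stop the whole run.
    """
    workers = max(1, int(workers))  # guard against 0 or negative GUI input
    results: Dict[str, object] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[url] = exc
    return results, failures
```

Threads suit this job because the work is mostly waiting on the network. The failed URLs can be retried later or written to a log.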
The application should also store its position within the taxonomy in a local file so it can resume where it left off.
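
A rough Python sketch of the resume behaviour, assuming the position is recorded as the set of categories already completed and stored as JSON in a local file. The file format and the names `save_checkpoint`, `load_checkpoint` and `pending` are illustrative choices, not requirements.

```python
import json
import os
from typing import Iterable, List, Set


def save_checkpoint(path: str, done: Iterable[str]) -> None:
    """Write the completed categories to disk.

    The data goes to a temporary file first and is then renamed over the
    real one, so a crash mid-write cannot leave a half-written checkpoint.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(done), handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Set[str]:
    """Return the completed categories, or an empty set on a first run."""
    try:
        with open(path, encoding="utf-8") as handle:
            return set(json.load(handle))
    except FileNotFoundError:
        return set()


def pending(categories: Iterable[str], done: Set[str]) -> List[str]:
    """Categories still to crawl, in their original order."""
    return [name for name in categories if name not in done]
```

On startup the app would call `load_checkpoint`, crawl only what `pending` returns, and call `save_checkpoint` after each category finishes.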
I am open to suggestions if you have a way of making this spider faster and more efficient.
You can also crawl this list:
<[login to view URL]>
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
Windows XP.