
Parsing data files with Perl script and Java

$100-500 USD

Cancelled
Posted more than 12 years ago

Paid on delivery
You are given one file containing 10,000-20,000 URLs ([login to view URL]) and another file with the same number of lines containing the corresponding human-judged category labels ([login to view URL]). You need to run the Liblinear package, a support vector machine implementation, on this data.

Ideally we want this task done in Perl, so that our tests are easy to run from the command line. We also want you to be proficient in Java, because all our production systems run in Java and the optimally trained webpage classifier will eventually be pushed into production.

## Deliverables

1. Parse the URLs file and create another file with the text dump of each URL, with newline characters removed, so that the new file contains the text of each URL on the corresponding line (ideally all page text with HTML tags removed). Your Perl script should be easy to modify so that terms (words) can be treated differently depending on the webpage section they appear in. Let's call this file [login to view URL]
2. Do one pass over the text data and assign unique, contiguous integers from 1 to N to each unique word that appears. Store this mapping in a separate file, one entry per line, in the format <integer><tab separator><word>
3. Do a pass over the URLs file and assign unique integers N+1 to N+M to each unique domain (e.g. the domain for [login to view URL] would be yahoo.com). You can simply store these in the format <integer><tab separator><domain>
4. Do a pass over the URLs file and assign unique integers N+M+1 to N+M+P to each unique sub-domain (e.g. the sub-domain for [login to view URL] would be movie.yahoo.com). You can simply store these in the format <integer><tab separator><sub-domain>
5. Do one pass over [login to view URL] and build a unique mapping from 1 to K, where K is the number of distinct labels in the file.

Now load the mappings from (2), (3), (4) and (5) into main memory and do one more pass over [login to view URL] and [login to view URL] to prepare a file [login to view URL] that can be accepted directly by the SVM software: [login to view URL]~cjlin/liblinear/

The data format is: <label> <feature1:value1> ... <feature_i:value_i>

For the label, simply print the integer that corresponds to it in the label mapping. To create the rest of the "feature vector", do the following:

1. Populate a HashMap of features and values. For the features from steps 3 and 4 (domain and sub-domain), the value is simply 1.0 for the URL's own domain and sub-domain. For the word features, the value is the number of times the word occurred.
2. Print the feature vector to the file. Liblinear requires features to be printed in increasing order of feature id, so make sure you write out the HashMap contents in that order. The feature vector needs to be sparse, so iterate only over words that actually occur in this URL's page.
3. Run Liblinear in default 5-fold cross-validation mode and report the accuracy of this method to us.

----------------------------------

Once this is done, compare the following variants and report the cross-validation accuracy for each:

1. Normalize the whole feature vector: let x be the sum of squares of all the feature values for a line, and divide each value by sqrt(x) before printing to the file.
2. Normalize only the word part of the feature vector: let x be the sum of squares of all the word feature values (i.e. those from step 2) for a line, and divide each word feature value by sqrt(x) before printing to the file.
3. There is probably a convergence parameter in the SVM. See if you can get better results by reducing its value to 1/10 of the default.
4. The default value of the SVM C parameter is 1. See if there is any improvement if you set it to 0.1 and to 10.

Estimated time: 20-30 hours (including all testing)
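To illustrate steps 3 and 4, here is a minimal Java sketch of pulling a domain and sub-domain out of a URL. It rests on two assumptions that are not in the brief: the sub-domain is taken to be the full host name, and the domain is taken to be the last two dot-separated labels of the host. That heuristic matches the yahoo.com example but would give wrong answers for hosts such as example.co.uk or for raw IP addresses. The class name `UrlFeatures` and the URL in the usage note are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class UrlFeatures {
    /** Returns the lower-cased host name, or null if the URL cannot be parsed or has no host. */
    static String subDomain(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Assumption: the domain is the last two labels of the host name. */
    static String domain(String url) {
        String host = subDomain(url);
        if (host == null) {
            return null;
        }
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }
}
```

For a hypothetical URL such as `http://movie.yahoo.com/some/page`, `subDomain` gives `movie.yahoo.com` and `domain` gives `yahoo.com`. A production version would probably consult a public-suffix list instead of counting labels.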
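The id assignment in steps 2 to 4 and the sorted sparse output line could be sketched in Java roughly as follows. This is only an illustration under assumptions: words are split on whitespace, the id maps are kept in memory rather than loaded from the tab-separated mapping files, and a `TreeMap` is used in place of the HashMap mentioned above so that features come out in increasing id order without a separate sort. `SparseLine` and `IdMap` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class SparseLine {
    /** Assigns contiguous ids offset+1, offset+2, ... in first-seen order. */
    static final class IdMap {
        private final Map<String, Integer> ids = new LinkedHashMap<>();
        private final int offset;

        IdMap(int offset) {
            this.offset = offset;
        }

        int idFor(String key) {
            Integer id = ids.get(key);
            if (id == null) {
                id = offset + ids.size() + 1;
                ids.put(key, id);
            }
            return id;
        }

        Integer lookup(String key) {
            return ids.get(key);
        }

        int size() {
            return ids.size();
        }
    }

    /** Word features carry occurrence counts; domain and sub-domain features carry 1.0. */
    static TreeMap<Integer, Double> features(String pageText, String domain, String subDomain,
                                             IdMap words, IdMap domains, IdMap subDomains) {
        TreeMap<Integer, Double> fv = new TreeMap<>();
        for (String w : pageText.trim().split("\\s+")) {
            Integer id = w.isEmpty() ? null : words.lookup(w);
            if (id != null) {
                fv.merge(id, 1.0, Double::sum);
            }
        }
        Integer d = domains.lookup(domain);
        if (d != null) {
            fv.put(d, 1.0);
        }
        Integer s = subDomains.lookup(subDomain);
        if (s != null) {
            fv.put(s, 1.0);
        }
        return fv;
    }

    /** Formats "label id:value id:value ..." with ids in increasing order. */
    static String format(int label, TreeMap<Integer, Double> fv) {
        StringBuilder sb = new StringBuilder().append(label);
        for (Map.Entry<Integer, Double> e : fv.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

The domain map would be created with offset N (the word map's size) and the sub-domain map with offset N+M, which keeps the three id ranges contiguous and non-overlapping.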
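Variants 1 and 2 differ only in which feature ids take part in the sum of squares, so one helper can cover both. The sketch below assumes word features occupy ids 1 to N, as in step 2, and that the feature vector is held in a `TreeMap`. `L2Normalize` and its method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class L2Normalize {
    /**
     * Returns a copy in which every feature with id <= maxId is divided by
     * sqrt(sum of squares of those features). Features above maxId are unchanged.
     */
    static TreeMap<Integer, Double> normalizeUpTo(TreeMap<Integer, Double> fv, int maxId) {
        double sumSq = 0.0;
        for (Map.Entry<Integer, Double> e : fv.entrySet()) {
            if (e.getKey() <= maxId) {
                sumSq += e.getValue() * e.getValue();
            }
        }
        TreeMap<Integer, Double> out = new TreeMap<>(fv);
        if (sumSq == 0.0) {
            return out; // nothing to scale, and avoids dividing by zero
        }
        double norm = Math.sqrt(sumSq);
        for (Map.Entry<Integer, Double> e : out.entrySet()) {
            if (e.getKey() <= maxId) {
                e.setValue(e.getValue() / norm);
            }
        }
        return out;
    }

    /** Variant 1: normalize the whole vector. */
    static TreeMap<Integer, Double> normalizeAll(TreeMap<Integer, Double> fv) {
        return normalizeUpTo(fv, Integer.MAX_VALUE);
    }
}
```

Variant 2 is then `normalizeUpTo(fv, N)`, which leaves the domain and sub-domain values at 1.0.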
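For the cross-validation runs, the command lines for the C and convergence variants could be assembled as in the following sketch. It assumes the LIBLINEAR `train` executable is on the PATH and relies on its `-v` (number of folds), `-c` (cost) and `-e` (stopping tolerance) options. The default tolerance depends on the solver, so it is passed in rather than hard-coded. `TrainCommand` is a hypothetical name, and so is the data file name in the usage note.

```java
import java.util.ArrayList;
import java.util.List;

final class TrainCommand {
    /**
     * Builds "train -v 5 [-c cost] [-e epsilon] dataFile".
     * A null cost or epsilon leaves that option at the tool's default.
     */
    static List<String> crossValidation(String dataFile, Double cost, Double epsilon) {
        List<String> cmd = new ArrayList<>();
        cmd.add("train"); // assumption: the LIBLINEAR binary is on the PATH
        cmd.add("-v");
        cmd.add("5");
        if (cost != null) {
            cmd.add("-c");
            cmd.add(String.valueOf(cost));
        }
        if (epsilon != null) {
            cmd.add("-e");
            cmd.add(String.valueOf(epsilon));
        }
        cmd.add(dataFile);
        return cmd;
    }
}
```

A run could then be started with `new ProcessBuilder(TrainCommand.crossValidation("features.txt", 0.1, null)).inheritIO().start()`, with the cross-validation accuracy read from the tool's output.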
Project ID: 2703108

About the project

6 proposals
Remote project
Active 12 years ago

6 freelancers bid an average of $381 USD for this job

See private message.
$425 USD in 5 days
4.8 (58 reviews)
6.6

See private message.
$400.35 USD in 5 days
5.0 (76 reviews)
6.1

See private message.
$400.35 USD in 5 days
5.0 (112 reviews)
6.0

See private message.
$425 USD in 5 days
4.3 (61 reviews)
6.1

See private message.
$272 USD in 5 days
5.0 (106 reviews)
5.9

See private message.
$361.25 USD in 5 days
5.0 (41 reviews)
5.8

About the client

Mountain View, United States
5.0
230
Member since Apr 12, 2008

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)