Search Engine讲解、辅导Python，Java编程设计、讲解c/c++，Java语言讲解R语言程序|解析Haskell程序

UCI Information Retrieval
Project 3 – Search Engine
This assignment is to be done in groups of 1, 2 or 3, preferably the same groups that
were in place for the crawler Project. Although this is presented as one single
project that will take until the end of the quarter to complete, internally it is
organized in 3 separate milestones, each with a specific deadline, deliverables and
score. In doing milestones #1 and #2, make sure to look at the evaluation criteria
not just of those milestones but also of milestone #3 –part of the milestones’
evaluation will be delayed until the final meeting with the TA.
You can use code that you or any classmate wrote for the previous projects. You
cannot use code written for this project by non-group-member classmates. You are
allowed to use any languages and libraries you want for text processing,
including the powerful nltk. However, you are not allowed to use text indexing
libraries such as Lucene, PyLucene, or ElasticSearch.
Goal: Implement a complete search engine.
In order to accommodate the various skill levels of students in this course, this
project comes in two flavors:
(1) Information Analyst. In this flavor, there is some programming involved, but
not much more advanced than what you already did so far. It’s a mixture of the Text
Processing project and stitching things together. You will be using a small subset of
the crawled pages. Groups where ALL students are neither CS nor SE can choose
this option.
(2) Algorithms and Data Structures Developer. In this flavor, not only there is
programming to be done, but your code needs to be able to perform well on the
entire collection of crawled pages, under harsh operating conditions. This option is
available to everyone, but groups that have at least one CS or SE student are
required to do this.
Milestones Overview
Goal Due date Deliverables Contribution
for score
#1 Produce an initial index for
the corpus
11/9 Short report
(no demo)
2.5
#2 Develop a retrieval
component
11/19 Short report
(no demo)
2.5
#3 Complete Search System 12/3 Code +
Demonstration
55
General Specifications
You will develop two separate programs: an indexer and a search component.
Indexer:
Create an inverted index for the given corpus with data structures designed by you.
Tokens: all alphanumeric sequences in the dataset.
Stop words: do not use stopping, i.e. use all words, even the frequently occurring
ones.
Stemming: use stemming for better textual matches. Suggestion: Porter stemming.
Important words: Words in bold, in headings (h1, h2, h3), and in titles should be
treated as more important than the other words.
Search:
Your program should prompt the user for a query. This doesn’t need to be a Web
interface, it can be a console prompt. At the time of the query, your program will
stem the query terms, look up your index, perform some calculations (see ranking
below) and give out the ranked list of pages that are relevant for the query, with the
most relevant on top. Pages should be identified by their URLs.
Ranking: at the very least, your ranking formula should include tf-idf scoring, and
take the important words into consideration, but you should feel free to add
additional components to this formula if you think they improve the retrieval.
Extra Credit:
Extra credit will be given for tasks that improve the quality of the retrieval and the
of the search experience. For example:
• Detect and eliminate duplicate pages. (2 points)
• Implement Page Rank, and use it in your ranking formula. (3 points)
• Implement an additional 2-gram and/or 3-gram indexing and use it during
retrieval. (2 points)
• Enhance the index with word positions and use that information for retrieval.
(2 points)
• Index anchor words for the target pages (1 point).
• Implement a Web or GUI interface instead of a console one. (2 points)
Additional Specifications for Information Analyst
Option available to all non-CS and non-SE students.
Programming skills required: Intro courses
Main challenges: HTML and JSON parsing, read/write structured information
from/to files or databases.
Corpus: a small portion of the ICS web pages (analyst.zip)
Indexer:
You can use a database to store the index, or a simple file – whatever is simpler to
you. If you store it in a file, the index is expected to be sufficiently small, so that it fits
in memory all at once.
Search interface:
The response to search queries should be less than 2 seconds.
This project is a great addition to your résumé!
Tired: “Wrote a Python script that finds words in Web pages.”
Wired: “Wrote a search engine from the ground up that is capable of handling two
thousand Web pages.”
Additional Specifications for Algorithms and Data Structures
Developer
Option available to all students, but required for CS and SE students.
Programming skills required: advanced
Main challenges: design efficient data structures, devise efficient file access,
balance memory usage and response time
Corpus: all ICS web pages (developer.zip)
Index: Your index should be stored in one or more files in the file system (no
databases).
Search interface:
The response to search queries should be less than 300ms. Ideally, close to 100ms,
or less, but you won’t be penalized if it’s higher (as long as it’s under 300ms).
Operational constraints: [UPDATED]
Typically, the cloud servers/containers that run search engines don’t have a lot of
memory, but they need to handle large amounts of data. As such, you must design
and implement your programs as if you are dealing with very large amounts of data,
so large that you cannot hold the inverted index all in memory.
Your indexer must off load the inverted index hash map from main memory to a
partial index on disk at least 3 times during index construction; those partial
indexes should be merged in the end. Optionally, after or during merging, they can
also be split into separate index files with term ranges.
Similarly, your search component must not load the entire inverted index in main
memory. Instead, it must read the postings from the index(es) files on disk.
The TAs will check that both of these things are happening.
This project is a great addition to your résumé!
Tired: “Wrote a Web search engine using ElasticSearch.”
Wired: “Wrote a Web search engine from the ground up that is capable of handling
tens of thousands of Web pages, under harsh operational constriants and having a
query response time under 300ms.”
Milestone #1
Goal: Build an index
Building the inverted index:
Now that you have been provided the HTML files to index, you may build your
inverted index off of them. The inverted index is simply a map with the token as a
key and a list of its corresponding postings. A posting is the representation of the
token’s occurrence in a document. The posting typically (not limited to) contains the
following info (you are encouraged to think of other attributes that you could add to
the index):
• The document name/id the token was found in.
• Its tf-idf score for that document
Some tips:
• When designing your inverted index, you will think about the structure of
your posting first.
• You would normally begin by implementing the code to calculate/fetch the
elements which will constitute your posting.
• Modularize. Use scripts/classes that will perform a function or a set of closely
related functions. This helps in keeping track of your progress, debugging,
and also dividing work amongst teammates if you’re in a group.
• We strongly recommend you use GitHub as a mechanism to work with your
team members on this project.
Deliverables: Submit a report (pdf) to with the following content: a table with
assorted numbers pertaining to your index. It should have, at least the number of
documents, the number of [unique] tokens, and the total size (in KB) of your index
on disk.
Note for the developer option: at this time, you don’t need to have the optimized
index.
Evaluation criteria:
- Did your report show up on time?
- Are the reported numbers plausible?
Milestone #2
Goal: Develop a search and retrieval component
At least the following queries should be used to test your retrieval:
1 – cristina lopes
2 - machine learning
3 – ACM
4 – master of software engineering
Developing the Search component:
Once you have built the inverted index, you are ready to test document retrieval
with queries. At the very least, the documents retrieved should be returned based
on tf-idf scoring. This can be done using the cosine similarity method. Feel free to
use a library to compute cosine similarity once you have the term frequencies and
inverse document frequencies. You may add other weighting/scoring mechanisms
to help refine the search results.
Deliverables: Submit a report (pdf) to with the following content:
o the top 5 URLs for each of the queries above
o a picture of your search interface in action
Note for the developer option: at this time, you don’t need to have the optimized
index.
Evaluation criteria:
- Did your report show up on time?
- Are the reported URLs plausible?
Milestone #3
Goal: complete search engine
During this last stretch, you will improve your search engine in the following way.
Come up with a set of at least 20 queries that guide you in evaluating how well your
search engine performs, both in terms of ranking performance (effectiveness) and in
terms of runtime performance (efficiency). At least half of those queries should be
chosen because they do poorly on one or both criteria; the other half should do well.
Then change your code to make it work better for the queries that perform poorly,
while preserving the good performance of the other ones, and while being as
general as possible.
Deliverables:
• Submit a zip file containing all the programs you wrote for this project, as
well as a document with your test queries (no need to report the results).
Comment on which ones started by doing poorly and explain what you did to
make them perform better.
• A live demonstration of your search engine
Evaluation criteria:
- Does your search engine work as expected of search engines?
- How general are the heuristics that you employed to improve the retrieval?
- Is the search response time under the expected limit?
- Do you demonstrate in-depth knowledge of how your search engine works?
Are you able to answer detailed questions pertaining to any aspect of its
implementation?
Note for the developer option: at the end of the project, you should have the
optimized index that allows you to run both the indexer and the search with small
memory footprint, smaller than the index size.
Understanding the Dataset
In Project 2, your crawlers crawled the many web sites associated with ICS. We
collected a big chunk of these pages and are providing them to you as two zip files:
analyst.zip and developer.zip. The names are self-explanatory: the former is for
the analyst flavor of the project; the latter is for the algorithms and data structures
developer option. The only difference between them is the size: analyst.zip contains
only 3 domains and a little over 2,000 pages, while developer.zip contains all 88
domains found during crawling and a little under 56,000 pages.
The following is an explanation of how the data is organized.
Folders:
There is one folder per domain. Each file inside a folder corresponds to one web
page.
Files:
The files are stored in JSON format, with 2 fields only:
• “url” : contains the URL of the page. (ignore the fragment part, if you see it)
• “content” : contains the content of the page, as found during crawling
Broken or missing HTML:
Real HTML pages found out there are full of bugs! Some of the pages in the dataset
may not contain any HTML at all and, when they do, it may not be well formed. For
example, there might be an open tag but the associated closing
tag might be missing. While selecting the parser library for your project, please
ensure that it can handle broken HTML.