
Project 1 – Processing .sgm files


In this project, you will read in and process multiple .sgm files. There are 22 of them in 
the folder and each is named reut2-0??.sgm, where ?? denotes the file number. Each file contains 
hundreds of articles that look like the sample at the end of this document. Note that the articles 
are formatted like an HTML file in that they use tags. Some articles may have missing tags. DO NOT 
try to fix these files or modify them in any other way. 
Parser
In this project, you will create an .sgm parser. Your program will read through each file, find 
each article, and be able to pull out the information between the tags. Below are the specific 
things I would like you to do in your parser:
1. Be able to pull out the words within the topics and places tags and keep count of the 
words. You can combine the counts of the words found in these tags. 
2. Be able to pull out the words within the body tag and keep count of the words. These 
counts should be separated from the counts in part 1. 
3. In addition, keep count of the following things:
a. The number of articles you were able to read successfully (you got the 
places/topics, the body, or both).
b. The number of articles where you were able to pull out the words from the topics tag. 
c. The number of articles where you were able to pull out the words from the places tag.
d. The number of articles where you were able to pull out the words from the body
tag. 
Note that these numbers may not be the same since some tags may be incomplete. If a 
tag is incomplete, do not count its words. For example, if there is an opening tag but no 
matching closing tag, then do not count the words after the opening tag.
4. Write your results to text files: one for the topics and places together, and another for the 
body. Write the counts from part 3 at the top of both files. Name these files appropriately. 
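The "incomplete tag" rule in part 3 can be sketched as a small helper that returns the text between an opening and a closing tag only when both are present. This is a minimal sketch, not a required design: the class name TagExtractor, the method names, and the upper-case tag names (TOPICS, BODY) are assumptions for illustration, so check them against the actual files.

```java
import java.util.ArrayList;
import java.util.List;

class TagExtractor {

    /**
     * Returns the text between <tag> and </tag>, or null when either the
     * opening or the closing tag is missing (an incomplete tag pair).
     */
    static String extractBetween(String article, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = article.indexOf(open);
        if (start < 0) {
            return null; // no opening tag
        }
        int contentStart = start + open.length();
        int end = article.indexOf(close, contentStart);
        if (end < 0) {
            return null; // opening tag without a closing tag: do not count
        }
        return article.substring(contentStart, end);
    }

    /** Splits extracted text into lower-case words, ignoring any nested markup. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        if (text == null) {
            return result;
        }
        // Replace nested tags with spaces so neighbouring entries stay separate.
        String plain = text.replaceAll("<[^>]*>", " ");
        for (String token : plain.trim().split("\\s+")) {
            // Strip punctuation and digits from both ends of the token.
            String word = token.replaceAll("^[^A-Za-z]+|[^A-Za-z]+$", "");
            if (!word.isEmpty()) {
                result.add(word.toLowerCase());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical fragment: the BODY tag is never closed.
        String article = "<TOPICS>cocoa</TOPICS><BODY>Showers continued";
        System.out.println(extractBetween(article, "TOPICS")); // prints cocoa
        System.out.println(extractBetween(article, "BODY"));   // prints null
    }
}
```

A null result is a natural signal for the counters in part 3: increment the topics, places, or body counter only when the corresponding call returns non-null, and count the article as read when at least one of them does.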
Suggestions
Fewer words to count: 
• You may want to look up some of the common words in the English language (you can 
use up to 30 of them) and ignore these words when you are reading the body of the 
articles. 
• Another way to lessen the output is to implement a simple stemmer in your parser. A 
stemmer is an algorithm that strips a word down to some base word. For example, you 
could implement a suffix-stripping stemmer, which removes the suffixes we place at the 
end of verbs. 
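Both suggestions above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list is a short hand-picked sample (not a standard list), the class name WordFilter is hypothetical, and the stemmer is a deliberately naive suffix stripper that will produce some non-words (for example, "continued" becomes "continu").

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class WordFilter {

    // A few very common English words; illustrative only, extend up to 30.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "of", "and", "a", "to", "in", "is", "that", "it", "for"));

    /** Returns true when the word should be ignored while counting. */
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase());
    }

    /**
     * Naive suffix stripping: removes "ing", "ed", or a plural/verb "s"
     * when enough of the word remains to be a plausible base.
     */
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("harvesting")); // prints harvest
        System.out.println(stem("showers"));    // prints shower
        System.out.println(isStopWord("The"));  // prints true
    }
}
```

Applying isStopWord before stem keeps the stop-word check working on the original spelling of each word.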
Keeping the word count:
• One data structure you can use to hold a word and a number is the class you created in 
problem 6 of Homework 6. As you read through the articles, you can create new objects 
of this class when you encounter a new word or update the count when you read in the 
same word. Note that these objects have to be stored in a data structure that can hold 
multiple items like arrays or array lists. 
• You can also use a map data structure, where the word (key) is mapped to a 
number (value) (think of the functions you work with in your math classes). There are a 
few implementations of the map data structure, such as HashMap and TreeMap. Whichever 
you choose, declare it as Map<String, Integer>, since the words are the keys and the count 
is the value. Note that this data structure already functions like an array or ArrayList, 
in that you can add multiple keys and values. So, do not combine it with the first 
suggestion. 
See https://docs.oracle.com/javase/8/docs/api/java/util/Map.html for more 
information on maps. 
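As a minimal sketch of the map suggestion, the class below keeps one count per word using the standard Map.merge and Map.getOrDefault methods (Java 8 and later). The class name WordCounter and its method names are hypothetical, and TreeMap is chosen here only because it keeps the words in alphabetical order; HashMap would work the same way.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCounter {

    // Word (key) mapped to the number of times it has been seen (value).
    private final Map<String, Integer> counts = new TreeMap<>();

    /** Inserts the word with count 1, or adds 1 to its existing count. */
    void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    /** Returns the count for a word, or 0 if it has never been added. */
    int countOf(String word) {
        return counts.getOrDefault(word, 0);
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        WordCounter counter = new WordCounter();
        for (String w : new String[] {"cocoa", "bahia", "cocoa"}) {
            counter.add(w);
        }
        System.out.println(counter.getCounts()); // prints {bahia=1, cocoa=2}
    }
}
```

Two separate WordCounter objects, one for topics and places and one for the body, would keep the two sets of counts apart as parts 1 and 2 require.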
Processing the files in one go:
• Since the file names are all the same except for the file number, you may want to use a 
for loop to loop through all the files.
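One way to build the file names in a loop is String.format with a zero-padded number. This sketch assumes the numbering starts at 000 and runs consecutively (so 22 files end at reut2-021.sgm); check the folder before relying on that. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class FileNames {

    /** Builds names such as reut2-000.sgm for indices 0 .. count-1. */
    static List<String> sgmFileNames(int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // %03d pads the number with leading zeros to three digits.
            names.add(String.format("reut2-%03d.sgm", i));
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : sgmFileNames(22)) {
            // Placeholder: open and parse the file here instead of printing.
            System.out.println(name);
        }
    }
}
```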
What to submit
Submit the following through Moodle:
1. Your commented .java file, or your .java file together with a report explaining your program. 
If you commented your program and explained each method and non-trivial statement, 
then you do not need to submit a report that explains how your program works. If you 
feel your comments are not enough to explain your program, or your comments are so 
long that they distract from the code, then submit a report detailing how your program works. 
Please format the report as a PDF. 
2. Your two output files. If you are submitting a separate report, you can paste these at the end 
of the report and not have to submit them separately. 
Sample article in an .sgm file. 
NEWID="1">
26-FEB-1987 15:01:01.79
cocoa
el-salvador usa uruguay
 
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW 02-26 0105

BAHIA COCOA REVIEW
SALVADOR, Feb 26 - Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
The dry period means the temporao will be late this year.
Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
