Project 1 Requirements
Tip
- Use the command
$ ant build_analyzer
to compile and build the class.
- Use the command
$ ant run_tests_analyzer
to test the class.
- Use the script
$ ./runAnalyzer.sh YourFilePath
to run the class.
Project Overview
You will build a project containing a package of classes. The project runs on the command line, processes a text file, and outputs a series of reports on the contents of the file. As the semester progresses we will add new functionality to the project.
Special note on project code
Projects must rely on coded solutions that have been covered in the class material, unless otherwise specified.
If the coded solution uses concepts not covered in class, you are required to do a code review. This review must include a detailed explanation of your coding choices and rationale for employing a solution beyond the scope of the class material.
Terminology
Analyzer: The name of the program. The term also refers to any class that performs analysis. In this project that includes the FileSummaryAnalyzer and DistinctTokensAnalyzer classes.
Tokens: A word or string of characters in a file. Tokens do not include punctuation but do include numbers. For example, the following sentence contains 15 tokens, but only 14 distinct tokens (very is used twice).
Tokens are case-sensitive (capitalization is not normalized), so The and the are two different tokens.
"The @#42$% very, very, large robot jumped over the {^*!} moon, shouting "Hello World!" with excitement!"
List of tokens in the above sentence.
- The
- 42
- very
- very
- large
- robot
- jumped
- over
- the
- moon
- shouting
- Hello
- World
- with
- excitement
Specifications
Tests
The application must pass all the provided unit tests. Running the tests will be covered in lectures and labs. It is strongly recommended that you use Test-Driven Development (TDD): start the project by running the tests, then write code to make each test pass.
Code Quality
All java source files must comply with:
Input File
The analyzer needs an input file to process. The input file can be any ASCII file that contains text. This file can be the contents of books downloaded from the Internet, documentation for software, email from your mom, anything you want. It is highly recommended that you test the application with several files of varying size, from very small to large.
File Size
You must be prepared to test your application with a file that is bigger than 20 MB with more than 4 million tokens. Your application must process this file properly. I will provide the "big file" in Slack for you to test against. Be sure that you test your application with this file before code review time and compare your output to the expected output. The expected number of total tokens for the "Big File" can be found in screenshots/project-1/code-review-prep.md.
Generating the Tokens
For the purposes of this project, you must use the regex \\W to split the tokens out of each line in the input file:
inputLine.split("\\W");
No Empty Tokens Allowed
You will notice that the code above results in some empty tokens, or tokens with a length of zero. You will need to write a small bit of code to ignore any empty tokens, meaning, do not pass empty tokens to your analyzers for processing.
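As a minimal sketch of the filtering described above, the following splits one line with the required regex and keeps only tokens whose length is greater than zero. The class name TokenSplitSketch and the method name nonEmptyTokens are made up for illustration and are not required by the project; in the real application you would pass each kept token to your analyzers rather than collect them in a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class TokenSplitSketch {

    /** Splits one line on non-word characters and drops zero-length tokens. */
    static List<String> nonEmptyTokens(String inputLine) {
        List<String> tokens = new ArrayList<>();
        for (String token : inputLine.split("\\W")) {
            // Adjacent punctuation and spaces produce empty strings; skip them.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "The @#42$% very, very, large robot!";
        System.out.println(nonEmptyTokens(line));
        // prints [The, 42, very, very, large, robot]
    }
}
```

Running the same method on the full example sentence from the Terminology section should give the 15 tokens listed there.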
Project Directory Structure
Your project should have the following directory structure. This style of directory structure is very commonly used by Java programmers. We will make this in the first lab, so don’t bother trying to do this now.
This structure will hold all projects for the course.
projects/
|-- build.xml
|-- docs/
|-- lib/
|-- output/
|-- README.md
|-- screenshots/
|-- src/
    |-- java112/
        |-- analyzer/
        |-- labs1/
        |-- utilities/
Description of each directory:
- The projects directory can be anywhere that you would like. This directory must be at the top level of your directory.
- The docs directory will contain the documentation for the application in javadoc format.
- The lib directory will contain the .jar files for the project.
- The output directory will contain the output generated by your application.
- The src directory will contain all of your .java source code files in their correct package layout.
- The analyzer directory will contain the source code that is only part of project 1.
Exception Handling
There are places in this application that need exception handling. All exceptions that happen during the running of this application must be caught and the stack trace displayed to the command window. One exception that should be tested is if the input files are not found on the disk. This can be easily tested by entering a file name that is not on your computer.
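One possible shape for this, sketched under assumptions: the class name FileReadSketch, the method name countLines, and the -1 return value are all invented for this example and are not part of the project specification. The sketch opens a file with try-with-resources, so the reader is closed automatically, and prints the stack trace when the file is missing or cannot be read.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical class name, used only for this illustration.
class FileReadSketch {

    /** Returns the number of lines read, or -1 if the file could not be read. */
    static int countLines(String inputFilePath) {
        int lineCount = 0;
        // try-with-resources closes the reader even when an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFilePath))) {
            while (reader.readLine() != null) {
                lineCount++;
            }
        } catch (FileNotFoundException fileNotFound) {
            // Raised when the path does not exist on disk.
            fileNotFound.printStackTrace();
            return -1;
        } catch (IOException inputOutputException) {
            // Raised for other read failures.
            inputOutputException.printStackTrace();
            return -1;
        }
        return lineCount;
    }

    public static void main(String[] args) {
        // A file name that is assumed not to exist, to trigger the first catch block.
        System.out.println(countLines("no-such-file.txt"));
    }
}
```

The more specific FileNotFoundException is caught before IOException because it is a subclass of IOException; the compiler rejects the opposite order.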
Documentation
All source code must use the javadoc style of comments.
- All classes should have a thorough explanation of that class in the class comment section.
- All methods and constructors must be documented with all input parameters, the return type (if applicable) and thorough descriptions.
Packaging the Application
The source code for the application should be placed in a package named java112.analyzer. This will result in a directory structure in your projects/src directory.
Output Files
Output
All output files must be written to the output/ directory.
distinct_tokens.txt
This file will contain all the distinct tokens in the input file.
- Must be named distinct_tokens.txt
- ONLY one token on each line
- Do not include any headers, labels, or extra text
- No duplicates in the file
summary.txt
This file will contain summary information about the analysis of the input file. It must be named summary.txt.
It will contain the following information in this order:
- The name of the application generating the report. Make up a name for your application or call it "The Analyzer".
- Your name.
- Your email address.
- The absolute path of the input file that was analyzed.
- The date and time the file was analyzed.
- The last modified date of the analyzed file.
- The file size in bytes.
- The file URI of the analyzed file.
- The total number of individual tokens in the document.
Example:
Application: File Magic
Author: Eric Knapp
Author email: eknapp@madisoncollege.edu
File: /home/student/thomas-paine.txt
Date of analysis: Thu Jan 11 16:21:28 CST 2018
Last Modified: Wed Jan 10 21:18:44 CST 2018
File Size: 2375590
File URI: file:/home/student/thomas-paine.txt
Total Tokens: 397952
Hint
Review the java.io.File documentation for information on how to get some of the file-related values such as file size, absolute path, uri, etc.
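To make the hint concrete, here is a small sketch of the java.io.File methods that supply several of the summary values. The class name FileInfoSketch and the method name describe are hypothetical, and the labels simply mirror the example report above; this is not the required layout of your analyzer class.

```java
import java.io.File;
import java.util.Date;

// Hypothetical class name, used only for this illustration.
class FileInfoSketch {

    /** Builds the file-related lines of a summary report for the given file. */
    static String describe(File file) {
        String newline = System.lineSeparator();
        return "File: " + file.getAbsolutePath() + newline
                // lastModified() returns milliseconds since the epoch.
                + "Last Modified: " + new Date(file.lastModified()) + newline
                // length() returns the size in bytes (0 if the file does not exist).
                + "File Size: " + file.length() + newline
                + "File URI: " + file.toURI() + newline;
    }

    public static void main(String[] args) {
        // "thomas-paine.txt" is only a placeholder path for this sketch.
        String path = args.length > 0 ? args[0] : "thomas-paine.txt";
        System.out.print(describe(new File(path)));
    }
}
```

The date and time of analysis can be produced the same way with a new Date() created when the report is written.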
Project Classes
Analyzer Package
All classes will be in the java112.analyzer
package.
Driver Class
Driver Class Purpose
- The purpose of the Driver class is to kick off the Analyzer program.
- What does it know? The FileAnalysis class (although there is no instance variable).
- What does it do? Starts the Analyzer program with an input file to be analyzed.
Driver Class Details
- The name of the class must be Driver.
- The class will instantiate an instance of the project’s main processing class.
- The class will call the main processing method of the main class passing the command line arguments array to the method.
- No other code will be accepted for this class.
FileAnalysis Class
FileAnalysis Class Purpose
- The purpose of the FileAnalysis class is to coordinate the file analysis.
- What does it know? The number of arguments allowed, output file paths, the FileSummaryAnalyzer class, and the DistinctTokensAnalyzer class.
- What does it do? This is the main controller for the project. It opens the input file, reads the input file, parses the input file into individual tokens, and sends those tokens to the analyzers for processing. Once processing is complete, it passes the name of the input file to the analyzers for generating the output report.
FileAnalysis Class Details
- Must be named FileAnalysis.
- The class will have a constant for the valid number of command-line arguments.
- The class will have an instance variable for each Analyzer class. These variables must be named summaryAnalyzer and distinctAnalyzer.
- The class must use many methods to perform its tasks. You can expect that the instructor will ask you to break up methods into more methods.
- Each method should perform only one task.
- All loops must be in their own method and no loops may be nested within other loops. Use method calls instead.
- The class will have a method named analyze with the following signature:
public void analyze(String[] arguments)
analyze() Method Details
- The method will first test if the correct number of arguments have been entered by the user when running the application. For project 1, this number will be 1. If the correct number is not entered then the application must output a message to the command line asking for the right input and then terminate the program.
- The method will then call other methods to perform these tasks:
  - Create an instance of each Analyzer class and assign each instance to its respective instance variable: summaryAnalyzer and distinctAnalyzer.
  - Open the input file.
  - Loop through all the lines of the input file and generate individual tokens.
  - Pass generated tokens, one at a time, to all Analyzer instances via the processToken() method.
  - Call the generateOutputFile() method for each Analyzer class in a method named writeOutputFiles().
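The argument check at the start of analyze() could look roughly like the sketch below. Everything named here is an assumption made for illustration: the class ArgumentCheckSketch, the constant VALID_ARGUMENT_COUNT, the helper hasValidArgumentCount, and the wording of the message. The sketch ends the program by returning from analyze(), which is one of several ways to terminate.

```java
// Hypothetical class name, used only for this illustration.
class ArgumentCheckSketch {

    // Hypothetical constant name; project 1 expects exactly one argument.
    private static final int VALID_ARGUMENT_COUNT = 1;

    /** Reports whether the user supplied the expected number of arguments. */
    static boolean hasValidArgumentCount(String[] arguments) {
        return arguments.length == VALID_ARGUMENT_COUNT;
    }

    public void analyze(String[] arguments) {
        if (!hasValidArgumentCount(arguments)) {
            System.out.println("Please enter one argument: the path of the input file.");
            return; // nothing else runs, so the program ends
        }
        // In the real class, the remaining tasks would be delegated to other methods.
        System.out.println("Analyzing " + arguments[0]);
    }

    public static void main(String[] args) {
        new ArgumentCheckSketch().analyze(args);
    }
}
```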
TokenAnalyzer Interface
TokenAnalyzer Class Purpose
- The purpose of the TokenAnalyzer interface is to define the method signatures for processing tokens and generating output.
- What does it know? Nothing.
- What does it do? Defines method signatures.
TokenAnalyzer Class Details
- The interface must be named TokenAnalyzer.
- It is implemented by any class that performs an analysis.
- The interface will have two methods:
void processToken(String token)
void generateOutputFile(String inputFilePath, String outputFilePath)
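To show how a class plugs into this interface, here is a sketch with a made-up analyzer that is not part of the project. LongestTokenSketch and its getLongestToken method are hypothetical, and the package declaration is left out so the example stands alone. The two interface methods use the signatures given above.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

interface TokenAnalyzer {
    void processToken(String token);
    void generateOutputFile(String inputFilePath, String outputFilePath);
}

// Hypothetical analyzer, not required by the project: remembers the longest token.
class LongestTokenSketch implements TokenAnalyzer {

    private String longestToken = "";

    public String getLongestToken() {
        return longestToken;
    }

    @Override
    public void processToken(String token) {
        if (token.length() > longestToken.length()) {
            longestToken = token;
        }
    }

    @Override
    public void generateOutputFile(String inputFilePath, String outputFilePath) {
        // try-with-resources flushes and closes the writer automatically.
        try (PrintWriter writer = new PrintWriter(
                new BufferedWriter(new FileWriter(outputFilePath)))) {
            writer.println(longestToken);
        } catch (IOException inputOutputException) {
            inputOutputException.printStackTrace();
        }
    }

    public static void main(String[] args) {
        LongestTokenSketch analyzer = new LongestTokenSketch();
        analyzer.processToken("robot");
        analyzer.processToken("excitement");
        System.out.println(analyzer.getLongestToken()); // prints excitement
    }
}
```

Because the caller only depends on processToken() and generateOutputFile(), it can treat every analyzer the same way, whatever each one tracks internally.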
FileSummaryAnalyzer class
FileSummaryAnalyzer Class Purpose
- The purpose of the FileSummaryAnalyzer class is to determine information about the author, the analyzed file (including the number of tokens), and to output the results.
- What does it know? Total number of tokens.
- What does it do? Processes tokens and outputs results.
FileSummaryAnalyzer Class Details
- Must be named FileSummaryAnalyzer.
- Implements the TokenAnalyzer interface.
- Has a zero-parameter constructor.
- Has a totalTokensCount instance variable and getter method.
- Do not create a setter method for the totalTokensCount variable.
- No other instance variables are allowed for this class.
// Only allowed instance variable
private int totalTokensCount;

public int getTotalTokensCount() {
    return totalTokensCount;
}
DistinctTokensAnalyzer class
DistinctTokensAnalyzer Class Purpose
- The purpose of the DistinctTokensAnalyzer class is to determine the distinct tokens in a file and output the results.
- What does it know? The distinct tokens.
- What does it do? Processes tokens and outputs results.
DistinctTokensAnalyzer Class Details
- Must be named DistinctTokensAnalyzer.
- Implements the TokenAnalyzer interface.
- Has a zero-parameter constructor.
- Has a distinctTokens instance variable and getter method.
- Do not create a setter method for the distinctTokens variable.
- No other instance variables will be allowed for this class.
- The zero-parameter constructor creates an instance of a TreeSet and assigns it to the distinctTokens variable.
// Only allowed instance variable
private Set<String> distinctTokens;

public Set<String> getDistinctTokens() {
    return distinctTokens;
}
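A short sketch of why a TreeSet suits this job: it silently ignores duplicates and keeps its elements in sorted order. The class name DistinctTokensSketch and the method name distinct are invented for this example, and the sample tokens come from the Terminology section.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class name, used only for this illustration.
class DistinctTokensSketch {

    /** Collects tokens into a TreeSet, which drops duplicates and sorts them. */
    static Set<String> distinct(String[] tokens) {
        Set<String> distinctTokens = new TreeSet<>();
        for (String token : tokens) {
            // add() has no effect when the token is already present.
            distinctTokens.add(token);
        }
        return distinctTokens;
    }

    public static void main(String[] args) {
        String[] tokens = {"very", "The", "very", "the", "42"};
        System.out.println(distinct(tokens));
        // prints [42, The, the, very]
    }
}
```

String ordering is based on character codes, so digits sort before uppercase letters and uppercase letters sort before lowercase ones. That is also why The and the remain separate entries.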
Rubric
All of the following must be satisfied to achieve a "Met" status
Criteria | Met Status |
---|---|
Screenshots | Screenshots clearly show tests passing, expected output, error handling, and no JavaDoc errors. |
Debugging & Problem-Solving | Code is free from errors, and all provided tests pass without any issues. |
Code Quality | Code is exceptionally clean, efficient, and maintainable. Follows best practices, coding standards, and programming principles. |
Git and GitHub | You have consistently used Git and GitHub for version control. Commits are meaningful and atomic. At least 25 commits have been made during Unit 1. |
Java IO | Correct Java IO classes are used to read and write data to files. |
Exception Handling | Exception handling is used correctly throughout the application, including providing user-friendly error messages. Try-with-resources is implemented correctly. |
Collections | The appropriate concrete implementations of the Collections Framework interfaces, such as Set and List, are used. |
Functionality | The FileSummaryAnalyzer and DistinctTokensAnalyzer produce the correct output. The total number of tokens in the BigFile is 4468588. The last unique token in the list is zygomaticus. |
Code Documentation | All classes, methods, instance variables, and constructors are thoroughly documented with accurate descriptions and proper JavaDoc comments. |
External Sources | External sources (websites, classmates, AI tools, etc.), if utilized, are referenced and documented within the code as comments. |
Reflection | Issue created correctly with thoughtful answers to the reflection questions. |