Project 1 Requirements

Tip

  1. Use the command $ ant build_analyzer to compile and build the class.
  2. Use the command $ ant run_tests_analyzer to test the class.
  3. Use the script $ ./runAnalyzer.sh YourFilePath to run the class.

Project Overview

You will build a project containing a package of classes. The project runs on the command line, processes a text file, and outputs a series of reports on the contents of the file. As the semester progresses we will add new functionality to the project.

Special note on project code

Projects must rely on coded solutions that have been covered in the class material, unless otherwise specified.

If the coded solution uses concepts not covered in class, you are required to do a code review. This review must include a detailed explanation of your coding choices and rationale for employing a solution beyond the scope of the class material.


Terminology

Analyzer: The name of the program. This also is used to reference a class that performs analysis. In this project that includes the FileSummaryAnalyzer and DistinctTokensAnalyzer classes.

Tokens: A word or string of characters in a file. Tokens do not include punctuation but do include numbers. For example, the following sentence contains 15 tokens, but only 14 distinct tokens (very is used twice).

Tokens are case-sensitive: capitalization is not normalized, so The and the are two different tokens.

"The @#42$% very, very, large robot jumped over the {^*!} moon, shouting "Hello World!" with excitement!"

List of tokens in the above sentence.

  1. The
  2. 42
  3. very
  4. very
  5. large
  6. robot
  7. jumped
  8. over
  9. the
  10. moon
  11. shouting
  12. Hello
  13. World
  14. with
  15. excitement

Specifications

Tests

The application must pass all the provided unit tests. Running the tests will be covered in lectures and labs. It is strongly recommended that you use Test-Driven Development (TDD): start the project by running the tests, then write code to make each test pass.


Code Quality

All java source files must comply with:


Input File

The analyzer needs an input file to process. The input file can be any ASCII file that contains text. This file can be the contents of books downloaded from the Internet, documentation for software, email from your mom, anything you want. It is highly recommended that you test the application with several files of varying size, from very small to large.

File Size

You must be prepared to test your application with a file that is bigger than 20MB with more than 4 million tokens. Your application must process this file properly. I will provide the "big file" in Slack for you to test against. Be sure that you test your application with this file before code review time and compare your output to the expected output. The expected number of total tokens for the "Big File" can be found in screenshots/project-1/code-review-prep.md.


Generating the Tokens

For the purposes of this project, you must use the regex \W (written as "\\W" in a Java string literal) to split the tokens out of each line in the input file:

inputLine.split("\\W");

No Empty Tokens Allowed

You will notice that the code above results in some empty tokens, or tokens with a length of zero. You will need to write a small bit of code to ignore any empty tokens, meaning, do not pass empty tokens to your analyzers for processing.
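One way to skip the empty tokens is sketched below. This is only an illustration under stated assumptions: the class name TokenSplitSketch and the method name nonEmptyTokens are made up for this example and are not part of the project requirements. In the real project the non-empty tokens would go to your analyzers rather than into a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; the name is not part of the project spec.
class TokenSplitSketch {

    /**
     * Splits one line on non-word characters and keeps only the non-empty tokens.
     * Collecting into a list is an assumption made so the result is easy to inspect.
     */
    static List<String> nonEmptyTokens(String inputLine) {
        List<String> tokens = new ArrayList<>();
        for (String token : inputLine.split("\\W")) {
            // Adjacent punctuation or spaces produce zero-length strings; skip them.
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "The @#42$% very, very, large robot jumped over the {^*!} moon, "
                + "shouting \"Hello World!\" with excitement!";
        // Prints the 15 tokens listed in the Terminology section.
        System.out.println(nonEmptyTokens(line));
    }
}
```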


Project Directory Structure

Your project should have the following directory structure. This style of directory structure is very commonly used by Java programmers. We will make this in the first lab, so don’t bother trying to do this now.

This structure will hold all projects for the course.

projects/
|-- build.xml
|-- docs/
|-- lib/
|-- output/
|-- README.md
|-- screenshots
|-- src/
    |-- java112/
        |-- analyzer/
        |-- labs1/
        |-- utilities/

Description of each directory:

  • The projects directory can be anywhere that you would like. This directory must be at the top level of your directory.
  • The docs directory will contain the documentation for the application in javadoc format.
  • The lib directory will contain the .jar files for the project.
  • The output directory will contain the output generated by your application.
  • The src directory will contain all of your .java source code files in their correct package layout.
  • The analyzer directory will contain the source code that is only part of project 1.

Exception Handling

There are places in this application that need exception handling. All exceptions that happen during the running of this application must be caught and the stack trace displayed to the command window. One exception that should be tested is if the input files are not found on the disk. This can be easily tested by entering a file name that is not on your computer.
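A minimal sketch of this kind of handling is shown below, assuming a hypothetical class ReadWithHandlingSketch with a countLines method (neither name comes from the project). It opens the file with try-with-resources and prints the stack trace when the file is missing or cannot be read. Returning -1 on failure is a choice made only for this example.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical class for illustration; not one of the required project classes.
class ReadWithHandlingSketch {

    /**
     * Counts the lines in a file.
     * Returns -1 if the file is missing or cannot be read (an assumption for this sketch).
     */
    static int countLines(String inputFilePath) {
        int lineCount = 0;
        // try-with-resources closes the reader even when an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFilePath))) {
            while (reader.readLine() != null) {
                lineCount++;
            }
        } catch (FileNotFoundException fileNotFound) {
            fileNotFound.printStackTrace();
            return -1;
        } catch (IOException inputOutputException) {
            inputOutputException.printStackTrace();
            return -1;
        }
        return lineCount;
    }

    public static void main(String[] args) {
        // A file name that should not exist, to exercise the not-found path.
        System.out.println(countLines("no-such-file.txt"));
    }
}
```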


Documentation

All source code must use the javadoc style of comments.

  • All classes should have a thorough explanation of that class in the class comment section.
  • All methods and constructors must be documented with all input parameters, the return type (if applicable) and thorough descriptions.

Packaging the Application

The source code for the application should be placed in a package named java112.analyzer. This will result in a directory structure in your projects/src directory.


Output Files

Output

All output files must be written to the output/ directory.

distinct_tokens.txt

This file will contain all the distinct tokens in the input file.

  • Must be named distinct_tokens.txt
  • ONLY one token on each line
  • Do not include any headers, labels, or extra text
  • No duplicates in the file
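The one-token-per-line format could be produced along the following lines. This is a sketch, not the required implementation: TokenFileWriterSketch and writeTokens are hypothetical names, and the boolean return value exists only to make the example easy to check.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical class for illustration; not one of the required project classes.
class TokenFileWriterSketch {

    /** Writes each token on its own line with no header; returns false if writing fails. */
    static boolean writeTokens(Collection<String> tokens, String outputFilePath) {
        try (PrintWriter writer = new PrintWriter(
                new BufferedWriter(new FileWriter(outputFilePath)))) {
            for (String token : tokens) {
                writer.println(token);
            }
            return true;
        } catch (IOException inputOutputException) {
            inputOutputException.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        // Writes to the system temp directory so the demo does not touch the project.
        String demoPath = new File(System.getProperty("java.io.tmpdir"),
                "distinct_tokens_demo.txt").getPath();
        System.out.println(writeTokens(Arrays.asList("42", "Hello", "The"), demoPath));
    }
}
```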

summary.txt

This file will contain summary information about the analysis of the input file. It must be named summary.txt.

It will contain the following information in this order:

  1. The name of the application generating the report. Make up a name for your application or call it "The Analyzer".
  2. Your name.
  3. Your email address.
  4. The absolute path of the input file that was analyzed.
  5. The date and time the file was analyzed.
  6. The last modified date of the analyzed file.
  7. The file size in bytes.
  8. The file URI of the analyzed file.
  9. The total number of individual tokens in the document.

Example:

Application: File Magic
Author: Eric Knapp
Author email: eknapp@madisoncollege.edu
File: /home/student/thomas-paine.txt
Date of analysis: Thu Jan 11 16:21:28 CST 2018
Last Modified:    Wed Jan 10 21:18:44 CST 2018
File Size: 2375590
File URI: file:/home/student/thomas-paine.txt
Total Tokens: 397952

Hint

Review the java.io.File documentation for information on how to get some of the file-related values such as file size, absolute path, uri, etc.
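As a rough illustration of the hint, the sketch below reads several of these values from a java.io.File object. The class name FileInfoSketch and the method describe are assumptions made for this example. The labels follow the example report above, but how you assemble and write the report in your own project is up to you.

```java
import java.io.File;
import java.util.Date;

// Hypothetical class for illustration; not one of the required project classes.
class FileInfoSketch {

    /** Builds the file-related lines of a summary as one string. */
    static String describe(File file) {
        String newline = System.lineSeparator();
        return "File: " + file.getAbsolutePath() + newline
                + "Date of analysis: " + new Date() + newline
                // lastModified() returns milliseconds since the epoch, so wrap it in a Date.
                + "Last Modified: " + new Date(file.lastModified()) + newline
                + "File Size: " + file.length() + newline
                + "File URI: " + file.toURI() + newline;
    }

    public static void main(String[] args) {
        // Falls back to the current directory when no path is given.
        File file = new File(args.length > 0 ? args[0] : ".");
        System.out.print(describe(file));
    }
}
```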


Project Classes

Analyzer Package

All classes will be in the java112.analyzer package.

Driver Class

Driver Class Purpose

  • The purpose of the Driver class is to kick off the Analyzer program.

  • What does it know? The FileAnalysis class (although there is no instance variable)

  • What does it do? Starts the Analyzer program with an input file to be analyzed.

Driver Class Details

  • The name of the class must be Driver.
  • The class will instantiate an instance of the project’s main processing class.
  • The class will call the main processing method of the main class passing the command line arguments array to the method.
  • No other code will be accepted for this class.

FileAnalysis Class

FileAnalysis Class Purpose

  • The purpose of the FileAnalysis class is to coordinate the file analysis.

  • What does it know? The number of arguments allowed, output file paths, FileSummaryAnalyzer class, and the DistinctTokensAnalyzer class

  • What does it do? This is the main controller for the project. It opens the input file, reads the input file, parses it into individual tokens, and sends those tokens to the analyzers for processing. Once processing is complete, it passes the name of the input file to the analyzers for generating the output report.

FileAnalysis Class Details

  • Must be named FileAnalysis.
  • The class will have a constant for the valid number of command-line arguments.
  • The class will have an instance variable for each Analyzer class. These variables must be named summaryAnalyzer and distinctAnalyzer.
  • The class must use many methods to perform its tasks. You can expect that the instructor will ask you to break up methods into more methods.
  • Each method should perform only one task.
  • All loops must be in their own method and no loops may be nested within other loops. Use method calls instead.
  • The class will have a method named analyze with the following signature:

    public void analyze(String[] arguments)
    

analyze() Method Details

  • The method will first test whether the correct number of arguments has been entered by the user when running the application. For project 1, this number will be 1. If the correct number is not entered, the application must output a message to the command line asking for the right input and then terminate the program.
  • The method will then call other methods to perform these tasks:
    • Create an instance of each Analyzer class and assign each instance to their respective instance variables: summaryAnalyzer and distinctAnalyzer.
    • Open the input file.
    • Loop through all the lines of the input file and generate individual tokens.
    • Pass generated tokens, one at a time, to all Analyzer instances via the processToken() method.
    • Call the generateOutputFile() method for each Analyzer class in a method named writeOutputFiles().
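The first step, the argument check, might look something like the sketch below. The names ArgumentCheckSketch, VALID_ARGUMENT_COUNT, and hasValidArgumentCount are hypothetical, and the wording of the message is only an example. The sketch also assumes that returning from analyze is enough to end the program because the caller does nothing afterwards. The remaining steps are left as a comment.

```java
// Hypothetical class for illustration; in the project this logic belongs in FileAnalysis.
class ArgumentCheckSketch {

    /** Valid number of command-line arguments for project 1 (constant name is assumed). */
    private static final int VALID_ARGUMENT_COUNT = 1;

    /** Returns true when exactly the expected number of arguments was supplied. */
    static boolean hasValidArgumentCount(String[] arguments) {
        return arguments.length == VALID_ARGUMENT_COUNT;
    }

    public void analyze(String[] arguments) {
        if (!hasValidArgumentCount(arguments)) {
            System.out.println("Please enter one argument: the path of the input file.");
            // Assumption: nothing runs after analyze() returns, so the program ends here.
            return;
        }
        // The remaining steps (create analyzers, read the file, write output) would be
        // delegated to other methods from here.
    }

    public static void main(String[] args) {
        new ArgumentCheckSketch().analyze(args);
    }
}
```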

TokenAnalyzer Interface

TokenAnalyzer Interface Purpose

  • The purpose of the TokenAnalyzer interface is to define the method signatures for processing tokens and generating output.

  • What does it know? Nothing.

  • What does it do? Defines method signatures.

TokenAnalyzer Interface Details

  • The interface must be named TokenAnalyzer
  • It is implemented by any class that performs an analysis.
  • The interface will have two methods:
    void processToken(String token)
    void generateOutputFile(String inputFilePath, String outputFilePath)
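To show how a class can satisfy these two signatures, here is a small sketch. The interface is declared as described above. LongestTokenSketch is a made-up analyzer used purely for illustration and is not one of the required classes, and its report format is an assumption.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// The interface as described in this section.
interface TokenAnalyzer {
    void processToken(String token);
    void generateOutputFile(String inputFilePath, String outputFilePath);
}

// Hypothetical analyzer for illustration only; it remembers the longest token seen.
class LongestTokenSketch implements TokenAnalyzer {

    private String longestToken = "";

    public String getLongestToken() {
        return longestToken;
    }

    @Override
    public void processToken(String token) {
        if (token.length() > longestToken.length()) {
            longestToken = token;
        }
    }

    @Override
    public void generateOutputFile(String inputFilePath, String outputFilePath) {
        try (PrintWriter writer = new PrintWriter(
                new BufferedWriter(new FileWriter(outputFilePath)))) {
            writer.println("Longest token in " + inputFilePath + ": " + longestToken);
        } catch (IOException inputOutputException) {
            inputOutputException.printStackTrace();
        }
    }
}
```

Because the calling code only depends on TokenAnalyzer, it can pass each token to any number of analyzers without knowing their concrete types.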

FileSummaryAnalyzer class

FileSummaryAnalyzer Class Purpose

  • The purpose of the FileSummaryAnalyzer class is to determine information about the author, the analyzed file (including the number of tokens), and to output the results.

  • What does it know? Total number of tokens.

  • What does it do? Processes tokens and outputs results.

FileSummaryAnalyzer Class Details

  • Must be named FileSummaryAnalyzer.
  • Implements the TokenAnalyzer interface.
  • Has a zero-parameter constructor.
  • Has a totalTokensCount instance variable and getter method.
  • Do not create a setter method for the totalTokensCount variable.
  • No other instance variables are allowed for this class.
    // Only allowed instance variable
    private int totalTokensCount;

    public int getTotalTokensCount() {
        return totalTokensCount;
    }

DistinctTokensAnalyzer class

DistinctTokensAnalyzer Class Purpose

  • The purpose of the DistinctTokensAnalyzer class is to determine the number of distinct tokens in a file and output the results.

  • What does it know? The distinct tokens.

  • What does it do? Processes tokens and outputs results.

DistinctTokensAnalyzer Class Details

  • Must be named DistinctTokensAnalyzer.
  • Implements the TokenAnalyzer interface.
  • Has a zero-parameter constructor.
  • Has a distinctTokens instance variable and getter method.
  • Do not create a setter method for the distinctTokens variable.
  • No other instance variables will be allowed for this class.
  • The zero-parameter constructor creates an instance of a TreeSet and assigns it to the distinctTokens variable.
    // Only allowed instance variable
    private Set<String> distinctTokens;

    public Set<String> getDistinctTokens() {
        return distinctTokens;
    }
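The short sketch below illustrates why a TreeSet suits this class: it silently ignores duplicates and keeps its elements in sorted order. DistinctTokensSketch and its distinct method are hypothetical names used only for this demonstration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class for illustration; not one of the required project classes.
class DistinctTokensSketch {

    /** Adds every token to a TreeSet, which drops duplicates and sorts the rest. */
    static Set<String> distinct(List<String> tokens) {
        Set<String> distinctTokens = new TreeSet<>();
        for (String token : tokens) {
            distinctTokens.add(token);
        }
        return distinctTokens;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("very", "The", "very", "the", "42");
        // Prints [42, The, the, very]: digits sort before uppercase, uppercase before lowercase.
        System.out.println(distinct(tokens));
    }
}
```

Note that The and the both survive, which matches the case-sensitive definition of a token in the Terminology section.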

Rubric

All of the following must be satisfied to achieve a "Met" status

Criteria: Met Status

Screenshots: Screenshots clearly show tests passing, expected output, error handling, and no JavaDoc errors.

Debugging & Problem-Solving: Code is free from errors, and all provided tests pass without any issues.

Code Quality: Code is exceptionally clean, efficient, and maintainable. Follows best practices, coding standards, and programming principles.

Git and GitHub: You have consistently used Git and GitHub for version control. Commits are meaningful and atomic. At least 25 commits have been made during Unit 1.

Java IO: Correct Java IO classes are used to read and write data to files.

Exception Handling: Exception handling is used correctly throughout the application, including providing user-friendly error messages. Try-with-resources is implemented correctly.

Collections: The appropriate concrete implementation of the Collections interface is used, such as Sets and Lists.

Functionality: The FileSummaryAnalyzer and DistinctTokensAnalyzer produce the correct output. The total number of tokens in the BigFile is 4468588. The last unique token in the list is zygomaticus.

Code Documentation: All classes, methods, instance variables, and constructors are thoroughly documented with accurate descriptions and proper JavaDoc comments.

External Sources: External sources (websites, classmates, AI tools, etc), if utilized, are referenced and documented within the code as comments.

Reflection: Issue created correctly with thoughtful answers to the reflection questions.