R Language Assignment Question 3
Assignment 1 – CSC3060 “AIDA”
Introduction
In this assignment, you will:
- Create a dataset of handwritten symbols (which you will use for your analyses and experiments in the rest of Assignment 1, and in Assignment 2)
- Calculate features (i.e. variables) from the handwritten symbols which may be useful for distinguishing between the different symbols automatically
- Perform statistical analysis of the datasets, using methods of statistical inference.
When you use a procedure that has an element of randomness, please use the seed value 3060 (your code should give the same results each time it runs).
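As a minimal sketch of how a fixed seed makes a random procedure repeatable (shown in Python, one of the languages allowed for Sections 1 and 2), the helper below draws values from a generator seeded with 3060. The function name `reproducible_sample` and its parameters are illustrative assumptions, not part of the assignment.

```python
import random

SEED = 3060  # seed value specified by the assignment


def reproducible_sample(population, k, seed=SEED):
    """Draw k items without replacement using a generator seeded with `seed`.

    Hypothetical helper for illustration: a local random.Random instance is
    used so that other code touching the global generator cannot change the
    result.
    """
    rng = random.Random(seed)
    return rng.sample(list(population), k)


first_run = reproducible_sample(range(100), 5)
second_run = reproducible_sample(range(100), 5)
print(first_run == second_run)  # True: the same seed gives the same draw
```

Seeding a local generator inside the function is one possible design; calling `random.seed(3060)` once at the top of a script is an equally reasonable alternative.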
Sections 1 and 2 of this Assignment can be completed in one of the following programming languages: Python, R, Java. Section 3 must be completed in R.
Please read carefully the information about the assessment criteria and marking process at the end of this document.
Section 1 (8%): Creating a dataset
This section asks you to build a dataset of images composed of written numbers, letters and mathematical symbols. Each image is represented by a black & white matrix with size 20 rows by 20 columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As such, one image can be stored in a plaintext “.csv” file containing the matrix (and no headers), as in these examples:
Classa b 1 3
Example Image
Image Matrix csv file
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0 0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,0,1,1,0,1,1,1,0,1,0,0 0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0 0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0 0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0 0,0,1,0,0,1,1,1,0,0,0,1,0,0,0,0 0,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0 0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0 0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0 0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Figure 1: Examples of handwritten images and their matrix representation.
The goal is to create a dataset containing eight handwritten images of each of the digits {1,2,3,4,5,6,7}, eight handwritten images of each of the letters {a,b,c,d,e,f,g}, and eight handwritten images of each of the mathematical symbols {<, >, =, ≤, ≥, ≠, ≈}. We will refer to these as the digit, letter and math datasets, respectively. Each image should be obtained by writing a hand-written symbol yourself (preferably with a touch screen, using the lab computers, although it is fine if you create them using the computer mouse). The quality of the drawing is not essential, as long as the symbol can easily be read by a human. The image will vary from sample to sample; however, each character should fit reasonably well in the 20×20 box (i.e. do not draw a tiny character in one corner of the 20×20 box; this will make your life easier when it comes to doing analyses!).
You may use whatever means you prefer to obtain the images and .csv files. However, a suggestion is to use the software GIMP (http://www.gimp.org). Using GIMP, you can create a new image with 20 by 20 points (pt), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This will give you a small white square, which you can magnify to e.g. 2000% in order to make it easier to draw on. To draw on the image, you can select the pencil tool and adjust the brush size to (e.g.) 1 pixel. The standard file formats of GIMP are useful to save the images, but we need a more easily readable format. One good option is to export as PGM, type ASCII. In this format, each image becomes a text file with a header consisting of the following four lines:
P2
# CREATOR: …
20 20
255
The third and fourth lines of the header specify the pixel array size and the maximum allowed pixel value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing fully white; see footnote 1.)
The remaining lines of the file specify the pixel values, with one value on each line; the total number of pixel values should correspond to the specified array size (i.e. 20*20=400).
For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in Figure 1 above. You shall save each image matrix as a csv file following the specification above, and using the filename STUDENTNR_LABEL_INDEX.csv, where STUDENTNR is your student number (e.g. 123456), INDEX is a number from 1 to 8, indexing the set of eight images you must create for each symbol, and LABEL is a numeric code that uniquely identifies the type of symbol.
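The conversion described above can be sketched as follows, in Python (one of the permitted languages). This is only an illustration under stated assumptions: the function names `pgm_ascii_to_matrix`, `matrix_to_csv` and `image_filename` are hypothetical, and the tiny 2×2 image in the demonstration is made up to keep the example short.

```python
def pgm_ascii_to_matrix(pgm_text, threshold=128):
    """Convert ASCII PGM (type P2) text into rows of 0/1 values (1 = black)."""
    # Drop comments (from '#' to end of line) and split on whitespace, so the
    # parser does not depend on how the values are laid out across lines.
    tokens = []
    for line in pgm_text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())
    if len(tokens) < 4 or tokens[0] != "P2":
        raise ValueError("expected an ASCII PGM file starting with 'P2'")
    width, height = int(tokens[1]), int(tokens[2])
    # tokens[3] is the maximum pixel value (255 in the header shown above).
    values = [int(token) for token in tokens[4:]]
    if len(values) != width * height:
        raise ValueError("pixel count does not match the size in the header")
    # A value below the threshold is a black pixel (1), otherwise white (0).
    return [
        [1 if values[r * width + c] < threshold else 0 for c in range(width)]
        for r in range(height)
    ]


def matrix_to_csv(matrix):
    """Render the matrix as comma-separated rows with no header."""
    return "\n".join(",".join(str(v) for v in row) for row in matrix) + "\n"


def image_filename(student_nr, label, index):
    """Build a name of the form STUDENTNR_LABEL_INDEX.csv."""
    return f"{student_nr}_{label}_{index}.csv"


# Made-up 2x2 image, only to show the conversion end to end.
demo_pgm = "P2\n# made-up comment line\n2 2\n255\n0\n255\n200\n100\n"
demo_matrix = pgm_ascii_to_matrix(demo_pgm)
print(demo_matrix)                     # [[1, 0], [0, 1]]
print(matrix_to_csv(demo_matrix))      # 1,0 then 0,1 on separate lines
print(image_filename(123456, 25, 8))   # 123456_25_8.csv
```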
We will use the following codes to label the different types of images:
1 For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format
Symbol | Label |
1 | 11 |
2 | 12 |
3 | 13 |
4 | 14 |
5 | 15 |
6 | 16 |
7 | 17 |
Symbol | Label |
a | 21 |
b | 22 |
c | 23 |
d | 24 |
e | 25 |
f | 26 |
g | 27 |
Symbol | Label |
< | 31 |
> | 32 |
= | 33 |
≤ | 34 |
≥ | 35 |
≠ | 36 |
≈ | 37 |
For example, if your student number is 123456, then 123456_25_8.csv would be the eighth image you created for the letter ‘e’. (As well as creating the csv files, you may also want to keep the PGM files, in case you need to inspect the data later on).
As part of your submission, upload the csv files that you create in a directory called “section1_images”, along with any code you wrote to create the csv files, in a folder called “section1_code” (see submission instructions at the end of this document).
It is important to upload the images in the correct csv format as these files will be used to verify your calculations in the next section.
In your report, briefly explain in your own words how you created the images and obtained the matrices from them.
Section 2 (10%): Feature engineering
Using each 20×20 matrix obtained from an image as described above, you must create an array of characteristics that describe some features of the image. Each feature will be a number (i.e. each feature is a numeric variable). There are 20 features in total. In the feature definitions that follow, a pixel has 8 neighbours, which will be referred to by their positions relative to the pixel: upper-left, upper, upper-right, left, right, lower-left, lower, and lower-right.
Features to be calculated:
Feature Index | Feature Short Name | Feature Description |
- | label | The true symbol in the image (represented by one of 21 LABEL codes). Note that the label is not a true feature, and should not be used as a feature for statistical tests or during model training. |
- | index | The index of this image instance. |
1 | nr_pix | The number of black pixels in the image. |
2 | height | Number of rows containing at least one black pixel |
3 | width | Number of columns containing at least one black pixel |
4 | tallness | Ratio of height to width; i.e. feature 2 divided by feature 3 |
5 | rows_with_1 | Number of rows with exactly one black pixel |
6 | cols_with_1 | Number of columns with exactly one black pixel |
7 | rows_with_5+ | Number of rows with five or more black pixels |
8 | cols_with_5+ | Number of columns with five or more black pixels |
9 | 1neigh | Number of black pixels with exactly 1 neighbouring pixel |
10 | 3+neigh | Number of black pixels with 3 or more neighbours |
11 | none_below | Number of black pixels with no neighbours in the lower-left, lower, or lower-right positions |
12 | none_above | Number of black pixels with no neighbours in the upper-left, upper, or upper-right positions |
13 | none_before | Number of black pixels with no neighbours in the upper-left, left, or lower-left positions |
14 | none_after | Number of black pixels with no neighbours in the upper-right, right, or lower-right positions |
15 | nr_regions | Two black pixels A and B are connected if they are neighbours of each other, or if a black pixel neighbour of A is connected to B (this definition is actually symmetric); a connected region is a maximal set of black pixels which are connected to each other; this feature is the number of connected regions in the image. |
16 | nr_eyes | In a written character, an “eye” is a region of whitespace that is completely surrounded by lines of the character. For example, “A” contains one eye, “B” contains two eyes, and “C” contains no eyes. A region of whitespace is an eye if there is a ring of black pixels surrounding it which are all connected (i.e. they form a chain of neighbours). This feature is the number of eyes in the image. |
17 | r5_c5 | (Number of rows with at least five black pixels) minus (number of columns with at least five black pixels). |
18 | bd | Design your own feature which you think may be useful for distinguishing between “b” and “d” character images. |
19 | [your label] | Design any other feature you like, which you think may be useful for distinguishing between symbols. |
20 | [your label] | Design any other feature you like, which you think may be useful for distinguishing between symbols. |
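As a hedged illustration, the row- and column-based features in the table (features 1 to 8 and 17) reduce to row sums and column sums of the 0/1 matrix. The Python sketch below uses a hypothetical function name, `row_column_features`, and a made-up 4×6 matrix rather than a real 20×20 image; returning 0.0 for the tallness of an empty image is an assumption, since the specification does not cover that case.

```python
def row_column_features(matrix):
    """Compute features 1-8 and 17 from a 0/1 matrix (1 = black pixel)."""
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]  # zip(*m) iterates columns
    height = sum(1 for s in row_sums if s >= 1)
    width = sum(1 for s in col_sums if s >= 1)
    rows_5 = sum(1 for s in row_sums if s >= 5)
    cols_5 = sum(1 for s in col_sums if s >= 5)
    return {
        "nr_pix": sum(row_sums),
        "height": height,
        "width": width,
        # Assumption: an image with no black pixels gets tallness 0.0.
        "tallness": height / width if width else 0.0,
        "rows_with_1": sum(1 for s in row_sums if s == 1),
        "cols_with_1": sum(1 for s in col_sums if s == 1),
        "rows_with_5+": rows_5,
        "cols_with_5+": cols_5,
        "r5_c5": rows_5 - cols_5,
    }


# Made-up example: a horizontal bar of five pixels with a short stem below it.
demo = [
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
demo_features = row_column_features(demo)
print(demo_features["nr_pix"], demo_features["height"], demo_features["width"])  # 7 3 5
```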
Your task in this section is to write code to calculate each of the features above. In calculating pixel neighbours, you can assume that the images are padded on each side with white pixels.
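The padding assumption simplifies neighbour lookups, because every black pixel then has all 8 neighbour positions available. The Python sketch below illustrates this for features 9 to 15. It assumes that "neighbour" in the feature definitions means a black neighbouring pixel, and the names `pad`, `neighbour_features` and `count_regions` are hypothetical. Connected regions are counted with a breadth-first flood fill, which is one common way to implement the connectivity definition of feature 15, not necessarily the intended one.

```python
from collections import deque

# (row, column) offsets of the 8 neighbour positions around a pixel.
NEIGHBOUR_OFFSETS = {
    "upper-left": (-1, -1), "upper": (-1, 0), "upper-right": (-1, 1),
    "left": (0, -1), "right": (0, 1),
    "lower-left": (1, -1), "lower": (1, 0), "lower-right": (1, 1),
}


def pad(matrix):
    """Surround the matrix with a one-pixel border of white (0) pixels."""
    width = len(matrix[0]) + 2
    return [[0] * width] + [[0, *row, 0] for row in matrix] + [[0] * width]


def neighbour_features(matrix):
    """Compute features 9-14, treating only black pixels as neighbours."""
    p = pad(matrix)
    sides = {
        "none_below": {"lower-left", "lower", "lower-right"},
        "none_above": {"upper-left", "upper", "upper-right"},
        "none_before": {"upper-left", "left", "lower-left"},
        "none_after": {"upper-right", "right", "lower-right"},
    }
    counts = {"1neigh": 0, "3+neigh": 0, **{name: 0 for name in sides}}
    for r in range(1, len(p) - 1):
        for c in range(1, len(p[0]) - 1):
            if p[r][c] != 1:
                continue
            # Names of the neighbour positions that hold a black pixel.
            black = {name for name, (dr, dc) in NEIGHBOUR_OFFSETS.items()
                     if p[r + dr][c + dc] == 1}
            counts["1neigh"] += len(black) == 1
            counts["3+neigh"] += len(black) >= 3
            for name, positions in sides.items():
                counts[name] += not (black & positions)
    return counts


def count_regions(matrix):
    """Count 8-connected regions of black pixels with a flood fill."""
    p = pad(matrix)
    seen = set()
    regions = 0
    for r in range(1, len(p) - 1):
        for c in range(1, len(p[0]) - 1):
            if p[r][c] != 1 or (r, c) in seen:
                continue
            regions += 1  # unvisited black pixel: a new region starts here
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in NEIGHBOUR_OFFSETS.values():
                    nxt = (cr + dr, cc + dc)
                    if p[nxt[0]][nxt[1]] == 1 and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return regions


# Made-up example: a vertical stroke and a separate two-pixel stroke.
demo = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
demo_counts = neighbour_features(demo)
demo_regions = count_regions(demo)
print(demo_counts["1neigh"], demo_regions)  # 4 2
```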
Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is your student number. This file will consist of 168 rows, with each row listing the comma-separated feature values for each of your 168 images. The first entry in the row will be the LABEL code, the second will be the image INDEX, and the remaining 20 entries will be the calculated features.
For example, the features for your eighth “e” image may be as follows:
25, 8, 4, 28, 14, 12, 1.1667, 8, 8, 1, 2, 4, 11, 8, 7, 12, 11, 1, 2, 1, 0.11, 22
The 8 rows that correspond to the 8 instances of a particular character should be grouped together in the features file, and the order of the 8 rows should correspond to the INDEX used in the image filenames. In other words, the 168 rows of STUDENTNR_features.csv should be sorted first by the label and secondly by the index.
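One way to obtain this ordering is to parse the LABEL and INDEX out of each image filename and sort on them as numbers. The Python sketch below is illustrative only: the helper names are hypothetical, the regular expression assumes the student number consists of digits (as in the example 123456), and the feature values passed to `format_feature_row` are made up.

```python
import re

# Matches names of the form STUDENTNR_LABEL_INDEX.csv, e.g. 123456_25_8.csv.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)_(\d+)\.csv$")


def parse_image_filename(filename):
    """Return (student_nr, label, index), with label and index as integers."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    student_nr, label, index = match.groups()
    return student_nr, int(label), int(index)


def sort_by_label_then_index(filenames):
    """Sort filenames first by LABEL, then by INDEX, both compared as numbers."""
    return sorted(filenames, key=lambda name: parse_image_filename(name)[1:])


def format_feature_row(label, index, features):
    """Join LABEL, INDEX and the feature values into one comma-separated row."""
    return ",".join(str(value) for value in (label, index, *features))


names = ["123456_25_8.csv", "123456_11_2.csv", "123456_25_1.csv", "123456_11_1.csv"]
ordered = sort_by_label_then_index(names)
print(ordered[0], ordered[-1])                  # 123456_11_1.csv 123456_25_8.csv
print(format_feature_row(25, 8, [28, 14, 12]))  # 25,8,28,14,12
```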
If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the feature values instead. (You will lose marks for not calculating the feature, but you can use the random values in the analyses that follow in the subsequent section).
In your report, briefly describe and explain the code you have written to calculate the features above. If you ran into difficulties, you should still explain your thought processes and attempts to calculate the features. In the case of features 19 and 20, you should explain your rationale for choosing the features you did, as well as how they are calculated (i.e. you should give a justification for why you think these features should be useful).
You should put the file STUDENTNR_features.csv in a folder called section2_features. Put code for this section in a folder called section2_code. Your code should use relative paths; i.e. it should read the image matrices from “../section1_images” and save the feature file to “../section2_features”.
Section 3 (12%): Statistical analyses of feature data
In this section, you will perform statistical analyses of the feature data, in order to explore which features are important for distinguishing between different kinds of symbols.
You shall use descriptive statistics (mean, variance, etc.), null hypothesis testing, and confidence intervals to perform your analysis of the data. You are encouraged to provide tables, figures, and/or graphs in the report to support your discussions and findings. When performing tests, always consider whether multiple test correction is needed.
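To illustrate what multiple test correction involves, the sketch below applies two standard procedures, Bonferroni and Holm, to a list of p-values. It is written in Python purely to show the logic (code submitted for this section must be in R, where `p.adjust` provides these corrections). The function names and the p-values are made up for the example, and choosing between these or other corrections remains a decision you must justify.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    P-values are visited from smallest to largest; the k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k), and testing stops at
    the first p-value that fails its comparison.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also not rejected
    return reject


made_up_p_values = [0.001, 0.04, 0.02, 0.012]
print(bonferroni_reject(made_up_p_values))  # [True, False, False, True]
print(holm_reject(made_up_p_values))        # [True, True, True, True]
```

Holm controls the same family-wise error rate as Bonferroni but never rejects fewer hypotheses, which is why the two outputs differ here.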
It is your responsibility to define the appropriate assumptions to run the tests, and to choose an appropriate test according to the data characteristics and the question that you are studying. You are not restricted to the hypothesis tests that were discussed in the lectures. Remember to always justify the approach that you choose to employ. You may assume a significance level of 0.05 for the analyses when running hypothesis testing.
In particular, in the report you should address each of the following subtasks, using appropriate statistical tests, tables, graphs, etc.
- Estimate the probability distribution for nr_pix for each of the three symbol groups: letter, digit, and math. Visualise the distributions. Briefly describe the shape of the distributions.
- Suppose you randomly sample a digit image from the set of digits. What is the probability that the number of pixels in the image is greater than 20?
- Present summary statistics (e.g. mean and standard deviation) about all the features, for (a) the full set of 168 items, (b) the 56 digits, (c) the 56 letters, (d) the 56 math symbols. Briefly discuss the summary statistics, and whether they already suggest which features may be useful for discriminating digits and letters. For features you feel may be interesting, consider suitable visualisations (e.g. histogram of feature values for the three groups).
- Are there pairs of features which are highly associated with each other, and thus provide little extra information with respect to having only one of them in the data? Can you discard some features from your data set without losing much information? Justify your claims.
- Are there features which are useful to discriminate between the 7 different letters? (Note that here we are looking for differences between 7 different groups, so consider a statistical method that tests for statistically significant differences between more than 2 groups).
- Are there features which are useful to discriminate between the 7 different digits? (Note that here we are looking for differences between 7 different groups, so consider a statistical method that tests for statistically significant differences between more than 2 groups).
- Are there features which are useful to discriminate between the three groups? How confident are you about your findings?
- Are there features which are useful to discriminate between the set of digits and the set of letters? (Consider statistical tests which test for differences between two groups).
- Are there features which are useful to discriminate between the digit “1” and the digit “7”? Briefly interpret your findings.
- Are there features which are useful to discriminate between the letter “b” and the letter “d”? Briefly interpret your findings.
- Are there features which are useful to discriminate between the symbol “=” and the symbol “≠”? Briefly interpret your findings.
- For each feature, find the pairs of digits (if any) that have a statistically significant difference for that feature (carefully consider multiple comparison correction).
- For each feature, find the pairs of letters (if any) that have a statistically significant difference for that feature (carefully consider multiple comparison correction).
- For each feature, find the pairs of math symbols (if any) that have a statistically significant difference for that feature (carefully consider multiple comparison correction).
- Using the best feature that you have discovered for discriminating between digits and letters, find a suitable threshold value for that feature to predict digit/letter in the best possible way (such that values to one side of that threshold are associated with digits and values to the other side of the threshold are associated with letters). How accurate is this simple threshold-based classifier?
- Using combinations of simple thresholds for certain features, how accurately can you discriminate between the three groups?
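The single-threshold classifier in the subtasks above can be sketched as an exhaustive search over candidate cut points. The Python below only illustrates the logic (code submitted for this section must be in R). The function name `best_threshold` and the feature values are made up, and the accuracy it reports is measured on the same data used to choose the threshold, so it is likely to be optimistic.

```python
def best_threshold(values_first, values_second):
    """Return (threshold, group_below, accuracy) for the best single cut.

    Candidate thresholds are midpoints between consecutive distinct values,
    so (for well-separated values) no observation sits exactly on a threshold.
    `group_below` is "first" or "second": the group predicted for values
    below the threshold; values above it are predicted as the other group.
    """
    distinct = sorted(set(values_first) | set(values_second))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct feature values")
    total = len(values_first) + len(values_second)
    best = None
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        # Correct predictions when "first" is assigned to values below t.
        correct = (sum(v < t for v in values_first)
                   + sum(v > t for v in values_second))
        # The opposite rule gets exactly the remaining cases right.
        for group_below, n_correct in (("first", correct),
                                       ("second", total - correct)):
            accuracy = n_correct / total
            if best is None or accuracy > best[2]:
                best = (t, group_below, accuracy)
    return best


# Made-up feature values for two groups of four images each.
made_up_digits = [10, 12, 15, 16]
made_up_letters = [14, 22, 25, 30]
result = best_threshold(made_up_digits, made_up_letters)
print(result)  # (19.0, 'first', 0.875)
```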
For all questions above, you shall explain your reasoning, assumptions and steps of the procedure (including the statistical analysis) when preparing the report. Use statistics to justify your reasoning. If you are generating p-values for analysing the statistical significance of some features, make sure to explain how they were obtained. It is your task to decide and justify what the most appropriate inference to be performed in each case is, and to discuss the results you obtained.
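As one example of a p-value whose derivation is easy to explain, the sketch below runs a two-sided permutation test on the difference in group means, seeded with 3060 so that the result is repeatable. It is written in Python only to show the logic (code submitted for this section must be in R), and the permutation test is merely one possible choice, not a method prescribed by this assignment. The function name, the default number of permutations, and the data are assumptions made for the illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=3060):
    """Two-sided permutation test for a difference in means.

    The group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    observed one. Adding 1 to numerator and denominator counts the observed
    arrangement itself, so the p-value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)


# Made-up values: two clearly separated groups of eight observations.
made_up_a = [1, 2, 3, 4, 5, 6, 7, 8]
made_up_b = [21, 22, 23, 24, 25, 26, 27, 28]
p_separated = permutation_p_value(made_up_a, made_up_b)
p_repeat = permutation_p_value(made_up_a, made_up_b)
print(p_separated == p_repeat)  # True: the fixed seed makes the test repeatable
print(p_separated < 0.05)       # True for these clearly separated groups
```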
Put code for this section in a folder called section3_code. Your code should use relative paths; i.e. it should read the feature data from “../section2_features”. It should be straightforward for the assessor to rerun your code to produce the same results as presented in your report. Ensure that the different subsections (1 to 16) are clearly labelled in the source code, and in the report.
Assessment criteria and marking process
The most important criterion in marking is the completeness, accuracy, quality and clarity of your report (approximately 70% weighting). In your report, you should clearly demonstrate that you understand the methods used in each sub-task. Explain your reasoning, assumptions and steps of the procedures used. You should explain and interpret your results, demonstrating understanding and independent thinking. What are your results telling you? Are the results what you would expect? If you ran into difficulties, explain what they were and the efforts you made to try to overcome them.
Code has a weighting in marking of approximately 20%. Your code should be clear and logically organised, and do what is required, but code efficiency and code sophistication are not important (this assignment does not require complex programming). Where appropriate, use loops and variables rather than hard-coded values. If you use freely licensed code, packages, or libraries (which is encouraged), these should be appropriately referenced (e.g. by citing a URL in a comment). For example, using StackOverflow code snippets is fine, provided you acknowledge the use, provide the URL to the code snippets in the comments, and follow the MIT licence. The code must be easy to use, and the comments must include information about the required steps to replicate the results that you have obtained and are presenting in your report (transparency and replicability are essential in data analysis).
Attention to detail and following the assignment instructions accurately will also be considered in marking (approximately 10% weighting). Each sub-task has a precise specification. Make sure you carefully follow the instructions, and use the features specified for each task, the specified procedures (seed value, data file specifications and file names, directory structure and names, etc). Make sure you upload your deliverable files in the specified formats.
Deliverables
You must submit your assignment online, using the QOL webpage of the AIDA module, by 5pm Friday, 16th November 2018.
The online uploaded file must be a ZIP file called Assignment1_STUDENTNR.zip, containing multiple files and directories. The contents of the zip file are specified below (bold text indicates folder names):
STUDENTNR_assignment1_report.pdf
section1_code
o [your code files]
section1_images
o [168 .csv files with the following naming format: STUDENTNR_LABEL_INDEX.csv]
section2_code
o [your code files]
section2_features
o STUDENTNR_features.csv
section3_code
o [your code files]
A RAR file is not a ZIP file. A broken or corrupt ZIP file is not a ZIP file.
It is your responsibility to ensure the assignment is uploaded and double-checked before the deadline. As a backup, it is recommended that a paper copy of the report is also submitted at the CSB General Office (c/o Brian Fleming) on the day of the deadline, although submitting a paper copy is not compulsory.
Please use the provided report template for preparing your report (or create an equivalent LaTeX format). Ensure that the header and footer information (student name, student number) is clearly visible on the printout. The word limit for the report is 4000 words (excluding tables and figures).