7022DATSCI Big Data Analysis
7022DATSCI—Mini-projects Master of Sensors Data and Management Big Data Analysis
Instructions:
- You should work on the mini-projects in groups of up to 5 students.
- Use electronic communication for organising your group work. Support will be provided online via
- Together with your group, prepare a PowerPoint presentation of your project with 10 minutes of recorded audio. All group members will receive the same mark for the presentation.
- In addition, everyone should hand in a one-page summary of the project.
- In the week after the deadline (week commencing Monday, 27th April 2020) each group should meet with me via videoconference and explain the code that they have produced for the project (“code walkthrough”). Each student will receive an individual mark for the code demonstration.
Project: Big Data Analysis
The aim of the Big Data Analysis project is to apply a machine learning method in a practical setting. In each of the following projects you are asked to...
- Work on a practical machine learning project.
- Present your work in a presentation.
You will work on your projects in groups of 3-5 students. The following list contains suggestions for project topics. Additional topics might become available and you can also suggest alternative topics:
- “3, 6, 8, 9?”—recognising hand-written digits with principal component analysis
Apply principal component analysis to recognise handwritten digits in the MNIST data set (http://yann.lecun.com/exdb/mnist/), as explained in (Lu, 2017) but without the pre-processing using Histograms of Oriented Gradients (HOG).
- Googling food webs—the PageRank of extinction
Implement the variant of the PageRank algorithm described in (Allesina and Pascual, 2009) and reproduce the study for some of the food webs from this article. Note that some of the food webs are available in R by installing the cheddar library.
- MCMC for code cracking
A highly original application of Markov chain Monte Carlo (MCMC) was presented by (Diaconis, 2009) and extended by (Chen and Rosenthal, 2012). Implement and test the approach by reproducing the example described in (Diaconis, 2009).
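For orientation, the general idea behind the MCMC topic can be sketched as follows. This is a minimal illustration, not a reproduction of the example in (Diaconis, 2009): it assumes a simple substitution cipher over lowercase letters and the space character, scores a candidate key by bigram statistics estimated from a reference text of your choice, and uses a Metropolis sampler that proposes swapping two entries of the key. All function names, the alphabet, the smoothing constant and the number of steps are hypothetical choices made for this sketch.

```python
import math
import random
import string
from collections import Counter

# Assumed alphabet for this sketch: lowercase letters plus space.
ALPHABET = string.ascii_lowercase + " "


def bigram_log_probs(reference_text, smoothing=1.0):
    """Estimate log P(next symbol | current symbol) from a reference text."""
    text = [c for c in reference_text.lower() if c in ALPHABET]
    counts = Counter(zip(text, text[1:]))
    log_probs = {}
    for a in ALPHABET:
        # Additive smoothing keeps unseen bigrams from getting probability 0.
        row_total = sum(counts[(a, b)] for b in ALPHABET) + smoothing * len(ALPHABET)
        for b in ALPHABET:
            log_probs[(a, b)] = math.log((counts[(a, b)] + smoothing) / row_total)
    return log_probs


def apply_key(text, key):
    """Map every symbol of text through the substitution key (a dict)."""
    return "".join(key[c] for c in text)


def log_plausibility(text, log_probs):
    """Sum of log bigram probabilities along the text."""
    return sum(log_probs[pair] for pair in zip(text, text[1:]))


def metropolis_decrypt(ciphertext, log_probs, n_steps=5000, seed=0):
    """Search over substitution keys; ciphertext must only use ALPHABET symbols."""
    rng = random.Random(seed)
    symbols = list(ALPHABET)
    shuffled = symbols[:]
    rng.shuffle(shuffled)
    key = dict(zip(symbols, shuffled))  # random initial guess
    score = log_plausibility(apply_key(ciphertext, key), log_probs)
    best_key, best_score = dict(key), score
    for _ in range(n_steps):
        # Proposal: swap the plaintext symbols assigned to two cipher symbols.
        a, b = rng.sample(symbols, 2)
        proposal = dict(key)
        proposal[a], proposal[b] = key[b], key[a]
        new_score = log_plausibility(apply_key(ciphertext, proposal), log_probs)
        # Metropolis rule: accept improvements, otherwise accept with
        # probability exp(new_score - score).
        if new_score >= score or rng.random() < math.exp(new_score - score):
            key, score = proposal, new_score
            if score > best_score:
                best_key, best_score = dict(key), score
    return best_key, best_score
```

To try it, estimate `log_probs` from a long English text, encrypt a short message with a random key, and inspect `apply_key(ciphertext, best_key)`. With a short reference text or few steps the result may remain partly garbled, so treat the settings above as starting points only.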
References
Allesina, S., Pascual, M., 2009. Googling food webs: Can an eigenvector measure species’ importance for coextinctions? PLOS Computational Biology 5 (9), 1–6.
URL https://doi.org/10.1371/journal.pcbi.1000494
Chen, J., Rosenthal, J., 2012. Decrypting classical cipher text using Markov chain Monte Carlo. Statistics and Computing 22, 397–413.
URL https://doi.org/10.1007/s11222-011-9232-5
Diaconis, P., 2009. The Markov Chain Monte Carlo Revolution. Bulletin of the American Mathematical Society 46 (2), 179–205.
Lu, W., 2017. Handwritten digits recognition using PCA of histogram of oriented gradient. In: 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM). pp. 1–5.
What you should hand in
- Each group: A PowerPoint presentation with 10 minutes of recorded audio (25%).
- Every student: A one-page summary of your mini-project (25%).
- Every student: A text file containing your commented source code (50%).
Important! All group members will receive the same mark for the PowerPoint presentation. The one-page summary and the code demonstration will be marked individually.
| Presentation/One-page summary | Partial mark |
| --- | --- |
| Introduction: Brief description of your application. Motivation: Which challenge are you going to address? | 5% |
| Implementation: What are the challenges of implementing the algorithm? Explain how you implemented the method. | 15% |
| Results: What have you found out about your data set? Show how your machine learning method addresses the challenge described in the Introduction. | 10% |
| Discussion: Brief summary of the analysis of the data. Critically reflect on how well the challenge described in the Introduction was solved by your machine learning approach. | 10% |
| Formal marks: Visual presentation, delivery of the talk, time keeping | 10% |
| Total | 50% |
| Source code (submitted to Canvas and demonstration) | Partial mark |
| --- | --- |
| Completeness of the implementation | 20% |
| Demonstration | 10% |
| Clarity of the code | 10% |
| Quality of comments | 10% |
| Total | 50% |