---
tags: BigData-BS-2019
title: MapReduce. Assignment 1. Simple Text Indexer
---
# Assignment №1. MapReduce. Simple Text Indexer
**Due**: September 23, 2020
**Teams**: groups of 2 people
**Submission Format**: report in PDF, link to git repository, compiled binary
:::success
You are highly encouraged to read a [short guide](https://hackmd.io/@9NHMbs3cSOmGDKDUbhIviQ/H1PMURS37) on how to prepare a report. It is not mandatory... but... don't say we didn't warn you...
:::
In this assignment, you will practice on the university's Hadoop cluster. For this task, you are required to split into teams of 2 people. Please form teams on your own. Team representatives should then report to your TA to have a user created on the cluster.
We encourage you to start working as soon as possible. During the first days you will have all the cluster resources to yourself, but on the last day before the deadline many tasks may be running on the cluster simultaneously, and your submissions could run much slower. Keep in mind that running a job takes time, so it is your responsibility to finish the assignment before the deadline.
Create a private git repository for this assignment. Individual commits are graded as well as overall team performance.
The assignment requires you to prepare a report. It should include a description of your solution and your results, together with an analysis of those results. The report should also state the contribution of each team member. The final report should be submitted in PDF format. The exact way you prepare the report does not play a significant role; however, keep in mind that you can use git to store the report in LaTeX or Markdown format.
[toc]
## Search Engine
In the last lab assignment, you familiarized yourself with writing programs in Hadoop's MapReduce. In this assignment, you are going to further develop your skills in writing MapReduce tasks in Java.
The goal this time is to implement a naive search engine. Conceptually, search engines were among the first systems to tackle the problem of big data under the constraint of low-latency responses. Imagine an average search engine that has millions of documents in its index. Every second it receives hundreds to thousands of queries and must produce a list of the most relevant documents at sub-millisecond speed.
The problem of finding relevant information is one of the key problems in the field of Information Retrieval, which can be subdivided into two related tasks:
- Document indexing
- Query answering
While the second task is more related to low latency processing, the indexing step can be done offline and is more relevant to the idea of batch processing. Hadoop's MapReduce can be used to create the index of a text corpus that is too large to fit on one machine. For the sake of the assignment, you are going to do a naive implementation of both tasks using the MapReduce paradigm.
The diagram for such a search engine is shown in the figure below.
<p align="center">
<img src="https://www.dropbox.com/s/hkxjcimcx0g4lv0/Search%20Engine%20Flow.png?dl=1" width="250" />
</p>
Before going into the details of MapReduce programming, let us review the basic approach to information retrieval.
## Basics of Information Retrieval for Text
The most common task in IR is textual information retrieval. Whenever a user submits a query, the ranking engine should compute the set of the most relevant documents in its collection. To complete this task, the engineer should determine the representation format for both documents and queries, and define a measure of relevance of a query for a particular document. One of the simplest IR models is the TF/IDF vector space model.
### Vector Space Model for Information Retrieval
To facilitate the understanding of the vector space model, let's define a toy corpus that consists of five documents
| Doc Id | Content |
| -------- | -------- |
| 1 | I wonder how many miles I’ve fallen by this time? |
| 2 | According to the latest census, the population of Moscow is more than two million. |
| 3 | It was a warm, bright day at the end of August. |
| 4 | To be, or not to be? |
| 5 | The population, the population, the population |
To define the vector space model, we need to introduce three concepts: vocabulary, term frequency (TF), and inverse document frequency (IDF).
#### Term
A term is a unique word.
#### Vocabulary
Vocabulary is the set of unique terms present in the corpus. For the example above, the vocabulary can be defined as
```
{'a', 'according', 'at', 'august', 'bright', 'by', 'census', 'day', 'end', 'fallen', 'how', 'i', 'is', 'it', 'i’ve', 'latest', 'many', 'miles', 'million', 'more', 'moscow', 'of', 'population', 'than', 'the', 'this', 'time', 'to', 'two', 'warm', 'was', 'wonder', 'or', 'not', 'be'}
```
For ease of further description, we associate each term in the vocabulary with a unique id
```
(0, 'a'), (1, 'according'), (2, 'at'), (3, 'august'), (4, 'bright'), (5, 'by'), (6, 'census'), (7, 'day'), (8, 'end'), (9, 'fallen'), (10, 'how'), (11, 'i'), (12, 'is'), (13, 'it'), (14, 'i’ve'), (15, 'latest'), (16, 'many'), (17, 'miles'), (18, 'million'), (19, 'more'), (20, 'moscow'), (21, 'of'), (22, 'population'), (23, 'than'), (24, 'the'), (25, 'this'), (26, 'time'), (27, 'to'), (28, 'two'), (29, 'warm'), (30, 'was'), (31, 'wonder'), (32, 'or'), (33, 'not'), (34, 'be')
```
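As a concrete (non-MapReduce) illustration, such a vocabulary with consecutive term ids can be built in a few lines of plain Java. The class name and the deliberately naive tokenizer below are our own sketch, not part of the required solution:

```java
import java.util.*;

public class Vocabulary {
    // Split a document into lowercase terms, dropping punctuation
    // (a deliberately naive tokenizer, for illustration only).
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z’']+")) {
            if (!tok.isEmpty()) terms.add(tok);
        }
        return terms;
    }

    // Assign consecutive ids to the sorted set of unique terms.
    public static Map<String, Integer> build(List<String> docs) {
        SortedSet<String> unique = new TreeSet<>();
        for (String doc : docs) unique.addAll(tokenize(doc));
        Map<String, Integer> ids = new LinkedHashMap<>();
        int id = 0;
        for (String term : unique) ids.put(term, id++);
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = build(Arrays.asList(
                "To be, or not to be?",
                "The population, the population, the population"));
        System.out.println(vocab); // {be=0, not=1, or=2, population=3, the=4, to=5}
    }
}
```

A production tokenizer would handle digits, Unicode, and stemming; for this assignment, any reasonable tokenization is acceptable.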
#### Term Frequency
Term Frequency (TF) is the frequency of occurrence of a term $t$ in a document $d$. The documents above can be represented with TF as follows, using the format `(id, frequency)` for each term
| Doc Id | Content |
| -------- | -------- |
| 1 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0), (9, 1), (10, 1), (11, 1), (12, 0), (13, 0), (14, 1), (15, 0), (16, 1), (17, 1), (18, 0), (19, 0), (20, 0), (21, 0), (22, 0), (23, 0), (24, 0), (25, 1), (26, 1), (27, 0), (28, 0), (29, 0), (30, 0), (31, 1), (32, 0), (33, 0), (34, 0)` |
| 2 | `(0, 0), (1, 1), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 1), (13, 0), (14, 0), (15, 1), (16, 0), (17, 0), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 0), (26, 0), (27, 1), (28, 1), (29, 0), (30, 0), (31, 0), (32, 0), (33, 0), (34, 0)` |
| 3 | `(0, 1), (1, 0), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0), (7, 1), (8, 1), (9, 0), (10, 0), (11, 0), (12, 0), (13, 1), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 1), (22, 0), (23, 0), (24, 1), (25, 0), (26, 0), (27, 0), (28, 0), (29, 1), (30, 1), (31, 0), (32, 0), (33, 0), (34, 0)` |
| 4 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 0), (22, 0), (23, 0), (24, 0), (25, 0), (26, 0), (27, 2), (28, 0), (29, 0), (30, 0), (31, 0), (32, 1), (33, 1), (34, 2)` |
| 5 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 0), (22, 3), (23, 0), (24, 3), (25, 0), (26, 0), (27, 0), (28, 0), (29, 0), (30, 0), (31, 0), (32, 0), (33, 0), (34, 0)` |
As you can see, this representation format is very sparse, and we can save space by storing it as a sparse representation (i.e., removing all zero entries)
| Doc Id | Content |
| -------- | -------- |
| 1 | `(5, 1), (9, 1), (10, 1), (11, 1), (14, 1), (16, 1), (17, 1), (25, 1), (26, 1)` |
| 2 | `(1, 1), (6, 1), (12, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (27, 1), (28, 1)` |
| 3 | `(0, 1), (2, 1), (3, 1), (4, 1), (7, 1), (8, 1), (13, 1), (21, 1), (24, 1), (29, 1), (30, 1)` |
| 4 | `(27, 2), (32, 1), (33, 1), (34, 2)` |
| 5 | `(22, 3), (24, 3)` |
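A sparse TF representation maps term ids to counts and is naturally expressed as a `Map`. A minimal plain-Java sketch (the class name and naive tokenization are our own):

```java
import java.util.*;

public class SparseTf {
    // Count term occurrences in a document, keyed by vocabulary id.
    // Terms missing from the vocabulary are simply skipped here
    // (an illustrative choice, not a requirement).
    public static Map<Integer, Integer> termFrequencies(
            String doc, Map<String, Integer> vocab) {
        Map<Integer, Integer> tf = new HashMap<>();
        for (String tok : doc.toLowerCase().split("[^a-z’']+")) {
            Integer id = vocab.get(tok);
            if (id != null) tf.merge(id, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = new HashMap<>();
        vocab.put("the", 24);
        vocab.put("population", 22);
        System.out.println(termFrequencies(
                "The population, the population, the population", vocab));
        // both ids map to a count of 3, matching document 5 above
    }
}
```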
#### Inverse Document Frequency
IDF shows in how many documents a term has occurred. The measure signifies how common a particular term is: when this value is high, the presence of the term in a document does not help us distinguish between documents. The IDF values for our corpus are
```
(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 2), (22, 2), (23, 1), (24, 3), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)
```
You can see that the term `be` occurred twice in document `4`, but its IDF is only `1`. The term `the` occurred in documents `2`, `3`, and `5`; hence its IDF is `3`.
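Counting these per-term document frequencies is straightforward once each document is reduced to its set of distinct term ids. A plain-Java sketch (outside MapReduce; names are our own):

```java
import java.util.*;

public class DocumentFrequency {
    // For each term id, count the number of documents containing it.
    // Each document is represented by its set of distinct term ids,
    // so a term counts at most once per document.
    public static Map<Integer, Integer> compute(List<? extends Set<Integer>> docs) {
        Map<Integer, Integer> df = new HashMap<>();
        for (Set<Integer> doc : docs) {
            for (int term : doc) df.merge(term, 1, Integer::sum);
        }
        return df;
    }

    public static void main(String[] args) {
        // Term 24 ("the") appears in three documents, term 34 ("be") in one.
        List<Set<Integer>> docs = Arrays.asList(
                new HashSet<>(Arrays.asList(22, 24)),
                new HashSet<>(Arrays.asList(21, 24)),
                new HashSet<>(Arrays.asList(24)),
                new HashSet<>(Arrays.asList(34)));
        System.out.println(compute(docs).get(24)); // 3
        System.out.println(compute(docs).get(34)); // 1
    }
}
```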
Sometimes IDF is defined as
$$
IDF(t) = \log \frac{N}{n(t)}
$$
where $N$ is the overall number of documents in the collection, and $n(t)$ is the number of documents containing the term $t$.
Other definitions are also possible. You are free to use any of them.
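The logarithmic variant, for instance, is a one-liner; the helper class below is our own illustration:

```java
public class Idf {
    // Logarithmic IDF: N documents in total, n(t) of them containing term t.
    public static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // "the" occurs in 3 of the 5 toy documents.
        System.out.println(idf(5, 3)); // ≈ 0.51
    }
}
```

Note that a term occurring in every document gets $\log 1 = 0$, i.e., it contributes nothing to relevance under this definition.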
#### TF/IDF Weights
TF/IDF weights are nothing more than term frequencies divided by the corresponding IDF values.
| Doc Id | Content |
| -------- | -------- |
| 1 | `(5, 1), (9, 1), (10, 1), (11, 1), (14, 1), (16, 1), (17, 1), (25, 1), (26, 1)` |
| 2 | `(1, 1), (6, 1), (12, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 0.5), (22, 0.5), (23, 1), (24, 0.66), (27, 0.5), (28, 1)` |
| 3 | `(0, 1), (2, 1), (3, 1), (4, 1), (7, 1), (8, 1), (13, 1), (21, 0.5), (24, 0.33), (29, 1), (30, 1)` |
| 4 | `(27, 1), (32, 1), (33, 1), (34, 2)` |
| 5 | `(22, 1.5), (24, 1)` |
Here you can notice that terms with ids `{21, 22, 24, 27}` (`of`, `population`, `the`, `to`) are downscaled.
#### Basic Vector Space Model
In the basic vector space model, both documents and queries are represented with corresponding vectors, which capture TF/IDF weights of a document and the query.
The simplest way to convert TF/IDF weights into a vector interpretable by a computer is to index an array by term ids and record the TF/IDF values. In this case, the vector for document `5` will be
```
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
The query `the population` will result in a vector
```
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0.33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
The function that determines the relevance of a document to the query is the inner product (scalar product) of the two vectors
$$
r(q,d) = \sum_{i=1}^{|V|} q_i \cdot d_i
$$
where $|V|$ is the size of the vocabulary, and $q_i$ is the TF/IDF weight of the $i^{th}$ term in the query.
As you can see, the naive vector space representation with arrays is very sparse. Alternatively, you can create a sparse vector representation. The difference from a regular vector is that a sparse vector stores only the nonzero values. Such a vector can be implemented with a dictionary (a `Map`, in Java terms).
Sparse vector representation for the document `5` is
```
[(22: 1.5), (24: 1)]
```
and for the query
```
[(22: 0.5), (24: 0.33)]
```
The relevance function is easily reformulated as
$$
r(q,d) = \sum_{i: i\in d, i\in q} q_i \cdot d_i
$$
where the summation is performed over all the terms that are present in both the query and the document. From the implementation standpoint, the dictionary structure is more appealing because the keys are unique and form sets. The common terms of a document and a query can then be found by set intersection (which is usually very fast).
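The sparse inner product described above can be sketched in plain Java as follows (names are our own); iterating over the smaller map keeps the intersection cheap:

```java
import java.util.*;

public class SparseRelevance {
    // Inner product of two sparse vectors: iterate over the smaller map
    // and look each key up in the larger one, skipping absent terms.
    public static double dot(Map<Integer, Double> q, Map<Integer, Double> d) {
        Map<Integer, Double> small = q.size() <= d.size() ? q : d;
        Map<Integer, Double> large = (small == q) ? d : q;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) sum += e.getValue() * other;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> query = new HashMap<>();
        query.put(22, 0.5);   // population
        query.put(24, 0.33);  // the
        Map<Integer, Double> doc5 = new HashMap<>();
        doc5.put(22, 1.5);
        doc5.put(24, 1.0);
        System.out.println(dot(query, doc5)); // ≈ 1.08, as in the table below
    }
}
```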
If we compute the relevance of the query `the population` for our toy corpus, the result would be
```
doc 5: 1.08
doc 2: 0.4678
doc 3: 0.1089
doc 4: 0
doc 1: 0
```
As you can see, even though document `5` arguably contains less information, it still ranks first according to our relevance score.
##### Room for speedup
The indexing and retrieval steps are slowed down partly by storing documents as arrays and by mapping words to ids through a `Map` data structure. This can be partially alleviated if we allow some probability of error. Instead of creating an id for every word in a preprocessing step and continuously reusing this mapping, we can generate an id on the fly using a hash function, completely eliminating the word enumeration step. Depending on the allowed hash range, we can reduce the chance of collision significantly. The downside (or benefit) is added complexity in the retrieval engine, since we must then implement the vector space model strictly with sparse data structures.
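The hashing idea can be sketched as follows; the hash range of `1 << 20` is an arbitrary choice for illustration, and the class name is ours:

```java
public class HashedId {
    // Map a term directly to an id via hashing; no vocabulary pass needed.
    // A larger RANGE lowers the collision probability.
    static final int RANGE = 1 << 20;

    public static int termId(String term) {
        // Math.floorMod keeps the id non-negative even when hashCode() is negative.
        return Math.floorMod(term.hashCode(), RANGE);
    }

    public static void main(String[] args) {
        System.out.println(termId("population"));
        System.out.println(termId("population") == termId("population")); // true
    }
}
```

Two distinct words may share an id, which is the "probability of mistake" mentioned above; with a range of about a million ids and a vocabulary of tens of thousands of terms, collisions are rare.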
### BM25
[Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is widely accepted as a much better ranking function for ad-hoc information retrieval (where no relevance data is available). The formula looks like this
$$
r(q,d) = \sum_{i: i\in d, i\in q} IDF(i) \cdot \frac{d_i \cdot (k_1 + 1)}{d_i + k_1 \cdot (1 - b + b \cdot \frac{|d|}{avgdl})}
$$
For the sake of simplicity, we can assume the same definition of IDF as above; $d_i$ is the term frequency, $avgdl$ is the average document length in the corpus, $|d|$ is the length of the current document, and $b$ and $k_1$ are free parameters (usually set to 0.75 and 2.0, respectively).
This ranking function solves several problems with the vector space model approach, including the bias towards longer documents.
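A direct translation of the formula into plain Java might look like the sketch below; the class name, method signature, and the example numbers in `main` are our own illustration:

```java
import java.util.*;

public class Bm25 {
    static final double K1 = 2.0, B = 0.75;

    // Score one document against a query: for every query term present in
    // the document, add IDF(term) times the saturated, length-normalized TF.
    public static double score(Map<Integer, Integer> queryTf,
                               Map<Integer, Integer> docTf,
                               Map<Integer, Double> idf,
                               int docLen, double avgDocLen) {
        double score = 0.0;
        for (int term : queryTf.keySet()) {
            Integer tf = docTf.get(term);
            if (tf == null) continue; // term absent from the document
            double norm = tf + K1 * (1 - B + B * docLen / avgDocLen);
            score += idf.getOrDefault(term, 0.0) * tf * (K1 + 1) / norm;
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: one query term with tf 3 and idf 1.0,
        // in a document of length 6 when the average length is 9.
        System.out.println(score(
                Collections.singletonMap(22, 1),
                Collections.singletonMap(22, 3),
                Collections.singletonMap(22, 1.0),
                6, 9.0)); // prints 2.0
    }
}
```

Note how the $k_1$ term saturates high frequencies and the $b$ term penalizes documents longer than average.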
### Mean Average Precision
[MAP](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) is a metric widely applied for measuring ranking quality. The notion of relevance is often used to estimate the quality of the ranker. If a document is useful for answering the query, it is called relevant. If it has no meaningful connection to the query, the document is not relevant.
The search engine returns a list of documents ordered by descending relevance. We want the top of the list to contain all of the relevant documents for our query.
Assume that your ranker retrieves a list of documents of length $N_l$. MAP is the average of a metric called Average Precision (AP) over some set of queries $Q$
$$
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)
$$
The Average Precision is calculated as
$$
AP = \frac{1}{N_{rel}}\sum_{k=1}^{N_{l}} P(k) \cdot \text{rel}(k)
$$
where $N_{rel}$ is the number of relevant documents returned by the ranker, $P(k)$ is the precision calculated over positions 1 to $k$, and $\text{rel}(k)$ is an indicator function equal to 1 when the document at position $k$ is relevant, and 0 otherwise.
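Average Precision for a single query can be computed directly from the list of relevance indicators; a plain-Java sketch (names are our own):

```java
public class AveragePrecision {
    // rel[k] is true when the document at rank k+1 is relevant.
    public static double ap(boolean[] rel) {
        int totalRelevant = 0;
        for (boolean r : rel) if (r) totalRelevant++;
        if (totalRelevant == 0) return 0.0;
        int relevantSoFar = 0;
        double sum = 0.0;
        for (int k = 0; k < rel.length; k++) {
            if (rel[k]) {
                relevantSoFar++;
                sum += (double) relevantSoFar / (k + 1); // P(k) at a relevant rank
            }
        }
        return sum / totalRelevant;
    }

    public static void main(String[] args) {
        // Relevant documents at ranks 1 and 3:
        // AP = (1/1 + 2/3) / 2 = 5/6 ≈ 0.8333
        System.out.println(ap(new boolean[]{true, false, true, false}));
    }
}
```

MAP is then simply the mean of `ap` over all evaluated queries.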
## Vector Space Model with MapReduce
One possible way to implement a naive search engine is shown in the diagram below.
<p align="center">
<img src="https://www.dropbox.com/s/wo2b6amni66djzj/Search%20Engine%20with%20MapReduce.png?dl=1" width="300" />
</p>
Given the collection of documents, we first submit it to the indexing engine. The indexing engine runs several tasks to process the entire corpus.
1. Word Enumerator scans the corpus and creates a set of unique words. After that, it assigns a unique id to each word. This task can be implemented with MapReduce.
2. Document Count does a similar task, but instead of creating a set of unique terms it aims to count the IDF for each term, or simply the number of documents where each particular term appeared. This task can be implemented with MapReduce.
3. Vocabulary is an abstract concept and does not actually require computing anything. It simply aggregates the results of Word Enumerator and Document Count into one data structure.
4. The Indexer has to compute a machine-readable representation of the whole document corpus. For each document, the Indexer creates a TF/IDF representation and stores a tuple of `(doc id, vector representation)`. Since each document is processed independently, this can be implemented with MapReduce.
After the index is created, it can be reused multiple times. The Ranking Engine has to create a vectorized representation of the query and perform the relevance analysis
1. The Query Vectorizer creates the vectorized representation of a query. It can be implemented as part of the Relevance Analyzer.
2. The Relevance Analyzer computes the relevance function between the query and each document. This task can be implemented with MapReduce (the performance depends on available hardware).
3. The index stores the document id and the vectorized representation. The Ranker provides the list of ids, sorted by relevance. The Content Extractor should extract the relevant content from the text corpus based on the provided relevant ids.
The Query Response contains the list of relevant content (for example links with titles).
***The model presented above shows one possible way to implement the system in a way that can be mapped into the MapReduce paradigm. You are not constrained to following this architecture, and the only limitation is that the whole system should be implemented with Hadoop's MapReduce.***
## Dataset Description
For the sake of the exercise, we are going to work with an English Wikipedia dump ([freely available](https://dumps.wikimedia.org/backup-index.html)). The dump is extracted into JSON format:
```json
{"id": "0", "url": "https://en.wikipedia.org/wiki?curid=0", "title": "Sample Title", "text": "Sample Text"}
```
Later you will have access to a Hadoop cluster with HDFS. The Wikipedia dump resides at `/EnWikiSmall` on HDFS. The folder `EnWikiSmall` contains many files; each file contains multiple JSON strings, each string corresponding to one Wikipedia article.
In your Search Engine Implementation, you will need to work with a subset of the Wikipedia dump and provide it as an input to several MapReduce jobs.
You can use external JSON libraries for parsing the input. The classes `org.json.JSONObject` and `org.json.JSONException` are already available on the cluster.
## Developing
The first thing you should do is study [Apache's MapReduce guide](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). It describes MapReduce capabilities and provides configuration examples. Pay attention to the [WordCount 2.0](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0) example, as it provides useful insight into how to structure your MapReduce classes.
### Organizing the project
The development process is done in teams. In the ideal world, before starting the project the team meets and discusses the details of their project, which modules they are going to implement, and who will be responsible for each particular one. Although there are no strict requirements on how you are going to work in teams, we encourage you to use the best practices you have learned so far. The only requirement that we have for you is to create a private git repository and do gradual commits as you progress. Please do not upload your work at the last moment. Use the benefits of version control and collaborative development.
### Testing
Depending on the size of the dataset, the processing can take a significant amount of time. To accelerate the development process, it is recommended to perform local testing on a small subset. You can download a small subset from the HDFS cluster using `hdfs dfs -get`.
### Libraries
Hadoop imposes some limitations on data types. The types used as output and input values in mapper and reducer should implement [`writable` interface](https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/io/Writable.html). For the current task, you can write your own `writable` types, you can use any of the standard ones, and you can resort to string serialization if you think it is the answer for your problem. You can use libraries for JSON deserialization, hashing, data structures. The algorithmic part of the search engine should be original. Your code will be checked for plagiarism, and yes, our plagiarism detection system can easily detect renamed variables and permutations between lines.
### Passing arguments to a MapReduce task
Sometimes you need to pass additional arguments to your mapper or reducer. One of the simplest ways to do this is to pass the argument as a configuration string for the current job. During job configuration, write
```java
Configuration conf = new Configuration();
conf.set("test", "test value");
Job job = Job.getInstance(conf);
```
and call
```java
Configuration conf = context.getConfiguration();
String param = conf.get("test");
```
inside your mapper or reducer. The main limitation is that your object must be serializable as a string. Feel free to use other approaches to passing arguments with Hadoop.
Alternatively, you can use the `DistributedCache`. Placing a file in the cache is as simple as
```java
job.addCacheFile(uri); // instance method of Job
```
and accessing the files is as simple as
```java
URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
```
The `DistributedCache` copies the files to each of the nodes before the job starts; it is assumed that the files are already present and are for reading purposes only. For more information, refer to the [Apache manual](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#DistributedCache).
### Reading files from HDFS
Sometimes you might need to read files directly from HDFS. Luckily, it is easily accomplished with Hadoop libraries. The steps are simple
1. Create a `FileSystem` object from a `Configuration`
2. Open an `InputStream` for the file

```java
FileSystem fs = FileSystem.get(configuration);
InputStream is = fs.open(path);
```

`FileSystem` supports many additional methods, such as listing directories and checking whether a file exists. For more information, refer to the [class description](http://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/fs/FileSystem.html).
### Working with Custom Types
Hadoop provides many [default types](https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/io/Writable.html). The current task can be implemented using only a tiny subset of them. If you decide to create your own custom types, you should also consider creating custom [`RecordReader`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/RecordReader.html) and [`RecordWriter`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/RecordWriter.html) classes. Many tutorials describe the ins and outs of this process available online ([for example](https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-recordreader-example/)). More substantial information about this topic can be found in the book *Hadoop The Definitive Guide*, chapter *II.8*, paragraphs *Input Formats*, *Output Formats*.
### Custom Comparators
For some tasks, namely ranking, you will need to set a custom comparator class that sorts your output in descending order. For this purpose, you can create a custom comparator that extends [`WritableComparator`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparator.html).
The custom comparator can be specified during job configuration
```java
job.setSortComparatorClass(ReverseComparator.class);
```
### Number of Reducers
Be aware that you can set the number of reduce tasks with
```java
job.setNumReduceTasks(10); // instance method of Job; 10 is just an example
```
This might be important when your input data scales up.
### Memory Consumption
With the current cluster configuration, your mapper and reducer should not consume more than `728 MB` of RAM. Otherwise, the container will be killed.
## Expected Outcome
### Using Your Search Application
Your program should provide convenient interfaces to your MapReduce jobs. Hadoop programs are executed in the following way
```bash
hadoop jar /path/to/jar ExecutedClass arg1 arg2 arg3
```
You should provide a way to run the indexer on a set of inputs specified by the arguments, and a way to execute a query. The query and the length of the output list of relevant documents should be specified in the list of arguments. For example
```bash
hadoop jar /path/to/jar Indexer /path/to/input/files arg1 arg2 arg3
hadoop jar /path/to/jar Query arg1 arg2 arg3 10 "query text"
```
:::info
It is not necessary to pass the query as an argument. You can read it from `stdin`.
:::
Your program should print a usage description if the arguments are used incorrectly.
The output of the query, the list of Wikipedia article titles and their ranks, should be written to the console.
The output of the query job should also be stored on disk.
## Executing MapReduce
First of all, connect to the cluster. Your team should have a login and password; otherwise, ask your TAs for them.
```bash
ssh login@10.90.138.32
```
You must be inside the university network.
On the cluster, all Hadoop files are located at `/hadoop/`.
For example, to show the contents of your home directory in HDFS, use:
```bash
/hadoop/bin/hdfs dfs -ls
```
The files to search over are located at `/EnWikiSmall/` on HDFS. To reduce the workload on the cluster, you can create your index and search over only the `AA*` files.
### Reading Your Report
If you are unsure what you should include in the report, read this [short guide](https://hackmd.io/@9NHMbs3cSOmGDKDUbhIviQ/H1PMURS37). We will definitely look for the team name, team members, contact information, a link to the repository, and each participant's contribution.
Your report should provide justifications for design choices.
### Viewing Your Repository
Please take time to organize the files in your repository.
# FAQ
**Data Path in Hadoop**: `/EnWikiSmall`
**How to get access to cluster**: Team representative asks TA for credentials
**How to submit job on cluster**: You must be in the same subnet as the cluster, normally InnopolisStudent. Some say that InnopolisU also works.
**Testing**: You can test your code locally. For this, pull the data from HDFS.
**Version Control**: You can create a private repository on either GitHub or BitBucket.
**External libraries**: Read section on libraries.
**Git Organization**: Structure your files. Activity in commits is also graded.
**Plagiarism**: Easily detected.
**Report should**:
1. Have structure
2. Include link to repository
3. Describe contribution of each participant
4. Exhibit coherency
5. Cite external references when used
6. Open in a PDF reader
7. Follow best practices of report-writing