---
tags: BigData-BS-2019
title: MapReduce. Assignment 1. Simple Text Indexer
---

# Assignment №1. MapReduce. Simple Text Indexer

**Due**: September 23, 2020
**Teams**: groups of 2 people
**Submission Format**: report in PDF, link to git repository, compiled binary

:::success
You are highly encouraged to read a [short guide](https://hackmd.io/@9NHMbs3cSOmGDKDUbhIviQ/H1PMURS37) on how to prepare a report. It is not necessary... but... don't say we didn't warn you...
:::

In this assignment, you will practice on the university's Hadoop cluster. For this task, you are required to split into teams of 2 people. Please form the teams on your own. Team representatives should then report to your TA to create a user on the cluster.

We encourage you to start working ASAP. During the first days, you will have all the cluster resources to yourselves, but on the last day before the deadline, many jobs may be launched on the cluster simultaneously, and your submissions could run much slower. Please keep in mind that running a job can take time, so it is your responsibility to finish the assignment before the deadline.

Create a private git repository for this assignment. Individual commits are graded as well as the overall team performance.

The assignment requires you to prepare a report. It should include a description of your solution and your results, together with an analysis of those results. The report should also state the contribution of each team member. The final report should be submitted in PDF format. The exact way you prepare the report does not play a significant role, but keep in mind that you can use git to store the report in LaTeX or Markdown format.

[toc]

## Search Engine

In the last lab assignment, you familiarized yourself with writing programs in Hadoop's MapReduce. In this assignment, you are going to enhance your skills in writing MapReduce tasks in Java. The goal this time is to implement a naive search engine.

Conceptually, search engines were among the first systems to tackle the problem of big data under the constraint of low-latency responses. Imagine an average search engine that has millions of documents in its index. Every second it receives hundreds to thousands of queries and must produce a list of the most relevant documents at sub-millisecond speed.

The problem of finding relevant information is one of the key problems in the field of Information Retrieval, and it can be subdivided into two related tasks:

- Document indexing
- Query answering

While the second task is more related to low-latency processing, the indexing step can be done offline and is closer to the idea of batch processing. Hadoop's MapReduce can be used to create the index of a text corpus that is too large to fit on one machine. For the sake of the assignment, you are going to do a naive implementation of both tasks using the MapReduce paradigm. The diagram of such a search engine is shown in the figure below.

<p align="center">
<img src="https://www.dropbox.com/s/hkxjcimcx0g4lv0/Search%20Engine%20Flow.png?dl=1" width="250" />
</p>

Before going into the details of MapReduce programming, let us review the basic approach to information retrieval.

## Basics of Information Retrieval for Text

The most common task in IR is textual information retrieval. Whenever a user submits a query, the ranking engine should compute the set of the most relevant documents in its collection.
To complete this task, the engineer should determine the representation format for both documents and queries, and define a measure of relevance of a query for a particular document. One of the simplest IR models is the TF/IDF Vector Space Model.

### Vector Space Model for Information Retrieval

To facilitate the understanding of the vector space model, let's define a toy corpus that consists of five documents:

| Doc Id | Content |
| -------- | -------- |
| 1 | I wonder how many miles I’ve fallen by this time? |
| 2 | According to the latest census, the population of Moscow is more than two million. |
| 3 | It was a warm, bright day at the end of August. |
| 4 | To be, or not to be? |
| 5 | The population, the population, the population |

To define the vector space model, we need to introduce three concepts: vocabulary, term frequency (TF), and inverse document frequency (IDF).

#### Term

A term is a unique word.

#### Vocabulary

The vocabulary is the set of unique terms present in the corpus. For the example above, the vocabulary can be defined as

```
{'a', 'according', 'at', 'august', 'bright', 'by', 'census', 'day', 'end', 'fallen', 'how', 'i', 'is', 'it', 'i’ve', 'latest', 'many', 'miles', 'million', 'more', 'moscow', 'of', 'population', 'than', 'the', 'this', 'time', 'to', 'two', 'warm', 'was', 'wonder', 'or', 'not', 'be'}
```

For ease of further description, we associate each term in the vocabulary with a unique id:

```
(0, 'a'), (1, 'according'), (2, 'at'), (3, 'august'), (4, 'bright'), (5, 'by'), (6, 'census'), (7, 'day'), (8, 'end'), (9, 'fallen'), (10, 'how'), (11, 'i'), (12, 'is'), (13, 'it'), (14, 'i’ve'), (15, 'latest'), (16, 'many'), (17, 'miles'), (18, 'million'), (19, 'more'), (20, 'moscow'), (21, 'of'), (22, 'population'), (23, 'than'), (24, 'the'), (25, 'this'), (26, 'time'), (27, 'to'), (28, 'two'), (29, 'warm'), (30, 'was'), (31, 'wonder'), (32, 'or'), (33, 'not'), (34, 'be')
```
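To make this concrete, here is a tiny, non-Hadoop Java sketch of how such a term-to-id mapping could be built. The class and method names, and the naive tokenization rule, are our own illustration and not part of the assignment:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VocabularyBuilder {

    // Maps each unique term to a unique id, in order of first appearance.
    public static Map<String, Integer> build(List<String> documents) {
        Map<String, Integer> termIds = new LinkedHashMap<>();
        for (String doc : documents) {
            // Naive tokenization: lowercase and split on anything that is not a letter or apostrophe.
            for (String token : doc.toLowerCase().split("[^a-z’']+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Assign the next free id the first time a term is seen.
                termIds.putIfAbsent(token, termIds.size());
            }
        }
        return termIds;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "I wonder how many miles I’ve fallen by this time?",
                "To be, or not to be?");
        System.out.println(build(corpus)); // e.g. {i=0, wonder=1, how=2, ...}
    }
}
```

At corpus scale, this step roughly corresponds to the Word Enumerator job described later, implemented with MapReduce.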
#### Term Frequency

Term Frequency (TF) is the frequency of occurrence of a term $t$ in a document $d$. The previous documents can be represented with TF as follows, given the term format `(id, frequency)`:

| Doc Id | Content |
| -------- | -------- |
| 1 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0), (9, 1), (10, 1), (11, 1), (12, 0), (13, 0), (14, 1), (15, 0), (16, 1), (17, 1), (18, 0), (19, 0), (20, 0), (21, 0), (22, 0), (23, 0), (24, 0), (25, 1), (26, 1), (27, 0), (28, 0), (29, 0), (30, 0), (31, 1), (32, 0), (33, 0), (34, 0)` |
| 2 | `(0, 0), (1, 1), (2, 0), (3, 0), (4, 0), (5, 0), (6, 1), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 1), (13, 0), (14, 0), (15, 1), (16, 0), (17, 0), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 0), (26, 0), (27, 1), (28, 1), (29, 0), (30, 0), (31, 0), (32, 0), (33, 0), (34, 0)` |
| 3 | `(0, 1), (1, 0), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0), (7, 1), (8, 1), (9, 0), (10, 0), (11, 0), (12, 0), (13, 1), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 1), (22, 0), (23, 0), (24, 1), (25, 0), (26, 0), (27, 0), (28, 0), (29, 1), (30, 1), (31, 0), (32, 0), (33, 0), (34, 0)` |
| 4 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 0), (22, 0), (23, 0), (24, 0), (25, 0), (26, 0), (27, 2), (28, 0), (29, 0), (30, 0), (31, 0), (32, 1), (33, 1), (34, 2)` |
| 5 | `(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 0), (17, 0), (18, 0), (19, 0), (20, 0), (21, 0), (22, 3), (23, 0), (24, 3), (25, 0), (26, 0), (27, 0), (28, 0), (29, 0), (30, 0), (31, 0), (32, 0), (33, 0), (34, 0)` |

As you can see, this representation format is very sparse, and we could save space by storing it in a sparse representation (i.e., removing all zero entries):

| Doc Id | Content |
| -------- | -------- |
| 1 | `(5, 1), (9, 1), (10, 1), (11, 1), (14, 1), (16, 1), (17, 1), (25, 1), (26, 1)` |
| 2 | `(1, 1), (6, 1), (12, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (27, 1), (28, 1)` |
| 3 | `(0, 1), (2, 1), (3, 1), (4, 1), (7, 1), (8, 1), (13, 1), (21, 1), (24, 1), (29, 1), (30, 1)` |
| 4 | `(27, 2), (32, 1), (33, 1), (34, 2)` |
| 5 | `(22, 3), (24, 3)` |

#### Inverse Document Frequency

IDF shows in how many documents a term has occurred. The measure signifies how common a particular term is. If the IDF is high, the presence of this term in a document does not help us to distinguish between documents. The IDFs for our corpus are

```
(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 2), (22, 2), (23, 1), (24, 3), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)
```

You can see that the term `be` occurs twice in document 4, but its IDF is only `1`. The term `the` occurs in documents 2, 3, and 5; hence its IDF is `3`.

Sometimes IDF is defined as

$$
IDF(t) = \log \frac{N}{n(t)}
$$

where $N$ is the overall number of documents in the collection, and $n(t)$ is the number of documents containing the term $t$. Other definitions are also possible. You are free to use any of them.

#### TF/IDF Weights

TF/IDF weights are nothing more than term frequencies normalized by IDF.
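In other words, with the counting definition of IDF used above, the weight of a term $t$ in a document $d$ is

$$
w(t, d) = \frac{TF(t, d)}{IDF(t)}
$$

For example, the term `the` (id 24) appears twice in document 2 and its IDF is 3, so its weight in document 2 is $2/3 \approx 0.66$.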
| Doc Id | Content |
| -------- | -------- |
| 1 | `(5, 1), (9, 1), (10, 1), (11, 1), (14, 1), (16, 1), (17, 1), (25, 1), (26, 1)` |
| 2 | `(1, 1), (6, 1), (12, 1), (15, 1), (18, 1), (19, 1), (20, 1), (21, 0.5), (22, 0.5), (23, 1), (24, 0.66), (27, 0.5), (28, 1)` |
| 3 | `(0, 1), (2, 1), (3, 1), (4, 1), (7, 1), (8, 1), (13, 1), (21, 0.5), (24, 0.33), (29, 1), (30, 1)` |
| 4 | `(27, 1), (32, 1), (33, 1), (34, 2)` |
| 5 | `(22, 1.5), (24, 1)` |

Here you can notice that the terms with ids `{21, 22, 24, 27}` (`of`, `population`, `the`, `to`) are downscaled.

#### Basic Vector Space Model

In the basic vector space model, both documents and queries are represented with corresponding vectors, which capture the TF/IDF weights of the document and the query. The simplest way to convert TF/IDF weights into a vector that a computer can interpret is to index an array with the word ids and record the TF/IDF values. In this case, the vector for document `5` will be

```
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The query `the population` will result in the vector

```
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0, 0.33, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The function that determines the relevance of a document to the query is the inner product (scalar product) of the two vectors:

$$
r(q,d) = \sum_{i=1}^{|V|} q_i \cdot d_i
$$

where $|V|$ is the size of the vocabulary, and $q_i$ is the TF/IDF weight of the $i^{th}$ term in the query.

As you can see, the naive vector space representation with arrays is very sparse. Alternatively, you can create a sparse vector representation. The difference between a sparse vector and a regular vector is that a sparse vector stores only non-zero values. Such a vector can be implemented with a dictionary (in Java terms, a `Map`). The sparse vector representation of document `5` is

```
[(22: 1.5), (24: 1)]
```

and of the query

```
[(22: 0.5), (24: 0.33)]
```

The relevance function is easily reformulated as

$$
r(q,d) = \sum_{i: i\in d, i\in q} q_i \cdot d_i
$$

where the summation is performed over all the terms that are present in both the query and the document. From the implementation standpoint, the dictionary structure is more appealing because the keys are unique and form sets. The common terms for a document and a query can be easily found by set intersection (which is usually very fast).

If we compute the relevance of the query `the population` for our toy corpus, the result would be

```
doc 5: 1.08
doc 2: 0.4678
doc 3: 0.1089
doc 4: 0
doc 1: 0
```

As you can see, even though document `5` arguably contains less information, it still comes first according to our relevance rating.

##### Room for speedup

The indexing and retrieval steps experience a slowdown partly due to storing documents as arrays and having to map each word to its id with a `Map` data structure. This can be partially alleviated if we allow for some probability of mistakes. Instead of creating an id for every word in a preprocessing step and continuously reusing this mapping, we can generate an id on the fly using a hash function, completely eliminating the word enumeration step. Depending on the allowable hash range, we can reduce the chance of collisions significantly. The downside (or benefit) is added complexity in the retrieval engine, as we now must implement the vector space model strictly using sparse data structures.
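To illustrate the sparse representation and the hashing idea, here is a minimal, non-Hadoop Java sketch. The class name, the hash range, and the use of `String.hashCode` are our own choices for illustration, not a required part of the assignment:

```java
import java.util.Map;

public class SparseRelevance {

    // Hypothetical hash range; a larger range lowers the probability of collisions.
    private static final int HASH_RANGE = 1 << 20;

    // Generate a term id on the fly instead of looking it up in a precomputed vocabulary.
    public static int termId(String term) {
        return Math.floorMod(term.hashCode(), HASH_RANGE);
    }

    // r(q, d): sum of products over the terms present in both the query and the document.
    public static double relevance(Map<Integer, Double> query, Map<Integer, Double> doc) {
        // Iterate over the smaller map and look its terms up in the larger one.
        Map<Integer, Double> small = query.size() <= doc.size() ? query : doc;
        Map<Integer, Double> large = (small == query) ? doc : query;
        double score = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                score += entry.getValue() * other;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Sparse TF/IDF vectors of document 5 and the query "the population" from the example above.
        Map<Integer, Double> doc5 = Map.of(22, 1.5, 24, 1.0);
        Map<Integer, Double> query = Map.of(22, 0.5, 24, 0.33);
        System.out.println(relevance(query, doc5)); // ≈ 1.08, matching the ranking above
    }
}
```

Iterating over the smaller of the two maps keeps the intersection cheap when documents are much longer than queries.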
### BM25

[Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is widely accepted as a much better ranking function for ad-hoc (no relevance data available) information retrieval. The formula looks like this:

$$
r(q,d) = \sum_{i: i\in d, i\in q} IDF(d_i) \frac{d_i \cdot (k_1 + 1)}{d_i + k_1 \cdot (1 - b + b \cdot \frac{|d|}{avgdl})}
$$

For the sake of simplicity, we can assume the same definition of IDF as above; $d_i$ is the term frequency, $avgdl$ is the average document length in the corpus, $|d|$ is the size of the current document, and $b$ and $k_1$ are free parameters (usually set to 0.75 and 2.0). This ranking function solves several problems of the vector space model approach, including the bias towards longer documents.

### Mean Average Precision

[MAP](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) is a metric widely applied for measuring ranking quality. The notion of relevance is often applied to estimate the quality of the ranker. If a document is useful for answering the query, it is called relevant. If it has no meaningful connection to the query, the document is not relevant. The search engine returns an ordered list of documents, sorted by descending relevance. We want the top of the list to contain all of the relevant documents for our query.

Assume that your ranker retrieves a list of documents of length $N_l$. MAP is the average of a metric called Average Precision over some set of queries:

$$
MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q)
$$

The Average Precision is calculated as

$$
AP = \frac{1}{N_{rel}}\sum_{k=1}^{N_{l}} P(k) \cdot \text{rel}(k)
$$

where $N_{rel}$ is the number of relevant documents returned by the ranker, $P(k)$ is the precision calculated over positions 1 to $k$, and $\text{rel}(k)$ is an indicator function that equals 1 when the document at position $k$ is relevant, and 0 otherwise.
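As a small worked example with made-up numbers: suppose the ranker returns $N_l = 5$ documents for a query, and the documents at positions 1, 3, and 4 are relevant, so $N_{rel} = 3$. Then

$$
AP = \frac{1}{3}\left(P(1) + P(3) + P(4)\right) = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{4}\right) \approx 0.81
$$

MAP is this value averaged over the whole set of evaluation queries.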
## Vector Space Model with MapReduce

One of the possible ways to implement a naive search engine is shown in the diagram below.

<p align="center">
<img src="https://www.dropbox.com/s/wo2b6amni66djzj/Search%20Engine%20with%20MapReduce.png?dl=1" width="300" />
</p>

Given a collection of documents, we first submit it to the indexing engine. The indexing engine starts a couple of tasks to assess the entire corpus.

1. The Word Enumerator scans the corpus and creates a set of unique words. After that, it assigns a unique id to each word. This task can be implemented with MapReduce.
2. Document Count does a similar task, but instead of creating a set of unique terms, it aims to compute the IDF for each term, or simply the number of documents where each particular term appeared. This task can be implemented with MapReduce.
3. The Vocabulary is an abstract concept and does not actually require computing anything. It simply aggregates the results of the Word Enumerator and Document Count into one data structure.
4. The Indexer has to compute a machine-readable representation of the whole document corpus. For each document, the Indexer creates a TF/IDF representation and stores a tuple of `(doc id, vector representation)`. Since each document is processed independently, this can be implemented with MapReduce.

After the index is created, it can be reused multiple times. The Ranking Engine has to create a vectorized representation of the query and perform the relevance analysis.

1. The Query Vectorizer is the method that creates the vectorized representation of a query. It can be implemented as a part of the Relevance Analyzer.
2. The Relevance Analyzer computes the relevance function between the query and each document. This task can be implemented with MapReduce (the performance depends on the available hardware).
3. The index stores the document id and the vectorized representation. The Ranker provides the list of ids, sorted by relevance.

The Content Extractor should extract the relevant content from the text corpus based on the provided relevant ids. The Query Response contains the list of relevant content (for example, links with titles).

***The model presented above shows one possible way to implement the system so that it can be mapped onto the MapReduce paradigm. You are not constrained to following this architecture; the only limitation is that the whole system should be implemented with Hadoop's MapReduce.***

## Dataset Description

For the sake of the exercise, we are going to work with the English Wikipedia dump ([freely available](https://dumps.wikimedia.org/backup-index.html)). The dump is extracted in JSON format:

```json
{"id": "0", "url": "https://en.wikipedia.org/wiki?curid=0", "title": "Sample Title", "text": "Sample Text"}
```

Later you will have access to a Hadoop cluster with HDFS. The Wikipedia dump resides at the location `/EnWikiSmall` on HDFS. The folder `EnWikiSmall` contains many files; each file contains multiple JSON strings, each string corresponding to one Wikipedia article. In your search engine implementation, you will need to work with a subset of the Wikipedia dump and provide it as an input to several MapReduce jobs. You can use external JSON libraries for parsing the input. The classes `org.json.JSONObject` and `org.json.JSONException` are already available on the cluster.

## Developing

The first thing you should do is study [Apache's MapReduce guide](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). It describes MapReduce capabilities and provides examples with configuration. Pay attention to the [WordCount 2.0](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0) example, as it provides useful insight into how to structure your MapReduce classes.

### Organizing the project

The development process is done in teams. In an ideal world, before starting the project, the team meets and discusses the details of the project, which modules they are going to implement, and who will be responsible for each particular one. Although there are no strict requirements on how you are going to work in teams, we encourage you to use the best practices you have learned so far. The only requirement we have is that you create a private git repository and commit gradually as you progress. Please do not upload your work at the last moment. Use the benefits of version control and collaborative development.

### Testing

Depending on the size of the dataset, the processing can take a significant amount of time. To accelerate the development process, it is recommended to perform local testing on a small subset. You can download a small subset from the HDFS cluster using `hdfs dfs -get`.

### Libraries

Hadoop imposes some limitations on data types. The types used as input and output values in the mapper and reducer should implement the [`Writable` interface](https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/io/Writable.html). For the current task, you can write your own `Writable` types, use any of the standard ones, or resort to string serialization if you think it is the answer to your problem.
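As an illustration only, a minimal custom `Writable` holding a (term id, weight) pair might look like the following sketch; the class name and fields are our own and are not required by the assignment:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A minimal custom Writable holding a (term id, TF/IDF weight) pair.
public class TermWeightWritable implements Writable {

    private int termId;
    private double weight;

    // Hadoop instantiates Writables reflectively, so a no-argument constructor is required.
    public TermWeightWritable() {
    }

    public TermWeightWritable(int termId, double weight) {
        this.termId = termId;
        this.weight = weight;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(termId);
        out.writeDouble(weight);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        termId = in.readInt();
        weight = in.readDouble();
    }

    @Override
    public String toString() {
        return termId + ":" + weight;
    }
}
```

Note that if such a type is used as a key rather than a value, it has to implement `WritableComparable` instead, so that Hadoop can sort the keys.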
You can use libraries for JSON deserialization, hashing, and data structures. The algorithmic part of the search engine should be original. Your code will be checked for plagiarism, and yes, our plagiarism detection system can easily detect renamed variables and permutations of lines.

### Passing arguments to a MapReduce task

Sometimes you need to pass additional arguments to your mapper or reducer. One of the simplest ways to do this is to pass the argument as a configuration string for the current job. During job configuration, write

```java
Configuration conf = new Configuration();
conf.set("test", "test value");
Job job = new Job(conf);
```

and call

```java
Configuration conf = context.getConfiguration();
String param = conf.get("test");
```

inside your mapper or reducer. The main limitation is that your object should be serializable as a string. Feel free to use other approaches to passing arguments with Hadoop.

Alternatively, you can use the `DistributedCache`. Placing a file in the cache is as simple as

```java
Job.addCacheFile(URI)
```

and accessing the files is as simple as

```java
URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
```

The `DistributedCache` copies the files to each of the nodes before the job starts; it is assumed that the files are already present and are used for reading purposes only. For more information, refer to the [Apache manual](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#DistributedCache).

### Reading files from HDFS

Sometimes you might need to read files directly from HDFS. Luckily, this is easily accomplished with the Hadoop libraries. The steps are simple:

1. Create a `FileSystem` object from a `Configuration`
2. Create an `InputStream`

```java
FileSystem fs = FileSystem.get(configuration);
InputStream is = fs.open(path);
```

`FileSystem` supports many additional methods, like listing directories and checking whether a file exists. For more information, refer to the [class description](http://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/fs/FileSystem.html).

### Working with Custom Types

Hadoop provides many [default types](https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/io/Writable.html). The current task can be implemented using only a tiny subset of them. If you decide to create your own custom types, you should also consider creating custom [`RecordReader`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/RecordReader.html) and [`RecordWriter`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/RecordWriter.html) classes. Many tutorials describing the ins and outs of this process are available online ([for example](https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-recordreader-example/)). More substantial information about this topic can be found in the book *Hadoop: The Definitive Guide*, chapter *II.8*, paragraphs *Input Formats* and *Output Formats*.

### Custom Comparators

For some tasks, namely ranking, you will need to set a custom comparator class that will sort your output in descending order. For this purpose, you can create a custom comparator that extends [`WritableComparator`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparator.html).
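For example, assuming the relevance score is emitted as a `DoubleWritable` key (our assumption for this sketch), a descending comparator could look like this:

```java
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts DoubleWritable keys in descending order so that the most relevant documents come first.
public class ReverseComparator extends WritableComparator {

    public ReverseComparator() {
        // 'true' tells WritableComparator to create key instances for the compare() call below.
        super(DoubleWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        DoubleWritable left = (DoubleWritable) a;
        DoubleWritable right = (DoubleWritable) b;
        // Invert the natural (ascending) order.
        return right.compareTo(left);
    }
}
```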
The custom comparator can be specified during job configuration:

```java
job.setSortComparatorClass(ReverseComparator.class);
```

### Number of Reducers

Be aware of your ability to set the number of reduce tasks with

```java
Job.setNumReduceTasks(int)
```

This might be important when your input data scales up.

### Memory Consumption

With the current cluster configuration, your mapper and reducer should not consume more than `728 MB` of RAM. Otherwise, the container will be killed.

## Expected Outcome

### Using Your Search Application

Your program should provide a convenient interface to your MapReduce jobs. Hadoop jobs are executed in the following way:

```bash
hadoop jar /path/to/jar ExecutedClass arg1 arg2 arg3
```

You should provide a way to run the indexer on a set of inputs specified by the arguments, and a way to execute a query. The query and the length of the output list of relevant documents should be specified in the list of arguments. For example:

```bash
hadoop jar /path/to/jar Indexer /path/to/input/files arg1 arg2 arg3
hadoop jar /path/to/jar Query arg1 arg2 arg3 10 "query text"
```

:::info
It is not necessary to pass the query as an argument. You can read it from `stdin`.
:::

Your program should provide a usage description if the arguments are used incorrectly. The output of the query, i.e., the list of Wikipedia article titles and their rank, should be written to the console. The output of the query job should also be stored on disk.

## Executing MapReduce

First of all, connect to the cluster. Your team should have a login and password. Otherwise, ask your TAs for them.

```bash
ssh login@10.90.138.32
```

You should be on the university network. On the cluster, all Hadoop files are located at `/hadoop/`. For example, to show the content of your home directory in HDFS, use:

```bash
/hadoop/bin/hdfs dfs -ls
```

The files to search over are located at `/EnWikiSmall/` on HDFS. To reduce the workload on the cluster, you could create your index and search only on the `AA*` files.

<!-- You can find instructions how to compile and run MapReduce tasks in https://hackmd.io/FiYlfKWXT_Wldqsuq_NArw# -->

### Reading Your Report

If you are unsure what you should include in the report, read this [short guide](https://hackmd.io/@9NHMbs3cSOmGDKDUbhIviQ/H1PMURS37). We will definitely look for the team name, team members, contact information, a link to the repository, and the participants' contributions. Your report should provide justifications for your design choices.

### Viewing Your Repository

Please take time to organize the files in your repository.

# FAQ

**Data Path in Hadoop**: `/EnWikiSmall`

**How to get access to the cluster**: The team representative asks the TA for credentials.

**How to submit a job on the cluster**: You must be in the same subnet as the cluster. Normally, InnopolisStudent works. Some say that InnopolisU also works.

**Testing**: You can test your code locally. For this, pull the data from HDFS.

**Version Control**: You can create a private repository on either GitHub or BitBucket.

**External libraries**: Read the section on libraries.

**Git Organization**: Structure your files. Activity in commits is also graded.

**Plagiarism**: Easily detected.

**The report should**:

1. Have structure
2. Include a link to the repository
3. Describe the contribution of each participant
4. Exhibit coherency
5. Cite external references when used
6. Open in a PDF reader
7. Follow best practices of report writing
