Please use this identifier to cite or link to this item:
http://dspace.iiitb.ac.in:8080/handle/123456789/68
Title: | Mining Semantics from unstructured text using a cognitive model on lexical co-occurrences |
Authors: | Rachakonda, Aditya Ramana |
Keywords: | Engineering and Technology; Computer Science; Computer Science Information Systems |
Issue Date: | 20-Dec-2013 |
Publisher: | International Institute of Information Technology Bangalore |
Abstract: | Mining latent semantics from text corpora is an important problem with several applications, and models to explain how semantic associations get embedded into natural language are of immense interest. In this thesis, we take a cognitive modelling approach to this problem. The text corpus is viewed as a cognitive space, and each document is treated as a cognitive episode whose objective is to convey some meaning within some context. Semantic communication is proposed to take place in a 3-layer model. The topmost layer, called the analytic layer, maintains the semantic worldview of the communicator. Communication happens in the context of an episode, where semantics from the analytic layer are combined with the episodic objectives to create conceptual idea units. This happens in the second layer, called the episodic layer. The last layer, called the linguistic layer, provides language-specific vocabularies and inflections that convert conceptual idea units into communication units like speech utterances, gestures or written text. Mining latent semantics becomes the problem of inferring analytic layer associations by observing linguistic layer recordings. This is addressed by defining the problem in the analytic layer; proposing an episodic hypothesis on observations of episodic patterns; and extracting semantics from patterns of co-occurrences of terms based on the hypothesis. In this thesis, we define three semantic associations and, using the 3-layer cognitive model, reliably mine them from co-occurrences observed in the English language Wikipedia. The first of these associations, called topical anchors, represents the topic of a given short piece of text, like a sentence. We found that our algorithm could identify the topical anchor with an accuracy similar to that of a cluster labelling algorithm, even though the input was just a small set of words describing an episode rather than a set of documents.
For example, given the statement ‘Andy Murray defeated Roger Federer’, we identified that ‘Tennis’ is a term which semantically represents the topic of the statement. The second semantic association, called semantic siblings, represents terms which belong to a hypothetical set and semantically share a conceptual parent. We found that, using co-occurrences alone and without structural or domain-specific cues, we could extract semantic siblings of a small set of terms in a manner similar to set expansion algorithms. For instance, given a seed set of English Premier League football teams like {‘Manchester United’, ‘Arsenal’, ‘Chelsea’}, we identified that terms like ‘Liverpool’ and ‘Newcastle United’ belong to the same set. Finally, we propose a semantic association called topical markers. Topical markers are terms which are unique to a specific context and whose presence in a sentence helps to unambiguously identify its topic. Given an arbitrary topic and a co-occurrence graph, we found that we could extract terms which are unique to the topic. We also found that the topical markers, when used as query terms in a generic search engine, returned pages which are unique to the topic in question. Using these semantic associations, we confirm the validity of our hypotheses and, in turn, the validity of the cognitive model proposed in this thesis. |
Description: | xv, 128p. |
URI: | http://dspace.iiitb.ac.in/handle/123456789/68 |
Appears in Collections: | 1. PHD Thesis |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
PH2006901-Aditya Ramana Rachakonda.pdf Until 2025-08-19 | PH2006901-Aditya Ramana Rachakonda Thesis | 3.81 MB | Adobe PDF | |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.