A data mart for annotated protein sequence extracted from UniProt database


Title: A data mart for annotated protein sequence extracted from UniProt database
Author: Vyas, Maulik
Abstract: Organizations use data warehouses, together with the tools and architectures built around them, to organize, understand, and use their data when making strategic decisions. A biological data warehouse such as the annotated protein sequence database is a subject-oriented, volatile collection of data related to protein synthesis that is used in bioinformatics. A data mart contains a subset of the enterprise data in a data warehouse that is of value to a specific group of users. I implemented a data mart, based on data warehouse design principles and techniques, on a protein sequence database using data provided by the Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, the data mart focuses on one or more subject areas. It brings together experimental results, computed features, and scientific conclusions by implementing a star schema and a data cube that support the data warehouse, making it easier for organizations to distribute data within a unit. This enables them to deploy the data, manipulate it, and develop the protein sequence data any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to a group of researchers working on protein sequences. I extracted a chunk of this data from the warehouse, transformed it, and loaded it into a staging area. I used HJSplit to split the XML protein sequence data into equal parts and extracted information using an XML editor. I populated the database tables in Microsoft Access 2010 from the XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to the star schema. Finally, I implemented the star schema, OLAP operations, the data cube, and drill-up and drill-down operations for strategic analysis of the protein sequence database based on SQL queries. This ensured explicit support for dimensions, aggregation, and long-range analysis.
Description: Project (M.S., Computer Science) -- California State University, Sacramento, 2011.
URI: http://hdl.handle.net/10211.9/1436
Date: 2012-01-13
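
Illustration: the abstract describes a star schema queried with roll-up and drill-down operations. The sketch below shows what such queries might look like. It is not taken from the project report: the table names (`dim_organism`, `fact_protein`), the columns, and all rows are hypothetical placeholders, and it uses Python's built-in `sqlite3` module rather than the Microsoft Access and MySQL Workbench setup named in the abstract.

```python
import sqlite3


def build_mart():
    """Create a tiny in-memory star schema: one fact table, one dimension.

    All table names, column names, and rows are hypothetical placeholders,
    not data from UniProt or from the project itself.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_organism (
            organism_id INTEGER PRIMARY KEY,
            species     TEXT NOT NULL,
            kingdom     TEXT NOT NULL
        );
        CREATE TABLE fact_protein (
            accession   TEXT PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES dim_organism(organism_id),
            seq_length  INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_organism VALUES (?, ?, ?)",
        [
            (1, "species_a", "kingdom_x"),
            (2, "species_b", "kingdom_x"),
            (3, "species_c", "kingdom_y"),
        ],
    )
    conn.executemany(
        "INSERT INTO fact_protein VALUES (?, ?, ?)",
        [
            ("ACC001", 1, 100),
            ("ACC002", 1, 300),
            ("ACC003", 2, 200),
            ("ACC004", 3, 400),
        ],
    )
    return conn


def roll_up(conn):
    """Roll up: aggregate the facts at the coarser 'kingdom' level."""
    return conn.execute(
        """
        SELECT d.kingdom, COUNT(*), AVG(f.seq_length)
        FROM fact_protein AS f
        JOIN dim_organism AS d ON d.organism_id = f.organism_id
        GROUP BY d.kingdom
        ORDER BY d.kingdom
        """
    ).fetchall()


def drill_down(conn):
    """Drill down: split each kingdom into its species."""
    return conn.execute(
        """
        SELECT d.kingdom, d.species, COUNT(*), AVG(f.seq_length)
        FROM fact_protein AS f
        JOIN dim_organism AS d ON d.organism_id = f.organism_id
        GROUP BY d.kingdom, d.species
        ORDER BY d.kingdom, d.species
        """
    ).fetchall()


if __name__ == "__main__":
    mart = build_mart()
    print(roll_up(mart))
    print(drill_down(mart))
```

Both queries read the same fact table; only the `GROUP BY` level changes. A full data cube would precompute such aggregates for every combination of dimensions, whereas this sketch has a single dimension.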

Files in this item

Files Size Format View Description
Maulik_Vyas_MS_Project_Report_DW_on_UniProt.pdf 751.0Kb PDF View/Open Main Project-PDF
Maulik_Vyas_MS_Project_Report_DW_on_UniProt.doc 752.6Kb Microsoft Word View/Open Main Project-Word
