Monday 10 February 2014

What is Metadata Management? Explain Integrated Metadata Management with a block diagram.

Metadata management can be defined as the end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining and managing a metadata schema, model or other structured aggregation system, either independently or within a repository and the associated supporting processes.
The purpose of metadata management is to support the development and administration of the data warehouse infrastructure as well as the analysis of the data over time.
Metadata is widely considered a promising driver for improving the effectiveness and efficiency of data warehouse usage, development, maintenance and administration. Data warehouse usage can be improved because metadata provides end users with the additional semantics necessary to reconstruct the business context of the data stored in the data warehouse.
Integrated Metadata:
An integrated metadata management system supports all kinds of users who are involved in the data warehouse development process. End users, developers and administrators can all access the metadata. Developers and administrators mainly focus on technical metadata but can draw on business metadata when they need it. They need metadata to understand the transformations of object data and the underlying data flows, as well as the technical and conceptual system architecture.


Several metadata management systems are in existence. One such system/tool is the Integrated Metadata Repository System (IMRS). It is a metadata management tool used to support a corporate data management function and is intended to provide metadata management services. Thus, the IMRS supports the engineering and configuration management of data environments incorporating e-business transactions, complex databases, federated data environments, and data warehouses / data marts. The metadata contained in the IMRS is used to support application development, data integration, and the system administration functions needed to achieve data element semantic consistency across a corporate data environment, and to implement integrated or shared data environments.
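As a small illustration (the field names below are hypothetical and do not reflect the actual IMRS schema), a repository entry for a single warehouse column might combine technical and business metadata like this:

# Hypothetical metadata repository entry for one data warehouse column.
# Field names are illustrative only; they do not reflect the actual IMRS schema.
column_metadata = {
    "technical": {                      # used mainly by developers/administrators
        "source_system": "ORDERS_OLTP",
        "source_column": "ORD_AMT",
        "target_table": "FACT_SALES",
        "target_column": "sales_amount",
        "data_type": "DECIMAL(12,2)",
        "transformation": "currency converted to USD during ETL",
        "load_frequency": "daily",
    },
    "business": {                       # used mainly by end users
        "business_name": "Sales Amount",
        "definition": "Invoiced order value, net of discounts",
        "owner": "Finance department",
        "quality_rule": "must be >= 0",
    },
}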

Define the process of Data Profiling, Data Cleansing and Data Enrichment.

Data Profiling:
Data Profiling is the process of examining the data available in an existing data source and collecting statistics and information about that data (a small profiling sketch follows the list below). The purpose of these statistics may be to:

  1. Find out whether the existing data can easily be used for other purposes.
  2. Give metrics on data quality, including whether the data conforms to company standards.
  3. Assess the risk involved in integrating the data into new applications, including the challenges of joins.
  4. Track data quality.
  5. Assess whether the metadata accurately describes the actual values in the source database.
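As an illustration, a minimal profiling pass over a table could be sketched with pandas as follows (the file and column names are hypothetical):

# Minimal data profiling sketch with pandas (illustrative file/column names).
import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical source extract

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),          # inferred data type per column
    "non_null": df.notna().sum(),            # completeness
    "null_pct": df.isna().mean() * 100,      # missing-value rate
    "distinct": df.nunique(),                # cardinality (join/key candidates)
    "min": df.min(numeric_only=True),        # value ranges for numeric columns
    "max": df.max(numeric_only=True),
})
print(profile)

# Example conformance check against a company standard (hypothetical rule):
# customer_id should be unique and never null.
assert df["customer_id"].notna().all() and df["customer_id"].is_unique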

Data Cleansing:
Data cleansing or Data Scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
Data cleansing involves the following tasks (a small pandas sketch of these steps follows below):
  1. Converting data fields to a common format
  2. Correcting errors
  3. Eliminating inconsistencies
  4. Matching records to eliminate duplicates
  5. Filling in missing values, etc.
After cleansing, a data set will be consistent with other similar data sets in the system.
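A minimal cleansing pass along the lines of these tasks might look as follows in pandas (the column names are hypothetical):

# Minimal data cleansing sketch with pandas (illustrative column names).
import pandas as pd

df = pd.read_csv("customers.csv")

# 1. Convert fields to a common format (e.g. one date format, trimmed text).
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["country"] = df["country"].str.strip().str.upper()

# 2./3. Correct errors and eliminate inconsistencies (e.g. variant spellings).
df["country"] = df["country"].replace({"U.S.A.": "USA", "UNITED STATES": "USA"})

# 4. Match records to eliminate duplicates.
df = df.drop_duplicates(subset=["customer_id"], keep="first")

# 5. Fill missing values with a documented default.
df["country"] = df["country"].fillna("UNKNOWN")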

Data Enrichment 
Data Enrichment is the process of adding value to your data. In some cases, external data providers sell data, which may be used to augment existing data. In other cases, data from multiple internal sources are simply integrated to get the “big” picture. In any event, the intended result is a data asset that has been increased in value to the user community. 
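For instance (with purely illustrative file and column names), internal customer records could be enriched with purchased demographic data by joining on a shared key:

# Minimal data enrichment sketch: augment internal data with an external feed.
import pandas as pd

customers = pd.read_csv("customers.csv")                # internal source
demographics = pd.read_csv("vendor_demographics.csv")   # purchased external data

# A left join keeps every internal record and adds vendor attributes where available.
enriched = customers.merge(demographics, on="postal_code", how="left")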


Discuss the Extraction Methods in Data Warehouses

This function has to deal with numerous data sources, and you have to employ the appropriate technique for each one. Source data may come from different source machines in diverse data formats. Part of the source data may be in relational database systems, some data may be in legacy network and hierarchical data models, and many data sources may still be in flat files. You may also want to include data from spreadsheets and local departmental data sets. Data extraction can therefore become quite complex.
Tools are available on the market for data extraction. You may want to consider using outside tools suitable for certain data sources. For the other data sources, you may want to develop in-house programs to do the data extraction. Purchasing outside tools may entail high initial costs. In-house programs, on the other hand, may mean ongoing costs for development and maintenance.
After you extract the data, where do you keep it for further preparation? You may perform the extraction function on the legacy platform itself if that approach suits your framework. More frequently, data warehouse implementation teams extract the source data into a separate physical environment from which moving the data into the data warehouse is easier. In that separate environment, you may extract the source data into a group of flat files, a data-staging relational database, or a combination of both.
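As a minimal sketch of the last approach, extracting a relational source into a flat file in the staging area (the database, table and file names are hypothetical):

# Minimal extraction sketch: pull rows from a relational source into a staging flat file.
import csv
import sqlite3   # stand-in for any relational source accessible from Python

conn = sqlite3.connect("orders_oltp.db")          # hypothetical source database
cursor = conn.execute("SELECT order_id, customer_id, order_date, amount FROM orders")

with open("orders_extract.csv", "w", newline="") as f:   # file in the staging area
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)                                  # data rows
conn.close()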

Describe the strengths of Dimensional Model as compared to E-R Model.

Dimensional Modeling (DM) is a favorite modeling technique in data warehousing. In DM, a
model of tables and relations is constituted with the purpose of optimizing decision support
query performance in relational databases, relative to a measurement or set of measurements of
the outcome(s) of the business process being modeled. In contrast, conventional E-R models
are constituted to (a) remove redundancy in the data model, (b) facilitate retrieval of individual
records having certain critical identifiers, and (c) therefore, optimize On-line Transaction
Processing (OLTP) performance.
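As a small illustration (a hypothetical star schema with a sales fact table and product and date dimensions), a typical decision-support query joins the fact table to its dimensions and aggregates a measurement:

# Minimal star-schema query sketch with pandas (hypothetical tables/columns).
import pandas as pd

fact_sales = pd.read_csv("fact_sales.csv")     # measurements: sales_amount plus keys
dim_product = pd.read_csv("dim_product.csv")   # product_key, category, brand, ...
dim_date = pd.read_csv("dim_date.csv")         # date_key, year, quarter, ...

# Join the fact table to its dimensions, then aggregate the measurement
# by the dimension attributes the analyst cares about.
result = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["year", "category"])["sales_amount"]
          .sum()
          .reset_index())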

The strengths of the Dimensional Model as compared to the E-R model are as follows:
1. Dimensional modelling is very flexible from the user's perspective. A dimensional data model maps directly onto physical schemas (star or snowflake), whereas an E-R model is not mapped directly to such schemas and is not used to convert normalized data into a denormalized form.
2. The E-R model is used for OLTP databases, which are kept in 1st, 2nd or 3rd normal form, whereas the dimensional data model is used for data warehousing and uses denormalized star or snowflake schemas.
3. The E-R model contains normalized data, whereas the dimensional model contains denormalized data.
4. An E-R diagram represents the entire business or application process; such a diagram can be segregated into multiple dimensional models. In other words, an E-R model has both a logical and a physical model, while the dimensional model is essentially a physical model.
5. E-R modelling revolves around entities and their relationships to capture the overall process of the system, whereas dimensional (multi-dimensional) modelling revolves around dimensions (points of analysis) for decision making, not around capturing the process.


What is Data Mining? Explain the common techniques used in Data Mining.

In its simplest form, data mining automates the detection of relevant patterns in a database, using defined approaches and algorithms to look into current and historical data that can then be analyzed to predict future trends. Because data mining tools predict future trends and behaviors by reading through databases for hidden patterns, they allow organizations to make proactive, knowledge-driven decisions and answer questions that were previously too time-consuming to resolve.
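As a toy illustration of mining historical data to project a future trend (the sales figures below are invented), a least-squares line can be fitted to past values and extrapolated:

# Toy trend-prediction sketch: fit a line to historical values and extrapolate.
import numpy as np

monthly_sales = np.array([120, 132, 128, 145, 150, 162], dtype=float)  # invented data
months = np.arange(len(monthly_sales))

slope, intercept = np.polyfit(months, monthly_sales, deg=1)   # least-squares trend line
next_month_forecast = slope * len(monthly_sales) + intercept
print(f"Forecast for month {len(monthly_sales) + 1}: {next_month_forecast:.1f}")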

Traditional Data Mining Tools: Traditional data mining programs help companies establish data patterns and trends by using a number of complex algorithms and techniques. Some of these tools are installed on the desktop to monitor the data and highlight trends, while others capture information residing outside a database.

Dashboards: Installed in computers to monitor information in a database, dashboards reflect data changes and updates onscreen — often in the form of a chart or table — enabling the user to see how the business is performing. Historical data also can be referenced, enabling the user to see where things have changed (e.g., increase in sales from the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.

Text-mining Tools: The third type of data mining tool sometimes is called a text-mining tool because of its ability to mine data from different kinds of text — from Microsoft Word and Acrobat PDF documents to simple text files, for example. These tools scan content and convert the selected data into a format that is compatible with the tool's database, thus providing users with an easy and convenient way of accessing data without the need to open different applications.

Explain the image compression system model with a block diagram.

Figure 12.1 shows a compression system consisting of two distinct structural blocks: an encoder and a decoder. An input image f(x, y) is fed into the encoder, which creates a set of symbols from the input data. After transmission over the channel, the encoded representation is fed to the decoder, where a reconstructed output image f̂(x, y) is generated. In general, f̂(x, y) may or may not be an exact replica of f(x, y). If it is, the system is error free or information preserving; if not, some level of distortion is present in the reconstructed image.
Both the encoder and decoder shown in Figure 12.1 consist of two relatively independent functions or sub-blocks. The encoder is made up of a source encoder, which removes input redundancies, and a channel encoder, which increases the noise immunity of the source encoder's output. The decoder includes a channel decoder followed by a source decoder. If the channel between the encoder and decoder is noise free (not prone to error), the channel encoder and decoder are omitted, and the general encoder and decoder become the source encoder and decoder, respectively.

(For the block diagram, see page 176 of the Manipal University BCA 6th semester Image Processing book.)

The Source Encoder and Decoder
The source encoder is responsible for reducing or eliminating any coding, interpixel or psychovisual redundancies in the input image. The specific application and associated fidelity requirements dictate the best encoding approach to use in any given situation. Normally, the approach can be modeled by a series of three independent operations. As shown in Figure 12.2(a), each operation is designed to reduce one of the three redundancies. Figure 12.2(b) depicts the corresponding source decoder.

(For the block diagram, see page 176 of the Manipal University BCA 6th semester Image Processing book.)

In the first stage of the source encoding process, the mapper transforms the input data into a format designed to reduce interpixel redundancies in the input image. This operation generally is reversible and may or may not directly reduce the amount of data required to represent the image. Run-length coding is an example of a mapping that directly results in data compression in this initial stage of the overall source encoding process. The representation of an image by a set of transform coefficients is an example of the opposite case: here the mapper transforms the image into an array of coefficients, making its interpixel redundancies more accessible for compression in later stages of the encoding process.
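As a toy illustration of such a mapping, a run-length coder for one image row could look like this (a sketch, not a production codec):

# Toy run-length coder: maps a row of pixel values to (value, run_length) pairs.
def rle_encode(row):
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1            # extend the current run
        else:
            runs.append([value, 1])     # start a new run
    return runs

def rle_decode(runs):
    return [value for value, length in runs for _ in range(length)]

row = [255, 255, 255, 255, 0, 0, 255, 255]
encoded = rle_encode(row)               # [[255, 4], [0, 2], [255, 2]]
assert rle_decode(encoded) == row       # the mapping is reversible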
In the second stage, the quantizer block reduces the accuracy of the mapper's output in accordance with some pre-established fidelity criterion. This stage reduces the psychovisual redundancies of the input image. Because this operation is irreversible, it must be omitted when error-free compression is desired.
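For example, a uniform quantizer simply rounds values to a coarser grid (a toy sketch; the step size of 16 is arbitrary):

# Toy uniform quantizer: coarser steps discard precision (psychovisual redundancy).
import numpy as np

pixels = np.array([12, 13, 14, 200, 201, 203])
step = 16                                        # arbitrary quantization step
quantized = np.round(pixels / step).astype(int)  # indices passed to the symbol coder
reconstructed = quantized * step                 # decoder's best estimate
# reconstructed == [16 16 16 192 208 208]: the original values cannot be recovered.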
In the third and final stage, the symbol coder creates a fixed- or variable-length code to represent the quantizer output and maps the output in accordance with the code. The term symbol coder distinguishes this coding operation from the overall source encoding process. In most cases, a variable-length code is used to represent the mapped and quantized data set. It assigns the shortest code words to the most frequently occurring output values and thus reduces coding redundancy. The operation, of course, is reversible. Upon completion of the symbol coding step, the input image has been processed to remove each of the three redundancies.
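As a hedged sketch of such a variable-length code (not tied to any particular standard), the following builds a small Huffman-style code that gives shorter codewords to more frequent quantizer outputs:

# Toy Huffman-style symbol coder: frequent symbols receive shorter codewords.
import heapq
from collections import Counter

def huffman_code(symbols):
    freq = Counter(symbols)
    if len(freq) == 1:                          # degenerate case: single symbol
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, tie_breaker, {symbol: codeword_so_far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

quantized = [0, 0, 0, 0, 1, 1, 2, 3]            # hypothetical quantizer output
code = huffman_code(quantized)                  # e.g. {0: '0', 1: '10', 2: '110', 3: '111'}
bitstream = "".join(code[s] for s in quantized) # 14 bits instead of 16 fixed-length bits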
Figure 12.2(a) shows the source encoding process as three successive operations, but not all of these operations are necessarily included in every compression system.
The source decoder shown in Figure 12.2(b) contains only two components: a symbol decoder and an inverse mapper. These blocks perform the inverse operations of the source encoder's symbol encoder and mapper blocks. Because quantization results in irreversible information loss, an inverse quantizer block is not included in the general source decoder model.

Describe Thinning and Thickening.

Thinning
Thinning is a morphological operation that is used to remove selected foreground pixels from binary images, somewhat like erosion or opening. It can be used for several applications, but is particularly useful for skeletonization. In this mode it is commonly used to tidy up the output of edge detectors by reducing all lines to single pixel thickness. Thinning is normally only applied to binary images, and produces another binary image as output.
The thinning operation is related to the hit-and-miss transform, and so it is helpful to have an understanding of that operator before reading on.
How It Works
Like other morphological operators, the behavior of the thinning operation is determined by a structuring element. The binary structuring elements used for thinning are of the extended type described under the hit-and-miss transform (i.e. they can contain both ones and zeros).
The thinning operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thinning of an image I by a structuring element J is:
thin(I, J) = I − hit-and-miss(I, J)
where the subtraction is the logical subtraction defined by X − Y = X ∩ NOT Y.
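A minimal sketch of one thinning pass, implemented directly from this formula with SciPy's hit-and-miss transform (the structuring element below is just an illustrative choice, split into its foreground and background parts):

# One thinning pass: thin(I, J) = I AND NOT hit_or_miss(I, J), i.e. logical subtraction.
import numpy as np
from scipy.ndimage import binary_hit_or_miss

def thin_once(image, se_fg, se_bg):
    hm = binary_hit_or_miss(image, structure1=se_fg, structure2=se_bg)
    return image & ~hm

# Illustrative element: pixel and the row above must be foreground (se_fg),
# the row below must be background (se_bg).
se_fg = np.array([[1, 1, 1],
                  [0, 1, 0],
                  [0, 0, 0]], dtype=bool)
se_bg = np.array([[0, 0, 0],
                  [0, 0, 0],
                  [1, 1, 1]], dtype=bool)

I = np.zeros((7, 7), dtype=bool)
I[1:6, 1:6] = True                      # a solid square of foreground pixels
thinned = thin_once(I, se_fg, se_bg)    # removes the interior of the bottom edge

Repeating such passes with rotated structuring elements until nothing changes is how thinning is used for skeletonization.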

Thickening
Thickening is a morphological operation that is used to grow selected regions of foreground pixels in binary images, somewhat like dilation or closing. It has several applications, including determining the approximate convex hull of a shape, and determining the skeleton by zone of influence. Thickening is normally only applied to binary images, and it produces another binary image as output.

The thickening operation is related to the hit-and-miss transform, and so it is helpful to have an understanding of that operator before reading on.

How It Works
Like other morphological operators, the behavior of the thickening operation is determined by a structuring element. The binary structuring elements used for thickening are of the extended type described under the hit-and-miss transform (i.e. they can contain both ones and zeros).
The thickening operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thickening of an image I by a structuring element J is:
thicken(I, J) = I ∪ hit-and-miss(I, J)
Thus the thickened image consists of the original image plus any additional foreground pixels switched on by the hit-and-miss transform.
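A minimal sketch of one thickening pass, again using SciPy's hit-and-miss transform (the structuring element is an illustrative choice that grows regions downward by one pixel):

# One thickening pass: thicken(I, J) = I OR hit_or_miss(I, J).
import numpy as np
from scipy.ndimage import binary_hit_or_miss

def thicken_once(image, se_fg, se_bg):
    hm = binary_hit_or_miss(image, structure1=se_fg, structure2=se_bg)
    return image | hm

# Illustrative element: the centre pixel must be background (se_bg)
# and the pixel directly above it must be foreground (se_fg).
se_fg = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [0, 0, 0]], dtype=bool)
se_bg = np.array([[0, 0, 0],
                  [0, 1, 0],
                  [0, 0, 0]], dtype=bool)

I = np.zeros((7, 7), dtype=bool)
I[1:6, 1:6] = True                        # a solid square of foreground pixels
thickened = thicken_once(I, se_fg, se_bg) # the square grows one row downward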