Monday 10 February 2014

What is Metadata Management? Explain Integrated Metadata Management with a block diagram.

Metadata management can be defined as the end-to-end process and governance framework for creating, controlling, enhancing, attributing, defining and managing a metadata schema, model or other structured aggregation system, either independently or within a repository, together with the associated supporting processes.
The purpose of metadata management is to support the development and administration of the data warehouse infrastructure as well as the analysis of the data over time.
Metadata is widely considered a promising driver for improving the effectiveness and efficiency of data warehouse usage, development, maintenance and administration. Data warehouse usage can be improved because metadata provides end users with the additional semantics needed to reconstruct the business context of data stored in the data warehouse.
Integrated Metadata Management:
An integrated metadata management system supports all kinds of users who are involved in the data warehouse development process. End users, developers and administrators can all access the metadata. Developers and administrators mainly focus on technical metadata but may also make use of business metadata when they need it. They need metadata to understand transformations of object data and the underlying data flows, as well as the technical and conceptual system architecture.


Several metadata management systems are in existence. One such system/tool is the Integrated Metadata Repository System (IMRS). It is a metadata management tool used to support a corporate data management function and is intended to provide metadata management services. Thus, the IMRS supports the engineering and configuration management of data environments incorporating e-business transactions, complex databases, federated data environments, and data warehouses / data marts. The metadata contained in the IMRS is used to support application development, data integration, and the system administration functions needed to achieve data-element semantic consistency across a corporate data environment, and to implement integrated or shared data environments.

Define the process of Data Profiling, Data Cleansing and Data Enrichment.

Data Profiling:
Data profiling is the process of examining the data available in an existing data source and collecting statistics and information about that data. The purpose of these statistics may be to:

  1. Find out whether existing data can easily be used for other purposes.
  2. Give metrics on data quality, including whether the data conforms to company standards.
  3. Assess the risk involved in integrating data for new applications, including the challenges of joins.
  4. Track data quality.
  5. Assess whether metadata accurately describes the actual values in the source database.
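As a concrete illustration, the kinds of statistics a profiler gathers (record count, null count, distinct values per column) can be sketched in a few lines of Python. The column names and sample rows below are invented for the example.

```python
# Per-column statistics of the kind a profiler reports: record count,
# null count and number of distinct values.
from collections import defaultdict

def profile(records):
    stats = defaultdict(lambda: {"count": 0, "nulls": 0, "distinct": set()})
    for rec in records:
        for col, val in rec.items():
            s = stats[col]
            s["count"] += 1
            if val is None or val == "":
                s["nulls"] += 1          # treat None/empty string as null
            else:
                s["distinct"].add(val)
    return {col: {"count": s["count"], "nulls": s["nulls"],
                  "distinct": len(s["distinct"])} for col, s in stats.items()}

rows = [{"id": 1, "city": "Pune"},
        {"id": 2, "city": None},
        {"id": 3, "city": "Pune"}]
report = profile(rows)
# report["city"] -> {'count': 3, 'nulls': 1, 'distinct': 1}
```

A real profiler would add min/max, value-length distributions and pattern analysis, but the principle is the same.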

Data Cleansing:
Data cleansing or Data Scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
Data cleansing involves the following tasks: 
  1. Converting data fields to common format 
  2. Correcting errors 
  3. Eliminating inconsistencies 
  4. Matching records to eliminate duplicates 
  5. Filling missing values etc. 
After cleansing, a data set will be consistent with other similar data sets in the system.
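The listed tasks can be sketched in miniature in Python: convert fields to a common format, correct trivial errors (stray whitespace, inconsistent case), match records to eliminate duplicates, and fill missing values. The field names and the "unknown" fill rule are assumptions made for this sketch.

```python
# A minimal cleansing pass over customer-like records.
def cleanse(records):
    seen, out = set(), []
    for rec in records:
        name = (rec.get("name") or "unknown").strip().title()   # fill + common format
        email = (rec.get("email") or "").strip().lower()        # correct case errors
        key = (name, email)             # match records to eliminate duplicates
        if key in seen:
            continue
        seen.add(key)
        out.append({"name": name, "email": email})
    return out

raw = [{"name": " alice ", "email": "A@X.COM"},
       {"name": "Alice", "email": "a@x.com"},      # duplicate after cleansing
       {"name": None, "email": "b@x.com"}]         # missing name
clean = cleanse(raw)
```

After the pass, the two "Alice" records collapse into one and the record with a missing name carries the fill value, consistent with the other records in the set.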

Data Enrichment 
Data Enrichment is the process of adding value to your data. In some cases, external data providers sell data, which may be used to augment existing data. In other cases, data from multiple internal sources are simply integrated to get the “big” picture. In any event, the intended result is a data asset that has been increased in value to the user community. 
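The idea can be sketched as a simple join of internal records with an externally purchased lookup table; every name and value below is invented for the example.

```python
# Internal customer records enriched with an external reference feed
# keyed on postcode (e.g. a purchased demographics data set).
internal = [{"cust_id": 1, "postcode": "560001"},
            {"cust_id": 2, "postcode": "400001"}]
external = {"560001": {"city": "Bengaluru"},
            "400001": {"city": "Mumbai"}}

# Each record gains the externally sourced attributes for its postcode.
enriched = [{**rec, **external.get(rec["postcode"], {})} for rec in internal]
```

The enriched records carry both the internal keys and the added attributes, so the resulting data asset is more valuable to the user community than either source alone.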


Discuss the Extraction Methods in Data Warehouses

This function has to deal with numerous data sources, and you have to employ the appropriate technique for each one. Source data may come from different source machines in diverse data formats. Part of the source data may be in relational database systems, some data may follow legacy network and hierarchical data models, and many data sources may still be flat files. You may also want to include data from spreadsheets and local departmental data sets. Data extraction can therefore become quite complex.
Tools are available on the market for data extraction. You may want to consider using outside tools suitable for certain data sources. For the other data sources, you may want to develop in-house programs to do the data extraction. Purchasing outside tools may entail high initial costs. In-house programs, on the other hand, may mean ongoing costs for development and maintenance.
After you extract the data, where do you keep it for further preparation? You may perform the extraction in the legacy platform itself if that approach suits your framework. More frequently, data warehouse implementation teams extract the source data into a separate physical environment from which moving the data into the data warehouse is easier. In that separate environment, you may extract the source data into a group of flat files, a data-staging relational database, or a combination of both.
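As an illustrative sketch of that last point, the following Python loads a flat-file source into a data-staging relational database. sqlite3 stands in for the staging database, and the CSV content, table and column names are all made up for the example.

```python
# Extract a flat-file source into a staging table for later preparation.
import csv
import io
import sqlite3

# io.StringIO stands in for a real flat file from a source system.
source = io.StringIO("cust_id,region,amount\n1,EAST,250\n2,WEST,410\n")

conn = sqlite3.connect(":memory:")          # the data-staging database
conn.execute("CREATE TABLE stg_sales (cust_id INTEGER, region TEXT, amount REAL)")

rows = [(r["cust_id"], r["region"], r["amount"]) for r in csv.DictReader(source)]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
```

From the staging table, transformation and loading into the warehouse proper can proceed with ordinary SQL.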

Describe the strengths of Dimensional Model as compared to E-R Model.

Dimensional Modeling (DM) is a favorite modeling technique in data warehousing. In DM, a
model of tables and relations is constituted with the purpose of optimizing decision support
query performance in relational databases, relative to a measurement or set of measurements of
the outcome(s) of the business process being modeled. In contrast, conventional E-R models
are constituted to (a) remove redundancy in the data model, (b) facilitate retrieval of individual
records having certain critical identifiers, and (c) therefore, optimize On-line Transaction
Processing (OLTP) performance.

The strengths of the Dimensional Model as compared to the E-R model are as follows:
1. Dimensional modelling is very flexible from the user's perspective. A dimensional data model maps directly onto warehouse schemas (star or snowflake), whereas an E-R model is not mapped onto such schemas and is not used to convert normalized data into denormalized form.
2. The E-R model is used for OLTP databases that are in 1st, 2nd or 3rd normal form, whereas the dimensional data model is used for data warehousing and is deliberately denormalized (a fact table surrounded by denormalized dimension tables).
3. The E-R model contains normalized data, whereas the dimensional model contains denormalized data.
4. An E-R diagram represents the entire business or application process, and can be segregated into multiple dimensional models. That is to say, an E-R model has both a logical and a physical model, while the dimensional model has only a physical model.
5. E-R modelling revolves around entities and their relationships to capture the overall process of the system, whereas dimensional (multi-dimensional) modelling revolves around dimensions (points of analysis) for decision making, not around capturing the process.
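To make the contrast concrete, here is a minimal star schema sketch built with Python's sqlite3: a fact table of measurements joined to denormalized dimension tables. Every table and column name here is invented for illustration.

```python
# A tiny star schema: one fact table of measurements surrounded by
# denormalized dimension tables (category and brand repeat per product
# instead of being normalized into their own tables).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, category TEXT, brand TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          day INTEGER, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER, date_key INTEGER,
                          units INTEGER, revenue REAL);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery', 'Acme')")
conn.execute("INSERT INTO dim_date VALUES (20140210, 10, 'Feb', 2014)")
conn.execute("INSERT INTO fact_sales VALUES (1, 20140210, 3, 30.0)")

# A typical decision-support query: aggregate a measurement,
# sliced by a dimension attribute.
total = conn.execute("""
    SELECT SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p ON p.product_key = f.product_key
    WHERE p.category = 'Stationery'
""").fetchone()[0]
```

An OLTP-oriented E-R design would instead split category and brand into separate normalized tables to remove redundancy; the star schema accepts that redundancy to keep analytical joins short and fast.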


What is Data Mining? Explain the common techniques used in Data Mining.

In its simplest form, data mining automates the detection of relevant patterns in a database, using defined approaches and algorithms to look into current and historical data that can then be analyzed to predict future trends. Because data mining tools predict future trends and behaviors by reading through databases for hidden patterns, they allow organizations to make proactive, knowledge-driven decisions and answer questions that were previously too time-consuming to resolve.

Traditional Data Mining Tools: Traditional data mining programs help companies establish data patterns and trends by using a number of complex algorithms and techniques. Some of these tools are installed on the desktop to monitor the data and highlight trends and others capture information residing outside a database.

Dashboards: Installed in computers to monitor information in a database, dashboards reflect data changes and updates onscreen — often in the form of a chart or table — enabling the user to see how the business is performing. Historical data also can be referenced, enabling the user to see where things have changed (e.g., increase in sales from the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.

Text-mining Tools: The third type of data mining tool sometimes is called a text-mining tool because of its ability to mine data from different kinds of text — from Microsoft Word and Acrobat PDF documents to simple text files, for example. These tools scan content and convert the selected data into a format that is compatible with the tool's database, thus providing users with an easy and convenient way of accessing data without the need to open different applications.

Explain the general image compression model with a block diagram.

Figure 12.1 shows a compression system consisting of two distinct structural blocks: an encoder and a decoder. An input image f(x, y) is fed into the encoder, which creates a set of symbols from the input data. After transmission over the channel, the encoded representation is fed to the decoder, where a reconstructed output image f^(x, y) is generated. In general, f^(x, y) may or may not be an exact replica of f(x, y). If it is, the system is error free or information preserving; if not, some level of distortion is present in the reconstructed image.
Both the encoder and decoder shown in Figure 12.1 consist of two relatively independent functions or sub-blocks. The encoder is made up of a source encoder, which removes input redundancies, and a channel encoder, which increases the noise immunity of the source encoder's output. The decoder includes a channel decoder followed by a source decoder. If the channel between the encoder and decoder is noise free (not prone to error), the channel encoder and decoder are omitted, and the general encoder and decoder reduce to the source encoder and decoder, respectively.

(Diagram: see page 176 of the BCA Manipal University 6th sem Image Processing book for Figure 12.1.)

The Source Encoder and Decoder
The source encoder is responsible for reducing or eliminating any coding, interpixel or psychovisual redundancies in the input image. The specific application and associated fidelity requirements dictate the best encoding approach to use in any given situation. Normally, the approach can be modeled by a series of three independent operations. As shown in Figure 12.2(a), each operation is designed to reduce one of the three redundancies. Figure 12.2(b) depicts the corresponding source decoder.

(Diagram: see page 176 of the BCA Manipal University 6th sem Image Processing book for Figure 12.2.)

In the first stage of the source encoding process, the mapper transforms the input data into a format designed to reduce interpixel redundancies in the input image. This operation generally is reversible and may or may not directly reduce the amount of data required to represent the image. Run-length coding is an example of a mapping that directly results in data compression in this initial stage of the overall source encoding process. The representation of an image by a set of transform coefficients is an example of the opposite case. Here the mapper transforms the image into an array of coefficients, making its interpixel redundancies more accessible for compression in later stages of the encoding process.
In the second stage, the quantizer block reduces the accuracy of the mapper's output in accordance with some pre-established fidelity criterion. This stage reduces the psychovisual redundancies of the input image. Because this operation is irreversible, it must be omitted when error-free compression is desired.
In the third and final stage, the symbol coder creates a fixed- or variable-length code to represent the quantizer output and maps the output in accordance with the code. The term symbol coder distinguishes this coding operation from the overall source encoding process. In most cases, a variable-length code is used to represent the mapped and quantized data set. It assigns the shortest code words to the most frequently occurring output values and thus reduces coding redundancy. The operation, of course, is reversible. Upon completion of the symbol coding step, the input image has been processed to remove each of the three redundancies.
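A variable-length symbol coder of the kind described can be sketched with a Huffman-style construction (a standard heap-based recipe, not code from the book): the most frequent quantizer outputs receive the shortest code words.

```python
# Huffman-style variable-length code construction: repeatedly merge the
# two least frequent nodes, prepending a bit to every code word inside
# the merged subtrees.
import heapq
from collections import Counter

def huffman_codes(symbols):
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: [weight, tiebreak, [symbol, code], [symbol, code], ...]
    heap = [[n, i, [s, ""]] for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]         # left branch
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]         # right branch
        heapq.heappush(heap, [lo[0] + hi[0], nxt] + lo[2:] + hi[2:])
        nxt += 1
    return {s: code for s, code in heap[0][2:]}

data = [0, 0, 0, 0, 1, 1, 2, 3]             # e.g. quantizer output values
codes = huffman_codes(data)
```

For the sample data the most frequent value 0 gets a one-bit code while the rare values get three bits, which is exactly the coding-redundancy reduction described above.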
Figure 12.2(a) shows the source encoding process as three successive operations, but all the operations are not necessarily included in every compression system.
The source decoder shown in Figure 12.2(b) contains only two components: a symbol decoder and an inverse mapper. These blocks perform the inverse operations of the source encoder's symbol encoder and mapper blocks. Because quantization results in irreversible information loss, an inverse quantizer block is not included in the general source decoder model.

Describe Thinning and Thickening.

Thinning
Thinning is a morphological operation that is used to remove selected foreground pixels from binary images, somewhat like erosion or opening. It can be used for several applications, but is particularly useful for skeletonization. In this mode it is commonly used to tidy up the output of edge detectors by reducing all lines to single pixel thickness. Thinning is normally only applied to binary images, and produces another binary image as output.
The thinning operation is related to the hit-and-miss transform, and so it is helpful to have an understanding of that operator before reading on.
How It Works
Like other morphological operators, the behavior of the thinning operation is determined by a structuring element. The binary structuring elements used for thinning are of the extended type described under the hit-and-miss transform (i.e. they can contain both ones and zeros).
The thinning operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thinning of an image I by a structuring element J is:

    thin(I, J) = I - hit-and-miss(I, J)

where the subtraction is a logical subtraction defined by X - Y = X AND NOT Y.
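A pure-Python sketch of this definition (illustrative only): the structuring element uses 1 for required foreground, 0 for required background and None for don't-care positions, as in the extended elements described above.

```python
# Thinning via the hit-and-miss transform on a small binary image.

def hit_and_miss(img, se):
    """1 wherever the structuring element se matches img (origin at centre)."""
    h, w = len(img), len(img[0])
    kh, kw = len(se), len(se[0])
    oy, ox = kh // 2, kw // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            match = True
            for j in range(kh):
                for i in range(kw):
                    want = se[j][i]
                    if want is None:
                        continue                     # don't-care position
                    yy, xx = y + j - oy, x + i - ox
                    got = img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                    if got != want:
                        match = False
            out[y][x] = 1 if match else 0
    return out

def thin(img, se):
    """thin(I, J) = I AND NOT hit_and_miss(I, J)  (logical subtraction)."""
    hm = hit_and_miss(img, se)
    return [[p & (1 - q) for p, q in zip(r1, r2)] for r1, r2 in zip(img, hm)]

img = [[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]]
se = [[0, 0, 0],
      [None, 1, None],
      [1, 1, 1]]          # matches a top-edge pixel with foreground below
out = thin(img, se)
```

A single pass with one element removes only the matching pixels; in practice thinning is iterated with a sequence of rotated elements until the image stops changing, which is how skeletonization is obtained.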

Thickening
Thickening is a morphological operation that is used to grow selected regions of foreground pixels in binary images, somewhat like dilation or closing. It has several applications, including determining the approximate convex hull of a shape, and determining the skeleton by zone of influence. Thickening is normally only applied to binary images, and it produces another binary image as output.

The thickening operation is related to the hit-and-miss transform, and so it is helpful to have an understanding of that operator before reading on.

How It Works
Like other morphological operators, the behavior of the thickening operation is determined by a structuring element. The binary structuring elements used for thickening are of the extended type described under the hit-and-miss transform (i.e. they can contain both ones and zeros).
The thickening operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thickening of an image I by a structuring element J is:

    thicken(I, J) = I OR hit-and-miss(I, J)
Thus the thickened image consists of the original image plus any additional foreground pixels switched on by the hit-and-miss transform.
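A small Python sketch of this definition, assuming a 3×3 structuring element with 1 = required foreground, 0 = required background and None = don't care (illustrative only):

```python
# Thickening via the hit-and-miss transform on a small binary image.

def hit_and_miss(img, se):
    """1 wherever the 3x3 structuring element se matches img."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ok = True
            for j in range(3):
                for i in range(3):
                    want = se[j][i]
                    if want is None:
                        continue                     # don't-care position
                    yy, xx = y + j - 1, x + i - 1
                    got = img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                    if got != want:
                        ok = False
            out[y][x] = int(ok)
    return out

def thicken(img, se):
    """thicken(I, J) = I OR hit_and_miss(I, J)."""
    hm = hit_and_miss(img, se)
    return [[p | q for p, q in zip(r1, r2)] for r1, r2 in zip(img, hm)]

img = [[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]]
se = [[1, 1, 1],
      [None, 0, None],
      [0, 0, 0]]          # matches a background pixel just below the shape
out = thicken(img, se)
```

Note that the element's centre is 0 for thickening (a background pixel to be switched on), whereas for thinning it is 1; the matched pixels are added to, rather than subtracted from, the original image.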

Explain all the types of Digital Images. Also differentiate between them.

When you want to put graphics on your website, you'll face an unexpected problem: what format should they be in? On their own computers, many people save pictures in Windows' default BMP (bitmap) format, but the files it creates are simply much too large to put on a website. They would take about a minute for visitors to download and use up all your bandwidth in the process.
When you put pictures on the web, you need to consider the trade-off you want between image quality and speed.

The following are the different types of image formats.
GIF:
The Graphics Interchange Format was developed in 1987 at the request of CompuServe, who needed a platform independent image format that was suitable for transfer across slow connections. It is a compressed (lossless) format (it uses the LZW compression) and compresses at a ratio of between 3:1 and 5:1.
The Graphical Interchange Format (GIF) is one of the most widely used image formats on the web. GIF files are recognizable by their .gif file extension. GIF is suitable for images with sharp edges and relatively few gradations of color, such as line art, cartoons, and text. You can also create background transparencies and animations using GIF images. It is an 8 bit format, which means the maximum number of colors supported by the format is 256.
There are two GIF standards, 87a and 89a (developed in 1987 and 1989 respectively). The 89a standard has additional features such as improved interlacing, the ability to define one color as transparent, and the ability to store multiple images in one file to create a basic form of animation. It is commonly used for fast-loading web pages. It also makes a great banner or logo for your webpage. Animated pictures are also saved in GIF format; for example, a flashing banner would be saved as a GIF file.

JPEG:
JPEG (pronounced "jay-peg") is a standardized image compression mechanism. JPEG stands for Joint Photographic Experts Group, the original name of the committee that wrote the standard. JPEG compresses either full-color (24 bit) or grayscale images, and works best with photographs and artwork. For geometric line drawings, lettering, cartoons, computer screen shots, and other images with flat color and sharp borders, the PNG and GIF image formats are usually preferable. The extensions for JPEG are .jpg, .jpeg, .jpe.
JPEG uses a lossy compression method, meaning that the decompressed image isn't quite the same as the original. (There are lossless image compression algorithms, but JPEG achieves much greater compression than is possible with lossless methods.) This method fools the eye by using the fact that people perceive small changes in color less accurately than small changes in brightness.
JPEG was developed for two reasons: it makes image files smaller and it stores 24-bit per pixel color data (full color) instead of 8-bit per pixel data. Making image files smaller is important for storing and transmitting files. Being able to compress a 2MB full-color file down to, for example, 100KB makes a big difference in disk space and transmission time. JPEG can easily provide 20:1 compression of full-color data. (With GIF images, the size ratio is usually more like 4:1.)

TIFF:
Tagged Image File Format (TIFF) is a variable-resolution bit mapped image format developed by Aldus (now part of Adobe) in 1986. TIFF is very common for transporting color or gray-scale images into page layout applications, but is less suited to delivering web content.

The characteristics of TIFF are:

  1. TIFF files are large and of very high quality. Baseline TIFF images are highly portable; most graphics, desktop publishing, and word processing applications understand them.
  2. The TIFF specification is readily extensible, though this comes at the price of some of its portability. Many applications incorporate their own extensions, but a number of application-independent extensions are recognized by most programs.
  3. Four types of baseline TIFF images are available: bi-level (black and white), gray scale, palette (i.e., indexed), and RGB (i.e., true color). RGB images may store up to 16.7 million colors. Palette and gray-scale images are limited to 256 colors or shades. A common extension of TIFF also allows for CMYK images.
  4. TIFF files may or may not be compressed. A number of methods may be used to compress TIFF files, including the Huffman and LZW algorithms. Even compressed, TIFF files are usually much larger than similar GIF or JPEG files.
PNG:
PNG (pronounced ping as in ping-pong; for Portable Network Graphics) was developed as a replacement for the GIF standard, partly because of legal entanglements resulting from GIF's use of the patented LZW compression scheme, and partly because of GIF's many limitations. PNG files are recognizable by their .png file extension.

PNG is superior to GIF in many ways, offering the following features:

  1. Images that are the same size or slightly smaller than their GIF counterparts, while keeping lossless compression
  2. Support for indexed colors, gray-scale, and RGB (millions of colors)
  3. Support for 2-D progressive rendering, which is based on pixels rather than lines (as in interlaced GIFs and progressive JPEGs); this means that contents of a progressively rendered PNG file become apparent earlier in the load process
  4. An alpha channel that allows an image to have multiple levels of opacity, whereas GIF only allows a given pixel to be fully transparent or fully opaque. This feature lets you create images with degrees of transparency, better blending images with their backgrounds
  5. Gamma correction, which allows you to correct for differences in how an image will appear on different computer display systems
  6. File integrity checks, which help prevent problems while downloading or transferring PNG files

Unlike GIF89a, PNG does not support multiple images within the same image file, which means that you can't make animations with PNG as you can with GIF. The PNG format is described as "extensible," however; software houses will be able to develop variations of PNG that can contain multiple, scriptable images. For transmission of some types of images (e.g., true-color photographs and black-and-white images), other file formats may give better results. Most graphics applications, and virtually all browsers, support the PNG format.

Explain the basic concepts of Sampling and Quantization

To create a digital image, we must convert the continuous sensed data into digital form. This involves two processes: sampling and quantization.
  Sampling: Digitizing the coordinate values.
  Quantization: Digitizing the amplitude values.


The basic idea behind sampling and quantization is illustrated in Fig. 2.5. Figure 2.5(a) shows a continuous image, f(x, y), that we want to convert to digital form. An image may be continuous with respect to the x- and y-coordinates, and also in amplitude. To convert it to digital form, we have to sample the function in both coordinates and in amplitude. Digitizing the coordinate values is called sampling; digitizing the amplitude values is called quantization.
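The two steps can be sketched in Python on a 1-D profile (math.sin stands in for the continuous signal; the sample count and number of gray levels are arbitrary choices for the example):

```python
# Sampling digitizes the coordinates; quantization digitizes the amplitudes.
import math

def sample(f, n, x_max):
    """Sampling: evaluate f at n evenly spaced coordinate values."""
    return [f(i * x_max / (n - 1)) for i in range(n)]

def quantize(samples, levels, lo=-1.0, hi=1.0):
    """Quantization: snap each amplitude to the nearest of `levels` values."""
    step = (hi - lo) / (levels - 1)
    return [round((s - lo) / step) * step + lo for s in samples]

xs = sample(math.sin, 9, math.pi)   # 9 samples over [0, pi]
q = quantize(xs, 5)                 # 5 allowed gray levels in [-1, 1]
```

After both steps, the continuous profile is represented by a finite grid of coordinates, each carrying one of a finite set of amplitude values, which is exactly what a digital image is.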

State the applications of Image Processing and list the examples

Digital image processing has a broad spectrum of applications, such as remote sensing via satellites and other spacecraft, image transmission and storage for business applications, medical processing, radar, sonar, robotics and automated inspection of industrial parts.
Images acquired by satellites are useful in tracking of earth resources, geographical mapping, prediction of agricultural crops, urban growth, weather, flood and fire control and many other environmental applications. Space image applications include recognition and analysis of objects contained in images obtained from deep space-probe missions. Image transmission and storage applications occur in broadcast television, teleconferencing, transmission of facsimile images (printed documents and graphics) for office automation, communication over computer networks, closed-circuit television based security monitoring systems, and in military communications. In medical applications one is concerned with the processing of chest X-rays, cineangiograms, projection images of transaxial tomography, and other medical images that occur in radiology, nuclear magnetic resonance (NMR) and ultrasonic scanning. These images may be used for patient screening and monitoring or for detection of tumors or other diseases in patients. Radar and sonar images are used for detection and recognition of various types of targets or in guidance and maneuvering of aircraft or missile systems.

The applications of image processing are:

  1. Agricultural (Fruit grading, harvest control, seeding, fruit picking ...)
  2. Communications (compression, video conferencing, television, ...)
  3. Character recognition (printed and handwritten)
  4. Commercial (Bar code reading, bank cheques, signature, ...)
  5. Document processing (Electronic circuits, mechanical drawings, music, ...)
  6. Human (Heads and faces, hands, body, ...)
  7. Industrial (Inspection, part pose estimation and recognition, control, ...)
  8. Leisure and entertainment (museums, film industry, photography, ...)
  9. Medical (X-rays, CT, NMR, ultrasound, intensity, ...)
  10. Military (Tracking, detection, etc.)
  11. Police (Fingerprints, surveillance, DNA analysis, biometry, ...)
  12. Traffic and transport (Road, airport, seaport, license identification, ...)