Clinical documentation plays an important part in orthodontic care by tracking patient treatment progression. If clinical documentation is incomplete or erroneous, organizations risk not following through on what needs to be done for a particular treatment. Documentation also provides a way of tracking inventory and treatment effectiveness, and it serves as a bulwark against litigation. Enter big data analytics. To reap the benefits of reduced operating costs and a better patient experience, orthodontists must get better at harvesting structured data from their operations. This article discusses a real-life scenario, focused on emergency visits caused by bracket bond failures, using Oral4D. Using documentation, a practice (as illustrated in the example) can collect metrics on its operations, reorganize, and collect metrics again to see whether there is room for improvement.
Breast cancer biomarkers have received considerable attention for their key role in detecting and preventing the causes of breast cancer. In this paper, we study the impact of the published research related to the top genes most frequently mentioned in breast cancer articles. Our study helps governments and organizations by giving an idea of the number of studies that probably need to be targeted by their support and funding. We perform time series analysis for the most frequently mentioned biomarkers in breast cancer articles. Constructing our time series dataset involves Information Retrieval (IR), Entity Recognition (ER) and Information Extraction (IE). We build a time series for the most frequently mentioned biomarkers in breast cancer articles by computing the number of published articles that mention these biomarkers over successive periods of time. We use the autoregressive integrated moving average (ARIMA) model to help in understanding and predicting the future number of articles in the time series of breast cancer biomarkers.
Breast cancer is a serious disease that might lead to death; it is considered the most common cancer among women. A biomarker refers to a substance or process that serves as an indication of disease in the body; one common example of a disease biomarker is a genetic marker. Researchers aim to identify biomarkers that might help in detecting and preventing the causes of certain diseases. In this paper, we study the impact of the published research related to the top genes most frequently mentioned in breast cancer articles. Our study helps governments and organizations by giving an idea of the number of studies that probably need to be targeted by their support and funding. We perform time series analysis for the most frequently mentioned biomarkers in breast cancer articles. Constructing our time series dataset involves Information Retrieval (IR), Entity Recognition (ER) and Information Extraction (IE). We build a time series for the most frequently mentioned biomarkers in breast cancer articles by computing the number of published articles that mention these biomarkers over successive periods of time. We use the autoregressive integrated moving average (ARIMA) model to help in understanding and predicting the future number of articles in the time series of breast cancer biomarkers.
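As a rough illustration of the modelling step, the sketch below fits an ARIMA model to yearly article counts for a single biomarker; the gene, the counts, the ARIMA order, and the use of statsmodels are all assumptions for illustration, not the paper's actual data or configuration.

```python
# Minimal sketch (assumed data): fit an ARIMA model to yearly counts of
# breast cancer articles mentioning a single biomarker and forecast ahead.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical yearly counts of articles mentioning BRCA1 (illustration only).
years = pd.period_range("2005", "2014", freq="Y")
counts = pd.Series([120, 135, 150, 160, 172, 185, 210, 230, 245, 260], index=years)

# The (p, d, q) order is an assumption; in practice it would be chosen by
# inspecting ACF/PACF plots or by an information criterion such as AIC.
model = ARIMA(counts, order=(1, 1, 1))
fitted = model.fit()

# Forecast the number of articles for the next three years.
print(fitted.forecast(steps=3))
```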
The tremendous research effort on diseases and drug discovery has produced a huge amount of important biomedical information, most of which is hidden in the web. In addition, many databases have been created for the purpose of storing enormous amounts of information and high-throughput experiments related to the effects of drugs and diseases on genes. Thus, developing an algorithm to integrate biological data from different sources forms one of the greatest challenges in the field of computational biology. Based on our belief that data integration would result in a better understanding of the drug mode of action or the disease pathophysiology, we have developed a novel paradigm to integrate data from three major sources in order to predict novel therapeutic drug indications. Microarray data, biomedical text mining data, and gene interaction data have all been integrated to predict ranked lists of genes based on their relevance to a particular drug or disease molecular action. These ranked lists of genes are finally used as raw material for building a disease–drug connectivity map based on the enrichment between the up/down tags of a particular disease signature and the ranked lists of drugs. Using this paradigm, we report a 13% improvement in sensitivity compared with using microarray or text mining data independently. In addition, our paradigm is able to predict many clinically validated disease–drug associations that could not be captured using microarray or text mining data independently.
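The connectivity-map step can be illustrated with a Kolmogorov–Smirnov-style enrichment score between a disease signature's up-regulated tags and a drug's ranked gene list; this is a common way such maps are scored, offered here as a hypothetical sketch rather than the paper's exact formulation, and the gene names are invented.

```python
# Minimal sketch (assumed scoring): Kolmogorov-Smirnov-style enrichment of a
# disease signature's "up" tags against a drug's ranked gene list.
def ks_enrichment(ranked_genes, tag_set):
    """Return a signed enrichment score in [-1, 1] for tag_set in ranked_genes."""
    n, t = len(ranked_genes), len(tag_set)
    hits = [i for i, g in enumerate(ranked_genes) if g in tag_set]
    if not hits:
        return 0.0
    # Maximum deviation of the running hit fraction above/below the rank fraction.
    a = max((j + 1) / t - (i + 1) / n for j, i in enumerate(hits))
    b = max((i + 1) / n - j / t for j, i in enumerate(hits))
    return a if a > b else -b

# Hypothetical ranked gene list for a drug and up-regulated disease genes.
drug_ranking = ["TP53", "EGFR", "MYC", "BRCA1", "AKT1", "PTEN"]
disease_up = {"EGFR", "MYC"}
print(ks_enrichment(drug_ranking, disease_up))
```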
Recent developments in complex graph clustering methods have enabled practical applications to biological networks in different settings. Multi-scale Community Finder (MCF) is a tool to profile network communities (i.e., clusters of nodes) with control over community sizes. The controlling parameter is referred to as the scale of the network community profile. MCF is able to find communities in all major types of networks, including directed, signed, bipartite, and multi-slice networks. Its fast computation makes the tool practical for large-scale analysis (e.g., protein-protein interaction and gene co-expression networks). MCF is distributed as an open-source C++ package for academic use with both command line and user interface options, and can be downloaded at http://bsdxd.cpsc.ucalgary.ca/MCF. A detailed user manual and sample data sets are also available at the project website.
The work described in this paper is motivated by the fact that the structure of a website may not satisfy a large portion of the visiting users, who may jump between pages of the website before they land on the target page(s); this is at least partially true because access patterns were not known when the website was designed. We developed a robust framework that tackles this problem by considering both web log data and web structure data to suggest a more compact structure that could satisfy a larger user group. The study assumes that the trends recorded so far in the web log reflect well the anticipated behaviour of the users in the future. We separately analyse web log and web structure data using three techniques, namely clustering, frequent pattern mining and network analysis. The final outcome from the two stages is reflected onto one of six models, namely the network of pages, to report linking pages by the most appropriate connections.
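One ingredient of the log-analysis stage, frequent pattern mining over visitor sessions, can be sketched as follows; the sessions, page names, and support threshold are hypothetical.

```python
# Minimal sketch (assumed log format): frequent page-pair mining over web
# sessions, one ingredient of the log-analysis stage described above.
from itertools import combinations
from collections import Counter

# Hypothetical sessions: each is the set of pages one visitor requested.
sessions = [
    {"/home", "/products", "/contact"},
    {"/home", "/products", "/cart"},
    {"/home", "/about"},
    {"/products", "/cart", "/checkout"},
]

min_support = 2  # a page pair must appear in at least this many sessions
pair_counts = Counter(
    pair for s in sessions for pair in combinations(sorted(s), 2)
)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # candidate pages to be linked more directly
```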
Recent developments in technology have influenced our daily life and the way people communicate and store data. There is a clear shift from traditional methods to sophisticated techniques; this maximizes the utilization of the widely available digital media. People are able to take photos using handheld devices, and there is a massive increase in the volume of photos digitally stored. Digital devices are also shaping the medical field. Scanners are available for every part of the body to help identify problems. However, this tremendous increase in the number of digitally captured and stored images necessitates the development of advanced techniques capable of classifying and effectively retrieving relevant images when needed. Thus, content-based image retrieval (CBIR) systems have become very popular for browsing, searching and retrieving images from a large database of digital images with minimum human intervention. The research community is competing for more efficient and effective methods, as CBIR systems may be heavily employed in serving time-critical monitoring applications in homeland security, scientific and medical domains, among others. All of this motivated the work described in this paper. We propose a novel approach which uses the well-known k-means clustering algorithm and the B+-tree database indexing structure to facilitate retrieving relevant images in an efficient and effective way. Cluster validity indexes combined with majority voting are employed to verify the appropriate number of clusters. While searching for similar images, we consider images from the closest cluster and from other nearby clusters. We introduce two new parameters, named cG and cS, to determine the distance range to be searched in each cluster. These parameters enable us to find similar images even if the query image is misclustered and to further narrow down the search space for large clusters. To determine the values of cG and cS, we introduce a new formula for gain measurement and iteratively find the best gain value, setting the values accordingly. We use the Daubechies wavelet transformation to extract the feature vectors of images. The reported test results are promising. They demonstrate how using data mining techniques can improve the efficiency of the CBIR task without sacrificing much of the accuracy of the overall process.
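A simplified sketch of the cluster-then-search idea is given below: feature vectors are grouped with k-means and a query is answered from the nearest clusters rather than the whole collection. The B+-tree index and the cG/cS range parameters from the paper are omitted, and the feature vectors are random stand-ins for wavelet features; searching a couple of nearby clusters instead of only the closest one mirrors the tolerance for mis-clustered query images.

```python
# Minimal sketch (simplified): cluster image feature vectors with k-means and
# answer a query from the nearest cluster(s). The cG/cS range logic and the
# B+-tree index from the paper are omitted here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((200, 16))          # hypothetical wavelet feature vectors
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)

def query(q, top_k=5, n_clusters_to_search=2):
    # Search the closest clusters, not only the single nearest one,
    # to tolerate a mis-clustered query image.
    dist_to_centroids = np.linalg.norm(kmeans.cluster_centers_ - q, axis=1)
    candidate_clusters = np.argsort(dist_to_centroids)[:n_clusters_to_search]
    mask = np.isin(kmeans.labels_, candidate_clusters)
    candidates = np.where(mask)[0]
    dists = np.linalg.norm(features[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:top_k]]

print(query(rng.random(16)))
```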
Prediction is one of the most attractive aspects of data mining. Link prediction has recently attracted the attention of many researchers as an effective technique to be used in graph-based models in general, and in social network analysis in particular, due to the recent popularity of the field. Link prediction helps to understand associations between nodes in social communities. Existing link prediction approaches described in the literature are limited to predicting links that are anticipated to exist in the future. To the best of our knowledge, none of the previous works in this area has explored the prediction of links that could disappear in the future. We argue that the latter set of links is important to know about; such links are at least as important as, and complement, the positive link prediction process in order to plan better for the future. In this paper, we propose a link prediction model which is capable of predicting both links that might appear and links that may disappear in the future. The model has been successfully applied in two different though closely related domains, namely health care and gene expression networks. The former application concentrates on physicians and their interactions, while the second covers genes and their interactions. We have tested our model using different classifiers and the reported results are encouraging. Finally, we compare our approach with the internal links approach and conclude that our approach performs very well in both bipartite and non-bipartite graphs.
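A minimal sketch of the disappearing-link half of the task is shown below, assuming two network snapshots: existing links in the earlier snapshot are labelled by whether they survive in the later one, simple topological features are computed, and a classifier is trained. Predicting newly appearing links would add currently unlinked pairs as candidates in the same way; the graphs and features here are illustrative assumptions.

```python
# Minimal sketch (assumed setup): derive simple topological features from an
# earlier network snapshot and label existing links by whether they survive
# in a later snapshot, so a classifier can learn link disappearance.
import networkx as nx
from sklearn.linear_model import LogisticRegression

g_old = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 3), (4, 5)])
g_new = nx.Graph([(1, 2), (2, 3), (1, 3), (2, 4)])  # (3,4) and (4,5) disappeared

def features(g, u, v):
    cn = len(list(nx.common_neighbors(g, u, v)))
    union = len(set(g[u]) | set(g[v]))
    return [cn, cn / union if union else 0.0]  # common neighbours, Jaccard

# Label 1 if an old link persists in the new snapshot, 0 if it disappears.
pairs = list(g_old.edges())
X = [features(g_old, u, v) for u, v in pairs]
y = [1 if g_new.has_edge(u, v) else 0 for u, v in pairs]

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))  # predicted persistence of each existing link
```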
Data in practical applications (e.g., images, molecular biology) is mostly characterised by high dimensionality and a huge number of data instances. Although feature reduction techniques have been successful in reducing the dimensionality for certain applications, dealing with high-dimensional data is still an area which receives considerable attention in the research community. Indexing and clustering of high-dimensional data are two of the most challenging techniques and have a wide range of applications. However, these techniques suffer from performance issues as the dimensionality and size of the processed data increase. In our effort to tackle this problem, this paper demonstrates a general optimisation technique applicable to indexing and clustering algorithms which need to calculate distances and check them against some minimum distance condition. The optimisation technique is a simple calculation that finds the minimum possible distance between two points and checks this distance against the minimum distance condition, thus reusing already computed values and reducing the need to compute a more complicated distance function repeatedly. The effectiveness and usefulness of the proposed optimisation technique has been demonstrated by applying it, with successful results, to clustering and indexing techniques. We utilised a number of clustering techniques, including the agglomerative hierarchical clustering, k-means clustering, and DBSCAN algorithms. Runtime for all three algorithms with this optimisation was reduced, and the clusters they returned were verified to remain the same as those of the original algorithms. The optimisation technique also shows potential for reducing runtime by a substantial amount when indexing large databases using the NAQ-tree, and for reducing runtime as databases grow larger in both dimensionality and size.
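The sketch below shows one plausible form of such an optimisation, assuming the lower bound comes from the triangle inequality over distances to a reference point that were computed once and reused; the exact bound used in the paper may differ.

```python
# Minimal sketch (assumed form of the optimisation): use a cheap lower bound on
# the distance to skip the full distance computation when the bound already
# violates the minimum-distance condition (here, a DBSCAN-style eps check).
import numpy as np

def full_distance(a, b):
    return float(np.linalg.norm(a - b))

def within_eps(a, b, eps, d_a_ref, d_b_ref):
    # Triangle inequality: |d(a, ref) - d(b, ref)| <= d(a, b).
    # If even this lower bound exceeds eps, the expensive call is unnecessary.
    if abs(d_a_ref - d_b_ref) > eps:
        return False
    return full_distance(a, b) <= eps

rng = np.random.default_rng(1)
points = rng.random((1000, 64))
ref = points[0]
d_to_ref = np.linalg.norm(points - ref, axis=1)  # computed once, reused

eps = 0.5
q = 10
neighbours = [
    j for j in range(len(points))
    if j != q and within_eps(points[q], points[j], eps, d_to_ref[q], d_to_ref[j])
]
print(len(neighbours))
```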
With the emergence of high speed internet applications and advanced Web 2.0 based Rich Internet Applications (e.g., blogs, wikis), it has become much easier for users to publish data over the Web. This brings a challenge for Web search solutions to let individual users find the right information according to their preferences, because traditional Web search engines have been built on a “one size fits all” concept. Different users of the Web may have different preferences, and search results for the same query raised by different users may differ in priority for individual users. In this book chapter, we present the extended version and results of our proposal on community-aware personalized Web search. It is quite challenging for search engines to know the preferences of users. We have designed and developed a unique approach for finding the preferences of users from the relevant parts of the user’s social network and community. We believe that the information related to the queries posed by users may have a strong correlation with the relevant information in their social networks. In order to find out personal interests and social context, we find (1) activities of users in their social network, and (2) relevant information from the user’s social network, based on our proposed trust and relevance matrices. We have further developed a mechanism that extracts information from the user’s social network to re-rank search results returned by a search engine. We also discuss the implementation and evaluation details of our proposed solution.
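A minimal sketch of the re-ranking mechanism follows, assuming the engine scores and the social-relevance scores (derived from the trust and relevance matrices) are already available; the blending weight and all values are hypothetical.

```python
# Minimal sketch (assumed scoring): re-rank search results by blending the
# engine's own score with a social-relevance score derived from the user's
# network. The weights and both score sources are hypothetical.
def rerank(results, social_relevance, alpha=0.7):
    """results: list of (url, engine_score); social_relevance: url -> score in [0, 1]."""
    scored = [
        (url, alpha * engine_score + (1 - alpha) * social_relevance.get(url, 0.0))
        for url, engine_score in results
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

results = [("example.org/a", 0.90), ("example.org/b", 0.85), ("example.org/c", 0.80)]
social = {"example.org/b": 0.9, "example.org/c": 0.2}  # from trust/relevance matrices
print(rerank(results, social))
```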
The widespread usage of network and graph-based approaches in modeling data has proven to be effective for various applications. The network-based framework becomes more powerful when it is expanded to benefit from the widely available techniques for data mining and machine learning, which allow for effective knowledge discovery from the investigated domain. The underlying reason for the substantial efficacy of studying graphs, either directly (i.e., data is given in graph format, for example, the “phone-call” network in studying social evolution) or indirectly (the network is inferred from data by a predefined method or scheme, such as a co-occurrence network for studying genetic behavior), is the fact that graph structures emphasize the intrinsic relationships between entities, i.e., nodes (or vertices) in the network (in this chapter, the terms network and graph are used interchangeably). For the indirect case, information extraction techniques may be adapted to investigate open sources of data in order to derive the required network structure as reflected in the currently available data. This is a tedious but effective process and could lead to more realistic and up-to-date information being reflected in the network. The latter network will lead to better and close to real-time knowledge discovery, provided online information extraction is affordable and available. Estimating network structure has also attracted the attention of other researchers involved in terrorist network analysis.
The rapid development in automated communication and the diversity of computing platforms necessitated and motivated the development of a platform-independent data format that could smoothly provide for portability and extensibility. Intensive research efforts over the past two decades have produced XML as the de facto standard for platform-independent sharing of data, which is the most valuable commodity for maintaining successful and competitive performance. Data is the most valuable source of knowledge. Once data is acquired, it can be queried to retrieve explicit content and it can be mined to extract and predict implicit content. XML has been embraced as a data model mainly due to its simplicity, readability, and portability, i.e., its ability to be transported over well-established protocols, such as HTTP. XML is very similar to HTML in structure, making it an ideal data format to be used in conjunction with HTTP; furthermore, HTML parsers can be easily adapted for dealing with XML data. However, XML serves a purpose different from that of HTML: while the latter is intended for data formatting, the former specifies and describes structure and context for the data by allowing the user to decide on his/her own tags, structure, nesting, etc. Finally, XML documents can be highly structured, based on an accompanying XML Schema.
XML is attractive for data exchange between different platforms, and the number of XML documents is rapidly increasing. This raises the need for techniques capable of investigating the similarity between XML documents to help in classifying them for better organized utilization. In fact, the idea of similarity between documents is not new. However, XML documents are richer and more informative than classical documents in the sense that they encapsulate both structure and content, whereas classical documents are characterized only by their content. Accordingly, using both the content and structure of XML documents to assign a similarity metric is relatively new. Of the recent research and algorithms proposed in the literature, the majority assign a similarity metric between 0.0 and 1.0 when comparing two XML documents. The similarity measures between multiple XML documents may be arranged in a matrix on which data mining may be done to cluster closely related documents. In this chapter we present a novel way to represent XML document similarity in 3D space. Our approach benefits from the characteristics of XML documents to produce a measure that can be used in clustering and classification techniques, information retrieval, and searching methods for the case of XML documents. We mainly derive a three-dimensional vector per document by considering two dimensions as the document’s structural and content characteristics, while the third dimension is a combination of both. The outcome from our research allows users to intuitively visualize document similarity.
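One plausible instantiation of the three-dimensional representation is sketched below, using distinct element paths as the structure dimension, distinct content tokens as the content dimension, and their ratio as the combined dimension; these particular measures are assumptions for illustration, not necessarily the chapter's exact definitions.

```python
# Minimal sketch (one plausible instantiation, not the chapter's exact measures):
# map each XML document to a 3D vector of (structure, content, combined) scores.
import xml.etree.ElementTree as ET

def doc_vector(xml_text):
    root = ET.fromstring(xml_text)
    paths, tokens = set(), []
    def walk(node, path):
        paths.add(path + "/" + node.tag)
        if node.text and node.text.strip():
            tokens.extend(node.text.split())
        for child in node:
            walk(child, path + "/" + node.tag)
    walk(root, "")
    structure = len(paths)                  # distinct element paths
    content = len(set(tokens))              # distinct content tokens
    combined = content / structure if structure else 0.0
    return (structure, content, combined)

doc = "<book><title>Data Mining</title><author>Someone</author></book>"
print(doc_vector(doc))  # a point in 3D space; nearby points = similar documents
```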
The application of social network analysis (SNA) and mining in health care domains has recently received considerable attention for its key role in understanding how doctors form communities and how they are socially connected with each other. This understanding helps enhance organizational structures and process flows. In this paper, we show how SNA techniques can solve issues in the medical referral system by analyzing the social network of general practitioners (GPs) and specialists (SPs) associated with a medical referral system in the Canadian healthcare system and similar systems. Various SNA and mining procedures are proposed, backed by experimental results.
Frequent pattern mining and, consequently, association rule mining are useful techniques for discovering relationships between items in databases. However, as the size of the data to be analyzed increases or the values of the pruning thresholds decrease, a larger number of frequent patterns and more association rules will be generated, with little information about the association rules in relation to each other. This research paper discusses a method to segment rules into different sets with no internal conflicts. The goal is to establish an effective method to reduce the difficulty for businesses to review the association rules of different customer segments, and to track the behaviors of market segments based on their buying behaviors. The method established in this paper has the advantage of not needing customer information, thus removing the need for businesses to obtain customer information and the accompanying threat of intrusions into customer privacy. The method also generates the rule sets based on conflicting rules, and dividing rules based on customer behaviors is more accurate than dividing them based on customer characteristics. The proposed method has been validated by a number of tests.
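A simple realisation of the segmentation idea is sketched below: rules are greedily placed into sets so that no set contains two rules with the same antecedent but different consequents; the rules and the greedy strategy are illustrative assumptions.

```python
# Minimal sketch (one simple realisation of the segmentation idea): place
# association rules into sets so that no set contains two conflicting rules,
# where a conflict means the same antecedent implying different consequents.
def conflicts(rule_a, rule_b):
    (ant_a, cons_a), (ant_b, cons_b) = rule_a, rule_b
    return ant_a == ant_b and cons_a != cons_b

def segment(rules):
    rule_sets = []
    for rule in rules:
        for rule_set in rule_sets:
            if not any(conflicts(rule, r) for r in rule_set):
                rule_set.append(rule)
                break
        else:
            rule_sets.append([rule])
    return rule_sets

# Hypothetical rules: antecedent itemset -> consequent item.
rules = [
    (frozenset({"bread", "butter"}), "milk"),
    (frozenset({"bread", "butter"}), "jam"),   # conflicts with the first rule
    (frozenset({"beer"}), "chips"),
]
for i, s in enumerate(segment(rules)):
    print(i, s)
```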
Given a set of known classes, classification is a two-step process which uses part of the data to build a model capable of determining the class of new objects not used in the training phase. The accuracy of the classifier is one of the main criteria used to judge its usefulness. However, most of the existing classification approaches decide on a single class for a given object. We argue that fuzzy classification is more attractive because it is closer to the real case, where it is hard to identify a single class per object. To tackle this problem, we developed a framework which produces fuzzy association rules and uses them to build the classifier model. There are two important factors to consider: the method to create fuzzy association rules must be accurate, and the method to build the classifier must be accurate as well. In this paper, we describe a method to perform fuzzy association rule mining and classification, and we test our results based on several factors, including accuracy and varying levels of support and confidence.
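The fuzzy-classification idea can be sketched as follows, assuming triangular membership functions and a handful of hand-written fuzzy rules; in the framework described above the rules would instead be mined from data.

```python
# Minimal sketch (assumed membership functions and rules): fuzzify a numeric
# attribute and let fuzzy rules vote for classes with their firing strengths,
# so an object can receive graded membership in more than one class.
def triangular(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Fuzzy sets for the attribute "age" (boundaries are illustrative).
age_sets = {"young": (0, 20, 40), "middle": (30, 45, 60), "old": (50, 70, 100)}

# Hypothetical fuzzy association rules: fuzzy term -> class label.
rules = [("young", "low_risk"), ("middle", "medium_risk"), ("old", "high_risk")]

def classify(age):
    scores = {}
    for term, label in rules:
        strength = triangular(age, *age_sets[term])
        scores[label] = max(scores.get(label, 0.0), strength)
    return scores

print(classify(35))  # graded membership in both low_risk and medium_risk
```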
Prediction is one of the most attractive aspects of data mining. Link prediction has recently attracted the attention of many researchers as an effective technique to be used in social network analysis to understand the associations between nodes in social communities. It has been shown in the literature that link prediction techniques are limited to predicting the existence of links in the future. To the best of our knowledge, none of the previous works in this area has explored the prediction of links that could disappear in the future. In this paper, we propose a link prediction model that is capable of predicting both links that might appear and links that may disappear in the future. The model has been successfully applied in two different domains, namely health care and the stock market. We have tested our model using different classifiers and the reported results are encouraging.
Gene expression data is characterized by high dimensionality and a small number of samples. Reducing the dimensionality is essential for effective analysis of the samples and efficient knowledge discovery. In practice, there is a tradeoff between feature selection and maintaining acceptable accuracy: the target is to find the reduction level, or compact set of features, which once used for knowledge discovery will lead to acceptable accuracy. Realizing the importance of dimensionality reduction for gene expression data, this paper presents a novel framework which integrates dimensionality reduction with classification for gene expression data analysis. In other words, we present techniques for feature selection and demonstrate their effectiveness once coupled with data mining techniques for knowledge discovery. We concentrate on four feature selection techniques, namely chi-square, consistency subset, clustering-based and community-based. The effectiveness of the feature reduction techniques is demonstrated by coupling them with classification techniques, namely associative classification, support vector machines (SVM) and the naive Bayesian classifier. The reported test results are encouraging; they demonstrate the applicability and effectiveness of the proposed framework.
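As a rough illustration of one configuration of such a framework, the sketch below couples chi-square feature selection with an SVM classifier on synthetic data standing in for a gene expression matrix; the data, the number of selected features, and the scikit-learn pipeline are assumptions for illustration.

```python
# Minimal sketch (assumed pipeline): chi-square feature selection followed by
# an SVM classifier, evaluated by cross-validation, on synthetic data standing
# in for a gene expression matrix (many features, few samples).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 60 samples, 2000 features, few of them informative.
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

pipeline = make_pipeline(
    MinMaxScaler(),              # chi2 requires non-negative feature values
    SelectKBest(chi2, k=50),     # keep a compact set of 50 features
    SVC(kernel="linear"),
)
print(cross_val_score(pipeline, X, y, cv=5).mean())
```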
In parallel with the rapid growth of Internet technologies, security becomes more critical in various real-life applications such as e-finance, e-health, and e-government. These applications strictly require data authentication mechanisms. To address this essential issue, we adopt the idea of client-based authenticity for interactive web technologies. We propose a novel method for client-based web form signing and verification using an XML data structure. Our method specifically uses XML to structure the data exchanged between web applications. Our method also curbs Denial of Service (DoS) attacks, protecting the server. In order to illustrate our ideas, we applied our digital signature mechanism to health-related forms in two commonly used web browsers.
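The core signing-and-verification flow can be sketched as follows, assuming an RSA key pair and a serialized form payload; the XML Signature details and browser integration described in the paper are simplified away.

```python
# Minimal sketch (assumed keys and payload): sign serialized form data on the
# client side and verify it on the server, the core of the client-based
# authenticity idea; the XML wrapping is simplified to a byte string.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Hypothetical serialized form fields (in the paper this would be an XML form).
form_payload = b"<form><patient>Jane Doe</patient><dose>5mg</dose></form>"

# Client side: sign the payload before submitting it.
signature = private_key.sign(form_payload, padding.PKCS1v15(), hashes.SHA256())

# Server side: verify the signature before trusting the form content.
try:
    public_key.verify(signature, form_payload, padding.PKCS1v15(), hashes.SHA256())
    print("form accepted")
except InvalidSignature:
    print("form rejected")
```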
This paper presents a comprehensive approach for the transformation of ODL Schemas into XML Schemas. The approach starts with an incomplete set of rules described in the literature to assist in the transformation process. The fact that the rules provide a solid foundation for expansion, as well as the fact that they cover only a small subset of ODL, was our main motivation for continuing the study of this topic. In this paper, we first analyze an existing set of nine transformation rules. After evaluating the correctness and completeness of the rules, we propose improvements and extensions leading to a more complete set of rules that covers the whole transformation process. By modifying the existing rule set, we are able to handle a much wider variety of ODL. Finally, we discuss some ODL scenarios that the original rule set cannot handle; this is meant to justify the need for the proposed extension as described in this paper. The resulting, more complete rule set is capable of handling a larger subset of ODL (including dictionaries, global and local scope enumerations, and, most importantly, inheritance).
Traditional telephony has lasted about 100 years as the basic means of voice communication because of its reliability, which satisfies normal needs. Packet-switching networks, especially the rapidly spreading Internet, attract more and more applications because of their flexibility and efficiency. One of the most important applications is VoIP, which transmits voice in the same way data is transmitted, taking into account that voice is a real-time application. Once we can transmit voice over an IP network, we do not need to route calls using the expensive and large central switches of the PSTN, since we can route calls in the same manner that we route data in an IP network. This is the task of the softswitch; a softswitch may be seen as the counterpart of the central office or PBX in the PSTN. Finally, this project outlines the overall VoIP communication system and then shows how to implement a complete telephone system based on IP protocols. The implementation of this project will improve the telephony service and management at IUG.