Overview
My primary research interest is data mining with computational intelligence and its applications. In recent years, my research has focused on steganalysis and bioinformatics. New Mexico Tech is a National Security Agency (NSA) Center of Academic Excellence in Information Assurance Education. As a PhD student in the Computer Science Department, the core unit of the Information Assurance Education Center, I have also taken part in several other projects involving information security.
Past Work
Steganography
Steganography is the art and science of hiding data in digital images, audio, video, and other media. Our group presented a novel image authentication scheme that embeds a fragile, content-based cryptographic signature in the image compression domain. We also developed new algorithms that hide data in the wavelet transform domain to achieve a large hiding capacity, and we designed an algorithm for steganography that survives JPEG compression by hiding data in the low-frequency sub-band of the wavelet domain in combination with error-correction coding.
Fraud Detection and Obfuscated Code Scanner
Our group applied and extended the pattern matching tree to mine user and system behaviors, and implemented anomaly-based fraud detection at the user level and the system level [1, 2]. To detect obfuscated or polymorphic malicious code, we presented a signature-based malware detection algorithm. The signature provides a basis for detecting future variants and mutants of the malware [3].
Dissertation Research
Steganalysis
Steganalysis is the art and science of detecting the presence of data hidden by steganography. After the attacks of September 11, 2001, concern arose that terrorists might communicate by hiding data in digital images. Detecting hidden data is very challenging for some steganography systems, such as LSB matching steganography in grayscale images.
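For readers unfamiliar with LSB matching, the following is a minimal sketch of the embedding step in Python. It assumes 8-bit grayscale pixels in a flat list and visits pixels in sequential order; practical systems usually choose pixel positions with a keyed pseudorandom generator. The function names are illustrative and do not come from any particular tool.

```python
import random


def lsb_matching_embed(pixels, bits, rng):
    """Embed message bits with LSB matching (the +/-1 embedding rule).

    When a pixel's least significant bit already equals the message bit,
    the pixel is left alone. Otherwise the pixel is randomly incremented
    or decremented by one, which flips its LSB. Values are kept in 0..255.
    """
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if stego[i] % 2 == bit:
            continue
        if stego[i] == 0:
            stego[i] = 1          # cannot go below 0
        elif stego[i] == 255:
            stego[i] = 254        # cannot go above 255
        else:
            stego[i] += rng.choice((-1, 1))
    return stego


def lsb_extract(pixels, n_bits):
    """Read the message back from the least significant bits."""
    return [p % 2 for p in pixels[:n_bits]]


# Illustrative usage with made-up pixel values.
rng = random.Random(2024)
cover = [12, 200, 0, 255, 77, 78]
message = [1, 1, 1, 0, 0, 1]
stego = lsb_matching_embed(cover, message, rng)
recovered = lsb_extract(stego, len(message))
```

Because each changed pixel moves up or down at random, the embedding does not create the pairs-of-values asymmetry that LSB replacement leaves in the histogram, which is one reason it is harder to detect.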
• One of my main contributions in this area is introducing image complexity as a critical parameter for evaluating the detection performance of steganalysis. The experimental results show that the significance of the features and the detection performance depend closely not only on the information hiding ratio but also on the image complexity [4].
• Another main contribution is a method for detecting LSB matching steganography (one of the hardest steganography systems to detect) using different features and pattern recognition techniques. This method performs much better than well-known steganalysis methods such as Histogram Characteristic Function Center Of Mass (HCFCOM) and High Order Moment statistics in Multi-Scale (HOMMS) decomposition domain [4, 5]. I also substantially extended this work and extracted additional features to improve the detection of LSB matching steganography in grayscale images; our method is superior to HCFCOM, HOMMS, adjacent HCFCOM, and calibrated adjacent HCFCOM [6, 7]. The method has also been successfully extended to the steganalysis of other steganography systems.
• In addition, I presented a scheme for detecting hidden information in JPEG images produced by steganography systems that hide data in the transform domain, such as CryptoBola, JPHS, and F5. The detection performance of our method is superior to that of HCFCOM and HOMMS [8].
Currently, I am investigating the steganalysis of Steghide, a steganography system based on a graph-theoretic approach.
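As background on the HCFCOM baseline named above, the sketch below computes the center of mass of the histogram characteristic function (the discrete Fourier transform of the intensity histogram) in Python. It is an illustration of that baseline feature only, not of the features proposed in [4-7]. Summing over the first half of the spectrum and modeling full-rate LSB matching as an expected histogram smoothing are assumptions of this sketch, and the function names are hypothetical.

```python
import cmath


def hcf_com(hist):
    """Center of mass of the histogram characteristic function (HCF).

    The HCF is the DFT of the histogram. The center of mass is the
    magnitude-weighted mean frequency index over the first half of the
    spectrum.
    """
    n = len(hist)
    mags = []
    for k in range(n // 2):
        coeff = sum(h * cmath.exp(-2j * cmath.pi * k * x / n)
                    for x, h in enumerate(hist))
        mags.append(abs(coeff))
    return sum(k * m for k, m in enumerate(mags)) / sum(mags)


def expected_stego_hist(hist, rate=1.0):
    """Expected histogram after LSB matching at the given embedding rate.

    A pixel carrying a message bit changes with probability 1/2, moving up
    or down with equal chance, so a fraction rate/4 of each bin moves to
    each neighbor. At the ends of the range the move is forced inward.
    """
    n = len(hist)
    out = [0.0] * n
    for v, h in enumerate(hist):
        out[v] += h * (1 - rate / 2)
        down = v - 1 if v > 0 else v + 1
        up = v + 1 if v < n - 1 else v - 1
        out[down] += h * rate / 4
        out[up] += h * rate / 4
    return out


# Illustrative usage: a sharply peaked cover histogram.
cover_hist = [0.0] * 256
cover_hist[128] = 1000.0
stego_hist = expected_stego_hist(cover_hist, rate=1.0)
com_cover = hcf_com(cover_hist)
com_stego = hcf_com(stego_hist)
```

The smoothing acts as a low-pass filter on the histogram, so the center of mass tends to drop after embedding. The drop is small for images whose histograms are already smooth, which is consistent with the role that image complexity plays in detection performance.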
Bioinformatics
Biomedical research is being revolutionized by new technologies for generating high-throughput data. In recent years, my research has focused on microarray data analysis and single nucleotide polymorphism (SNP) association studies.
Microarray Gene Expression Analysis
My main contribution in this area is a gene selection method that improves the classification of gene expression data, based on supervised learning and statistical measures of the chosen and candidate features. The experimental results show that, in the classification of gene expression data, my method outperforms well-known methods such as Support Vector Machine Recursive Feature Elimination (SVMRFE), Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS), and Gradient based Leave-one-out Gene Selection (GLGS) [9]. I am currently working on further improvements.
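The general idea of weighing a candidate gene against the genes already chosen can be illustrated with a generic greedy forward selection in Python. This is not the method of [9]: the scoring rule (absolute Pearson correlation with the class label minus mean absolute correlation with the selected genes), the function names, and the toy data are all assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def greedy_select(features, labels, k):
    """Pick up to k features, trading relevance against redundancy.

    features maps a gene name to its expression values across samples.
    Each round adds the candidate with the highest score, where
    score = relevance to the label - mean correlation with chosen genes.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in features.items()}
    selected = []
    candidates = set(features)
    while candidates and len(selected) < k:
        def score(gene):
            if not selected:
                return relevance[gene]
            redundancy = sum(abs(pearson(features[gene], features[s]))
                             for s in selected) / len(selected)
            return relevance[gene] - redundancy
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: geneB is a rescaled copy of geneA, geneC carries different signal.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "geneA": [1, 2, 1, 2, 5, 6, 5, 6],
    "geneB": [2, 4, 2, 4, 10, 12, 10, 12],
    "geneC": [2, 1, 2, 1, 4, 3, 4, 3],
}
picked = greedy_select(features, labels, 2)
```

In the toy data, a ranking by relevance alone would keep both geneA and its copy geneB, whereas the redundancy term favors geneC as the second pick.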
Single Nucleotide Polymorphism (SNP) Association Study
With high-throughput technologies now available in bioinformatics, genotype and gene-environment data have a very large number of variables, so biomarker discovery and genetic case/control association studies are very important to SNP analysis. Most existing feature selection methods, including modified test statistic-based approaches and model-based approaches such as logistic or mixed models, yield highly correlated significant genes that are redundant for association studies. We presented a scheme of support vector based lowest weight and supervised learning based lowest correlation feature addition for association studies. The experimental results on myocardial infarction and rheumatoid arthritis case/control data sets indicate that our method outperforms other well-known methods, such as SVMRFE and logic regression based identification of SNP interactions explanatory for the disease status, in improving classification in studies of genetic variation-disease association [10].
Differential Gene Expression on the Probe Level Data
In collaboration with researchers from Southern Methodist University and the University of Texas Southwestern Medical Center, I also worked on identifying differentially expressed genes using the information provided by probe-level data instead of gene expression values.
• We presented a new summarization technique, the Distribution Free Weighted method (DFW), which uses information about the variability in probe behavior to estimate the extent of non-specific and cross-hybridization for each probe. The contribution of the probe is weighted accordingly during summarization, without making any distributional assumptions for the probe-level data [11].
• In the identification of differentially expressed genes, current gene selection methods suffer from the multiplicity problem of simultaneous hypothesis testing, because microarray data have a small sample size and a large number of variables. Gene selection methods that control the false discovery rate (FDR) have been proposed to deal with the multiplicity issue. However, the dependence among genes within a cluster makes it difficult for those methods to estimate the number of true negatives, which limits their use in real microarray data analysis. We proposed a novel method for identifying differentially expressed genes (DEGs), Probe Level data based Identifying Differentially Expressed Genes (PLIDEG), which uses the information provided by probe-level data instead of gene expression values. With the extra information in the probe-level data, PLIDEG can both hold the type I error at a very small value and increase the power of detecting DEGs. PLIDEG can therefore separate DEGs from non-DEGs efficiently without having to estimate the number of non-DEGs. Theoretical analysis and results on real microarray data confirm these properties [12].
Currently, I am studying biomarker discovery on proteomics data sets.
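For context on the FDR-controlling methods mentioned above, the following Python sketch implements the standard Benjamini-Hochberg step-up procedure. It illustrates the class of existing methods that PLIDEG is contrasted with and is not part of PLIDEG itself; the p-values are made up for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sort the m p-values, find the largest rank r whose p-value satisfies
    p_(r) <= r * q / m, and reject the r hypotheses with the smallest
    p-values. Returns one boolean per input p-value (True = rejected).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Illustrative usage with made-up per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(p_values, q=0.05)
n_called = sum(calls)
```

The procedure is step-up: a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.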
Future Directions
My future research will concentrate on the following directions, which extend my past and current research.
Steganalysis
Although many improvements have been achieved, the steganalysis of some steganographic systems remains highly challenging. We introduced the parameter of image complexity, and to date it is still very difficult to detect hidden data in digital images with high complexity. To the best of my knowledge, no technique has been presented that successfully detects hidden information at low information hiding ratios. If a crafty terrorist hides information in highly complex images at a low information hiding ratio, successfully detecting these images is very important for homeland security.
Another important task is extracting the payload, or hidden information, from steganograms. This goal seems out of reach at present because of the uncertainty of the pixel values or transform coefficients of the images and the lack of original covers for comparison. I have some ideas for working toward it.
Bioinformatics
Finding an optimal feature set is important in bioinformatics, and I would like to combine my previous work with other researchers’ contributions. Besides investigating microarray gene expression analysis and TagSNP association studies in depth, I would focus on structural and functional genomics and proteomics, for the following reasons:
DNA sequence information provides only a static snapshot of the various ways in which the cell might use its proteins, whereas the life of the cell is a dynamic process. DNA/RNA sequences alone are therefore not enough to identify a therapeutic target clearly, because proteins, not DNA/RNA, are the basis of the mode of action of drugs. Differential display proteomics for comparing protein levels has potential applications in a wide range of diseases. Many molecular markers of disease, the basis of diagnostics, are proteins, and patterns of protein expression can be used as a guide to drug design. The application of proteomics to study underlying pharmaceutical mechanisms and to use them for drug development is referred to as pharmaceutical proteomics. Unlike classical genomic approaches that discover genes related to a disease, proteomics could characterize the disease process directly by finding sets of proteins that together participate in causing it.
In addition to the directions mentioned above, I have great interest in biological networks, large-scale systems data analysis, and other projects in bioinformatics. Because bioinformatics involves a variety of disciplines, including informatics, statistics, computer science, artificial intelligence, and biology, collaboration is essential. I look forward to joining your department and working with other faculty members and students.
References:
[1] J. Xu, A. H. Sung, Q. Liu, “Behavior Mining for Fraud Detection”, Journal of Research and Practice in Information Technology 39(1): 3-18.
[2] J. Xu, A. H. Sung, Q. Liu, “Tree Based Behavior Monitoring for Adaptive Fraud Detection”, 18th International Conference on Pattern Recognition, ICPR (1): 1208-1211. (2006)
[3] J. Xu, A. H. Sung, S. Mukkamala, Q. Liu, “Obfuscated Malicious Executable Scanner”, Journal of Research and Practice in Information Technology 39(3): 181-197.
[4] Q. Liu, A. H. Sung, J. Xu, B. Ribeiro, “Image Complexity and Feature Extraction for Steganalysis of LSB Matching Steganography”, 18th International Conference on Pattern Recognition, ICPR (2): 267-270. (2006)
[5] Q. Liu, A. H. Sung, B. Ribeiro, M. Wei, Z. Chen, J. Xu, “Image Complexity and Feature Mining for Steganalysis of Least Significant Bit Matching Steganography”, Information Sciences 178(1): 21-36. doi: 10.1016/j.ins.2007.08.007.
[6] Q. Liu, A. H. Sung, “Feature Mining and Neuro-Fuzzy Inference System for Steganalysis of LSB Matching Steganography in Grayscale Images”, 20th International Joint Conference on Artificial Intelligence, pp. 2808-2813.
[7] Q. Liu, A. H. Sung, Z. Chen, J. Xu, “Feature Mining and Pattern Classification for Steganalysis of LSB Matching Steganography in Grayscale Images”, Pattern Recognition 41(1): 56-66. doi: 10.1016/j.patcog.2007.06.005.
[8] Q. Liu, A. H. Sung, J. Xu, V. Venkataramana, “Detect JPEG Steganography Using Polynomial Fitting”, Proc. of 16th Artificial Neural Networks in Engineering, ASME Press, 2006, pp. 547-556.
[9] Q. Liu, A. Sung, Z. Chen, “Gene Selection and Classification for Microarray Data based on Supervised Learning and Similarity Measures”, (in review).
[10] Q. Liu, J. Yang, Z. Chen, M. Yang, A. Sung, X. Huang, “Supervised-Learning based TagSNPs for Genome-wide Disease Classification”, to appear in the special issue of BMC Genomics.
[11] Z. Chen, M. McGee, Q. Liu, R. H. Scheuermann, “A Distribution Free Summarization Method for Affymetrix GeneChip Arrays”, Bioinformatics 23(3): 321-327. doi: 10.1093/bioinformatics/btl609.
[12] Z. Chen, M. McGee, Q. Liu, M. Kong, R. H. Scheuermann, “A Novel Method of Identifying Differentially Expressed Genes Based on Probe Level Data for GeneChip Arrays”, (in submission).