Методы бикластеризации для анализа интернет-данных

2008 г.

Методы бикластеризации для анализа интернет-данных

Дмитрий Игнатов,
кафедра анализа данных и искусственного интеллекта ГУ-ВШЭ

Заключение

В работе дан обзор методов бикластеризации, построена их решеточная таксономия. Показана связь ассоциативных правил и бикластеризации. Проведена серия экспериментов на реальных массивах Интернет-данных.

Предложена математическая модель ассоциативных метаправил для формирования рекомендаций в системах контекстной Интернет-рекламы. Для их построения использован ФАП, алгоритмы поиска ассоциаций, а также элементы компьютерной лингвистики, такие как стемминг. Во втором важном типе метаправил используется понятие онтологии.

Основное преимущество этих методов состоит в том, что без предварительного анализа самих данных можно сгенерировать высоко достоверные рекомендации. Это позволяет выдавать рекомендации уже на ранних этапах деятельности, когда клиентская база не велика. Результаты подтверждены экспериментально. Помимо этого показана состоятельность методов ФАП для выявления относительно крупных рынков рекламодателей и слов.

Предложена модель, в которой для поиска документов-дубликатов используются частые (замкнутые) множества признаков и ФАП. Экспериментально подтверждена обоснованность данной вычислительной модели на массиве РОМИП.

Для выявления сообществ посетителей сайтов и построения их таксономий предложено использовать аппарат ФАП. Применение индекса устойчивости позволяет проводить визуализацию относительно больших массивов данных о посещаемости, что является полезным средством для веб-аналитика.

В качестве возможных направлений дальнейших исследований отметим следующее:

изучение свойств бикластеров:
- определение степени перекрытия бикластеров, густоты внутри бикластера и густоты вне;
- оценка возможности определения порядка на бикластерах, анализ их алгебраической структуры;
- изучение связи некоторых параметров бикластеризации с индексами типа устойчивости;
уточнение и расширение оснований для классификации бикластеров для построения более полной и точной таксономии алгоритмов и методов;
исследование связи методов: шумоустойчивые понятия и бокс-кластеризация, ФАП и спектральная кластеризация.

Библиография

1

Биркгоф Г. Теория решеток. – М.:Наука, 1989.

2

Евтушенко С.А., Система анализа данных "Concept Explorer," труды 7-ой Национальной Конференции по Искусственному Интеллекту (КИИ-2000), Москва, 2000, с.127-134.

3

Игнатов Д.И., Кузнецов С.О. О поиске сходства Интернет-документов с помощью частых замкнутых множеств признаков // Труды 10-й национальной конференции по искусственному интеллекту с международным участием (КИИ’06). – М.:Физматлит, 2006, Т.2, стр.249-258.

4

С.А. Кедров, С.О. Кузнецов Исследование групп пользователей Интернет-ресурсами методами анализа формальных понятий и разработки данных (Data Mining)//Бизнес-информатика, №1—2007, стр. 45-51.

5

Кузнецов С.О. Устойчивость как оценка обоснованности гипотез, получаемых на основе операционального сходства// НТИ. Сер.2—1990. – N12. – С.21-29.

6

Кузнецов С.О., Игнатов Д.И., Объедков С.А., Самохин М.В. Порождение кластеров документов дубликатов: подход, основанный на поиске частых замкнутых множеств признаков. Интернет-математика 2005. Автоматическая обработка веб-данных. Москва: Яndex, 2005, стр. 302-319.

7

Объедков С.А., Алгоритмы и методы теории решеток и их применение в машинном обучении, Диссертация на соискание ученой степени кандидата технических наук. РГГУ, 2003.

8

Самохин М.В., Машинное обучение на узорных структурах, Диссертация на соискание ученой степени кандидата технических наук. ВИНИТИ, 2006.

9

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo, “Fast Discovery of Association Rules,” Advances in Knowledge Discovery, and Data Mining, U. Fayyad et al., eds., pp. 307-328, Menlo Park, Calif.: AAAI Press, 1996.

10

Agrawal R., Imielinski T., Swami A. Mining association rules between sets of items in large databases. Proceedings, ACM SIGMOD Conference on Management of Data, Washington D.C., pp. 207-216, 1993.

11

Amine Abou-Rjeili and George Karypis. Multilevel Algorithms for Partitioning Power-Law Graphs. IEEE International Parallel & Distributed Processing Symposium (IPDPS) (in press), 2006.

12

Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl, "Analysis of recommendation algorithms for e-commerce",ACM Conference on Electronic Commerce, pp. 158-167, 2000.

13

Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., and E. Zitzler. BicAT: a biclustering analysis toolbox, Bioinformatics, 2006 22(10):1282-1283.

14

Belohlavek R. Lattice type fuzzy order and closure operators in fuzzy ordered sets. Proc. Joint 9th IFSA World Congress and 20th NAFIPS International Conference, 2001, Vancouver, Canada, IEEE Press, pp. 2281-2286.

15

Belohlavek R., Vychodil V. What is a fuzzy concept lattice? In: Proc. CLA 2005, 3rd Int. Conference on Concept Lattices and Their Applications, September 7-9, 2005, Olomouc, Czech Republic, pp. 34-45.

16

Ben-Dor,A., Chor,B., Karp,R. and Yakhini,Z. Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In Proceed-ings of the 6th Annual International Conference on Computational Biology, ACM Press, New York, NY, USA, pp. 49-57, 2002.

17

J. Besson, C. Robardet, J-F. Boulicaut. Constraint-based mining of formal concepts in transactional data. In: Proceedings of the 8th Pacific-Asia Con-ference on Knowledge Discovery and Data Mining, 2004.

18

J. Besson, C. Robardet, J-F. Boulicaut, and S. Rome. Constraint-based bi-set mining for biologically relevant pattern discovery in microarray data. Intelligent Data Analysis journal, 9(1) :59–82, 2004.

19

Besson, J., Robardet, C., Boulicaut, J.F.: Mining a New Fault-Tolerant Pattern Type as an Alternative to Formal Concept Discovery, In: Schärfe, H.,Hitzler, P., Øhrstrøm, P. (eds.) ICCS 2006. LNCS (LNAI), vol. 4068, pp. 144-157. Springer, Heidelberg (2006).

20

A. Broder, On the resemblance and containment of documents, in Proc. Compression and Complexity of Sequences (SEQS: Sequences'97)

21

A. Broder, M. Charikar, A.M. Frieze, M. Mitzenmacher, Min-Wise Independent Permutations, in Proc. STOC, 1998.

22

A. Broder, Identifying and Filtering Near-Duplicate Documents, in Proc. Annual Symposium on Combinatorial Pattern Matching, 2000.

23

Stanislav Busygin, Gerrit Jacobsen, and Ewald Kramer. Double conjugated clustering applied o leukemia microarray data. In Proceedings of the 2nd SIAM International Conference on Data Mining, Workshop on Clustering High Dimensional Data, 2002.

24

Andrea Califano, Gustavo Stolovitzky, and Yunai Tu. Analysis of gene expression microarays for phenotype classification. In Proceedings of the International Conference on Computacional Molecular Biology, pages 75–85, 2000.

25

Cheng,Y. and Church,G. Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. pp. 93-103, 2000.

26

J. Cho, N. Shivakumar, H. Garcia-Molina, Finding replicated web collections, 1999

27

A. Chowdhury, O. Frieder, D.A.Grossman, and M.C. McCabe, Collection statistics for fast duplicate document detection, ACM Transactions on Information Systems, 20(2): 171-191, 2002

28

Davey B. A., Priestley H. A. Introduction to Lattices and Order. – Cambridge: Cambridge University Press, 2002.

29

Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), pages 269–274, 2001.

30

Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. Information-theoretical co-clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03), pages 89–98, 2003.

31

Eckes, T., and Orlik, P., An Error Variance Approach to Two-mode Hierachical Clustering, Journal of Classification, 10, 51-74.

32

Freeman, L.: Cliques, Galois lattices, and the structure of human social groups. Social Networks 18 (1996) 173–187.

33

Ganter, B., and Wille, R. Formal Concept Analysis: Mathematical Foundations, Springer, 1999.

34

G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. In Proceedings of the Natural Academy of Sciences USA, pages 12079–12084, 2000.

35

G. Grahne and J. Zhu, Efficiently Using Prefix-trees in Mining Frequent Itemsets, in Proc. FIMI Workshop, 2003.

36

J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

37

Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association 67 (337): 123-129.

38

Ahmad M. Hasnah. A New Filtering Algorithm for Duplicate Document Based on Concept Analysis.Journal of Computer Science, Vol. 2 Issue 5, pp. 434-440, 2006.

39

P. Becker, J. H. Correia. “The ToscanaJ Suite for Implementing Conceptual Information Systems”. In: Formal Concept Analysis. Ed. by Bernhard Ganter, Gerd Stumme, and Rudolf Wille. Vol. 3626. Lecture Notes in Computer Science. Berlin, Heidelberg, and New York: Springer–Verlag, pp. 324–348,2005.

40

T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk , Evaluating Strategies for Similarity Search on the Web, in Proc. WWW'2002, Honolulu, 2002.

41

Bruce Hendrickson and Robert Leland. The Chaco User's Guide: Version 2.0, Sandia Tech Report SAND94-2692, 1994.

42

Thomas Hofmann and Jaz Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 668–693, 1999.

43

Ihmels,J. et al. Defining transcription modules using large-scale gene expression, data. Bioinformatics, 20, 1993-2003, 2004.

44

S. Ilyinsky, M.Kuzmin, A. Melkov, I. Segalovich, An efficient method to detect duplicates of Web documents with the use of inverted index, in Proc. 11th Int. World Wide Web Conference (WWW'2002).

45

Yuval Klugar, Ronen Basri, Joseph T. Chang, and Mark Gerstein. Spectral biclustering of microarray data: coclustering genes and conditions. In Genome Research, volume 13, pages 703–716, 2003.

46

A. Kolcz, A. Chowdhury, J. Alspector, Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization, in Proc. KDD'04, Seattle, 2004.

47

S.O. Kuznetsov and S.A. Obiedkov, Comparing Performance of Algorithms for Generating Concept Lattices, Journal of Experimental and Theoretical Artificial Intelligence, vol. 14 (2002), pp. 189-216.

48

Kuznetsov, S.O.: On stability of a formal concept. In SanJuan, E., ed.: JIM, Metz,France (2003)

49

Kuznetsov, S. O.: On stability of a formal concept. Annals of Mathematics and Artificial Intelligence 49, 101-115 (2007).

50

Sergei O. Kuznetsov, Dmitrii I. Ignatov, Concept Stability for Constructing Taxonomies of Web-site Users//Proc. Satellite Workshop "Social Network Analysis and Conceptual Structures: Exploring Opportunities" at the 5th International Conference Formal Concept Analysis (ICFCA'07), Clermont-Ferrand,P. 19-24, 2007.

51

Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Technical report, Stanford University, 2000.

52

Luxenburger M. Implications partielles dans un contexte. Mathématiques, Informatique et Sciences Humaines, 113 (29) : 35-55, 1991.

53

Jinze Liu and Wei Wang. Op-cluster: Clustering by tendency in high dimensional space. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 187–194, 2003.

54

Sara C. Madeira and Arlindo L. Oliveira, "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE/ACM Transactions on Computational Biology and Bioinformatics, VOL 1, NO. 1, pp. 24-45 January-March 2004.

55

Mirkin, B.G.: Additive Clustering and Qualitative Factor Analysis Methods for Similarity Matrices. Journal of classifiacation 4, 7-31 (1987).

56

Mirkin, B.G., Arabie, P., Hubert L.: Additive Two-Mode Clustering: The Error-Variance approach Revisited. Journal of classifiacation 12, 243-263 (1995).

57

Mirkin, B.G. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.

58

Murali,T.M. and Kasif,S. Extracting conserved gene expression motifs from gene expression data. Pac. Symp. Biocomput., 8, 77-88, 2003.

59

N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Efficient Mining of Association Rules Using Closed Itemset Lattices, Inform. Syst., 24(1), 25-46, 1999.

60

Prelic, A., Bleuer, S., Zimmermann, P., Wille, A., Bhlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evalua-tion of biclustering methods for gene expression data. Bioinformatics 22(9), 1122-1129, 2006.

61

W. Pugh, M. Henzinger, Detecting duplicate and near-duplicate files United States Patent 6658423 (December 2, 2003).

62

Matt Rasmussen and George Karypis. gCLUTO: An Interactive Clustering, Visualization, and Analysis System. UMN-CS TR-04-021, 2004.

63

Camille Roth, Sergei Obiedkov, Derrick G. Kourie: "Towards Concise Representation for Taxonomies of Epistemic Communities",CLA 4th International Conference on Concept Lattices and their Applications (2006)

64

Rome, J.E., Haralick, R.M.: Towards a formal concept analysis approach to exploring communities on the world wide web. In Ganter, B., Godin, R., eds.: ICFCA 2005. Volume 3403 of LNAI. (2005) 33–48

65

Roth, C., Bourgine, P.: Lattice-based dynamic and overlapping taxonomies: the case of epistemic communities. Scientometrics 69(2) (2006)

66

Eran Segal, Ben Taskar, Audrey Gasch, Nir Friedman, and Daphne Koller. Rich probabilistic models for gene expression. In Bioinformatics, volume 17 (Suppl. 1), pages S243–S252, 2001.

67

Eran Segal, Ben Taskar, Audrey Gasch, Nir Friedman, and Daphne Koller. Decomposing gene expression into cellular processes. In Proceedings of the Pacific Symposium on Biocomputing, volume 8, pages 89–100, 2003.

68

Qizheng Sheng, Yves Moreau, and Bart De Moor. Biclustering micrarray data by gibbs sampling. In Bioinformatics, volume 19 (Suppl. 2), pages ii196–ii205, 2003.

69

Shepard, R.N., and Arabie, P. Additive Clustering: Representation of Similarities as Comdinations of Discrete Overlapping Properties, Psyhological Review, 86, 87 -123 (1979).

70

N. Shivakumar, H. Garcia-Molina, Finding near-replicas of documents on the web. Proceedings of Workshop on Web Databases (WebDB'98), 1998.

71

Michael Steinbach, George Karypis and Vipin Kumar. A Comparison of Document Clustering Techniques. KDD Workshop on Text Mining, 2000.

72

G. Stumme and R. Taouil and Y. Bastide and N. Pasqier and L. Lakhal. Computing Iceberg Concept Lattices with Titanic. J. on Knowledge and Data Engineering, (42)2:189-222,2002.

73

G. Stumme, A. Mädche. FCA Merge: Bottom-Up Merging of Ontologies. Proc.17th Intl. Conf. on Artificial Intelligence (IJCAI '01). Seattle, WA, USA, 225-230,2001.

74

L. Szathmary and A. Napoli. CORON: A Framework for Levelwise Itemset Mining Algorithms. In B. Ganter, R. Godin, and E. Mephu Nguifo, editors, Supplementary Proceedings of The Third International Conference on For-mal Concept Analysis ICFCA '05, Lens, France, pages 110-113, Feb 2005.

75

L. Szathmary, A. Napoli, and S. O. Kuznetsov. ZART: A Multifunctional Itemset Miner Algorithm. LORIA Research Report A05-R-013, Feb 2005.

76

Amos Tanay, Roded Sharan, and Ron Shamir. Discovering statistically significant biclusters in gene expression data. In Bioinformatics, volume 18 (Suppl. 1), pages S136–S144, 2002.

77

A. Tanay. R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", In Handbook of Computational Molecular Biology, Edited by Srinivas Aluru, Chapman (2004)

78

Chun Tang, Li Zhang, Idon Zhang, and Murali Ramanathan. Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. In Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering, pages 41–48, 2001.

79

Lyle Ungar and Dean P. Foster. A formal statistical approach to collaborative filtering. In Proceedings of the Conference on Automated Learning and Discovery (CONALD’98), 1998.

80

Petko Valtchev, David Grosser, Cyril Roume, Mohamed Rouane Hacene, GALICIA: an open platform for lattices, in Using Conceptual Structures: Contributions to the 11th Intl. Conference on Conceptual Structures (ICCS'03), pp. 241-254, Dresde (DE), Shaker Verlag, 2003.

81

Haixun Wang, Wei Wang, Jiong Yang, and Philip S. Yu. Clustering by pattern similarity in large data sets. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 394–405, 2002.

82

White, D.R., Duquenne, V.: Social network & discrete structure analysis: Introduction to a special issue. Social Networks 18 (1996) 169–172.

83

Wille R. Restructuring Lattice Theory: an Approach Based on Hierarchies of Concepts // Ordered Sets / Ed. by I. Rival.– Dordrecht; Boston: Reidel, 1982. – P. 445–470.

84

Jiong Yang, Wei Wang, Haixun Wang, and Philip Yu.

-clusters: Capturing subspace correlation in a large data set. In Proceedings of the 18th IEEE International Conference on Data Engineering, pages 517–528, 2002.

85

Jiong Yang, Wei Wang, Haixun Wang, and Philip Yu. Enhanced biclustering on expression data. In Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering, pages 321–327, 2003.

86

Mohammed J. Zaki, Ching-Jui Hsiao, Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure, IEEE Transaction on Knowledge and Data Engineering, Vol 17, No. 4, pp. 462-478, 2005.

87

Y.Zhao and G. Karypis, Empirical and Theoretical Comparison of Selected Criterion Functions for Document Clustering. Machine Learning, vol. 55, pp. 311-331, 2004.

88

L.Е. Zhukov. Technical Report: Spectral Clustering of Large Advertiser Datasets Part I. Overture R&D, 2004.

Благодарности

В своей исходной версии данная работа представляет собой магистерское диссертационное исследование, поддержанное в рамках проекта «Учитель-ученики» ГУ-ВШЭ № 08-04-0022 «Разработка методов построения таксономий объектов на основе решеток формальных понятий и методов бикластеризации». Автор выражает благодарности аспиранту факультета ВМиК МГУ Кудрявцеву Юрию и доценту кафедры анализа данных и искусственного интеллекта ГУ-ВШЭ Объедкову Сергею за внимательное прочтение работы, выявление ошибок и ценные замечания. Отдельная благодарность Кудрявцеву Юрию за подготовку трансляции данного обзора из формата TeX в HTML.

Назад Содержание