Sampling and Summarization for Social Networks

Abstract

With the growing popularity of online social network services such as Twitter and Facebook, social network analysis has attracted much attention in recent years. The massive amount of data created everyday on social network services have become a great source of information for different purposes such as sociology study and marketing analysis. However, the growth in scale of the real-world social networks (usually contains millions of nodes and edges) has imposed great challenges in information extraction, processing, and analysis for humans or even computers. Usually it is unrealistic to assume the full network is known, or can be processed efficiently at once. Therefore, it is crucial for data miners to devise an effective, efficient, and systematic approach to handle cases when the size of the network is too large to be handled.

Generally there are two mainstream strategies to handle large-scaled networks, sampling and summarization. In social network sampling, it is assumed that the full network is unseen or impossible to be obtained. Techniques to sample a sub-network with the goal to preserve some specific properties of the original network are needed. For example, in order to study the existence of ‘six-degree separation’ phenomenon in an online social network such as Facebook, one need to develop a crawler to extract partial network data from the social network for experiments, assuming the complete network data is not open to public. To make a convincing conclusion, one then needs to make sure the sampling process can faithfully preserve the length distribution of shortest path between nodes.

Social network summarization (some researchers call it social network compression), on the other hand, aims at addressing different issues. In summarization, the entire network structure is known in prior, but is usually too big or complex for human to visualize and for machine to store and process efficiently. In this sense, graph summarization algorithms aim at devising strategies to condense the network as much as possible without losing too much information. The summarized network can be visualized more clearly, stored more efficiently, and processed more easily.

Tutorial Outline

I. Introduction (15 min)

II. Sampling in Social Networks (75 min)

III. Summarization in Social Networks (75 min)

IV. Conclusion and Q/A (15 min)

Tutorial Slides

Tutorial slides can be downloaded here.

Specific Goals and Objectives

The goal of this tutorial is to introduce and compare the state-of-the-art solutions for social network sampling and summarization. We also highlight the research challenges and unsolved issues in these areas to encourage advanced works in these directions.

Expected Background of the Audience

We do not require the audiences to have any background knowledge on statistical sampling techniques, nor solid training on data mining. However, we expect the audience already understand some basic concepts and terminologies on social network analysis such as Power Law distribution, PageRank, and small-world phenomenon.

Presenters

Shou-de Lin holds a BS in EE from National Taiwan University, an MS-EE from the University of Michigan, and an MS in Computational Linguistics and PhD in Computer Science both from University of Southern California. In 2007, he joined the CSIE Department of National Taiwan University as an assistant professor. He leads the Machine Discovery and Social Network Mining Lab in NTU. Before joining NTU, he was a post-doctoral research fellow at the Los Alamos National Lab. Prof. Lin's research includes the areas of knowledge discovery and data mining, social network analysis, natural language processing, and machine learning. His international recognition includes the best paper award in IEEE Web Intelligent conference 2003, Google Research Award in 2007, Microsoft research award in 2008, merit paper award in TAAI 2010, US Airforce AOARD Research Award 2011, and best paper award in ASONAM 2011. He is one of the all-time winners in ACM KDD Cup, leading or co-leading the NTU teams to win championships in 2008 (co-champion with IBM Research), 2010 (student team champion and overall team champion), and 2011 (dual champions in both tracks), 2012 (champion in Track 2), and ranked 3rd in 2009. In the past 4 years, he has published more than 40 papers in the area of social network analysis and mining in several top journals and conferences such as TKDE, SNAM, KDD, SIGIR, WWW, ACL, and MM.

Mi-Yen Yeh received her B.S. and Ph.D. degrees from Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, in 2002 and 2009, respectively. She is now Assistant Research Fellow at Institute of Information Science (and Research Center for IT Innovation under joint appointment), Academia Sinica, Taipei, Taiwan. Her research interests include databases and data mining. She received Exploration Research Award of Pan Wen Yuan Foundation in 2011. She is a member of IEEE and ACM.

Cheng-Te Li is now a Ph.D. candidate in the Graduate Institute of Networking and Multimedia at National Taiwan University, Taipei, Taiwan. His research interests include social network mining and social media analytics. His international recognition includes Facebook Fellowship Finalist award 2012, ACM KDD Cup 2012 champion team, ASONAM 2011 Best Paper Award, and Microsoft Research Asia Fellowship 2010.

References

Y. Tian, R. A. Hankins and J. M. Patel.
Efficient Aggregation for Graph Summarization.
ACM SIGMOD International Conference on Management of Data (SIGMOD’08), 2008.

S. Navlakha, R. Rastogi, N. Shrivastava.
Graph summarization with bounded error.
ACM SIGMOD International Conference on Management of Data (SIGMOD’08), 2008.

Z. Shen, K. L. Ma and T. Eliassi-Rad.
Visual Analysis of Large Heterogeneous Social Networks by Semantic and Structural Abstraction.
IEEE Transactions on Visualization and Computer Graphics, 12(6), 1427–1439, 2006.

C.-T. Li and S.-D. Lin.
Egocentric Information Abstraction for Heterogeneous Social Networks.
International Conference on Advances in Social Network Analysis and Mining (ASONAM’09), 2009.

J.-Y. Li and M-Y. Yeh.
On Sampling Type Distribution from Heterogeneous Social Networks.
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’11). 2011.

J. Leskovec and C. Faloutsos.
Sampling from large graphs.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06), 2006.

A. S. Maiya and T. Y. Berger-Wolf.
Benefits of bias: towards better characterization of network sampling.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.

A. S. Maiya and T. Y. Berger-Wolf.
Sampling community structure.
ACM International World Wide Web Conference (WWW’10), 2010.

C. Hubler, H.-P. Kriegel, K. M. Borgwardt, and Z. Ghahramani.
Metropolis Algorithms for Representative Subgraph Sampling.
IEEE International Conference on Data Mining (ICDM’08), 2008.

C. L. Yang and Shou-De Lin.
Sampling Heterogeneous Social Network.
Technical Report (Master Thesis), 2012.

C. Po and Shou-De Lin.
Sampling Relational Profiles on Heterogeneous Social Network.
Technical Report (Master Thesis), 2011.

N. Zhang, Y. Tian, and J. M. Patel.
Discovery-driven graph summarization.
IEEE International Conference on Data Engineering (ICDE’10), 2010.

M. Mathioudakis, F. Bonchi, C.Castillo, A. Gionis, A. Ukkonen.
Sparsification of Influence Networks.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.

V. Satuluri, S. Parthasarathy, and Y. Ruan.
Local Graph Sparsification for Scalable Clustering.
ACM SIGMOD International Conference on Management of Data (SIGMOD’11), 2011.

A. Vattani, D. Chakrabarti, and M. Gurevich.
Preserving Personalized Pagerank in Subgraphs.
International Conference on Machine Learning (ICML’11), 2011.

K. LeFevre and E. Terzi.
GraSS: Graph Structure Summarization.
SIAM International Conference on Data Mining (SDM’10), 2010.

F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan.
On Compressing Social Networks,
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), 2009.

F. Zhou, S. Malher, and H. Toivonen.
Network simplification with minimal loss of connectivity.
IEEE International Conference on Data Mining (ICDM’10), 2010.

H. Toivonen, F. Zhou, A. Hartikainen, and A. Hinkka.
Compression of Weighted Graphs.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.

U. Kang, H. Tong, J. Sun, C. Y. Lin, and C. Faloutsos.
GBASE: A Scalable and General Graph Management System.
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), 2011.