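All-words word sense disambiguation (WSD) can be treated as a sequence labeling problem. This paper proposes two sequence-labeling methods for all-words WSD, based on the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM), respectively. First, we model all-words WSD with an HMM. Then, because the HMM can use only the word form as its observation, we generalize it to an MEMM that integrates a large number of contextual features. All-words WSD has a very large state space, so both the HMM and the MEMM suffer from data sparsity and high time complexity; we address these problems with a beam-search Viterbi algorithm and smoothing strategies. Finally, we evaluate on the Senseval-2 and Senseval-3 datasets. The proposed MEMM method achieves an F1 score of 0.654, surpassing all sequence-labeling-based methods in this evaluation.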
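The beam-pruned Viterbi decoding mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `beam_viterbi`, the dictionary-based probability tables, the constant `LOG_FLOOR` (a crude stand-in for a real smoothing strategy), and the toy sense labels are all assumptions made for illustration.

```python
import math

# Assumed floor for unseen events; a stand-in for a proper smoothing strategy.
LOG_FLOOR = math.log(1e-6)

def beam_viterbi(words, candidates, log_start, log_trans, log_emit, beam_width=3):
    """Decode a sense sequence with Viterbi, keeping only `beam_width` states per position."""
    if not words:
        return []
    # Each beam entry is (log score, best path ending in that sense).
    beam = []
    for s in candidates[words[0]]:
        score = log_start.get(s, LOG_FLOOR) + log_emit.get((s, words[0]), LOG_FLOOR)
        beam.append((score, [s]))
    beam = sorted(beam, key=lambda e: e[0], reverse=True)[:beam_width]

    for w in words[1:]:
        best = {}  # sense -> (score, path): best surviving predecessor per sense
        for s in candidates[w]:
            emit = log_emit.get((s, w), LOG_FLOOR)
            for score, path in beam:
                cand = score + log_trans.get((path[-1], s), LOG_FLOOR) + emit
                if s not in best or cand > best[s][0]:
                    best[s] = (cand, path + [s])
        # Pruning step: only the top `beam_width` senses are extended further.
        beam = sorted(best.values(), key=lambda e: e[0], reverse=True)[:beam_width]

    return max(beam, key=lambda e: e[0])[1]

# Toy data (hypothetical senses and probabilities).
candidates = {
    "bank": ["bank%finance", "bank%shore"],
    "deposit": ["deposit%money", "deposit%sediment"],
}
log_start = {"bank%finance": math.log(0.6), "bank%shore": math.log(0.4)}
log_trans = {
    ("bank%finance", "deposit%money"): math.log(0.9),
    ("bank%finance", "deposit%sediment"): math.log(0.1),
    ("bank%shore", "deposit%money"): math.log(0.2),
    ("bank%shore", "deposit%sediment"): math.log(0.8),
}
log_emit = {(s, w): 0.0 for w, senses in candidates.items() for s in senses}

result = beam_viterbi(["bank", "deposit"], candidates, log_start, log_trans, log_emit, beam_width=1)
```

With a beam of width B, each position costs roughly B times the number of candidate senses instead of the square of the full state count, which is the reason pruning helps when the state space is very large. The price is that the pruned search is no longer guaranteed to return the globally best sequence.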
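This paper proposes D-SNCD (Dynamic Social Network Community Discovery), a hierarchical dynamic community discovery algorithm that can be decomposed for parallel execution. D-SNCD exploits the locality of changes in complex dynamic social networks and selectively updates branches of the hierarchical community tree HOT (Hierarchical cOmmunity Tree) that the algorithm generates. Compared with the traditional approach of running community discovery directly on snapshots of a dynamic social network, D-SNCD is markedly more efficient. Because D-SNCD extends the existing parallel method for static community discovery, P-SNCD (Parallel Social Network Community Discovery), it retains P-SNCD's advantages, such as high scalability and high resolution. In addition, D-SNCD requires only simple parameter input from the user. Rigorous mathematical proofs and extensive experimental data establish the correctness and effectiveness of the algorithm.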
Microblogs have become an important platform for people to publish and spread information and to acquire knowledge. This paper focuses on the problem of discovering user interest in microblogs. We propose a topic mining model based on Latent Dirichlet Allocation (LDA), named the user-topic model. For each user, interests are divided into two parts according to how the microblogs are generated: original interest and retweet interest. We present a Gibbs sampling implementation for inferring the parameters of the model, which discovers both a user's original interest and the user's retweet interest. We then combine original interest and retweet interest to compute interest words for each user. Experiments on a dataset of Sina microblogs demonstrate that our model discovers user interest effectively and outperforms existing topic models on this task. We also find that original interest and retweet interest are similar and that the topics of interest contain the user labels. The interest words discovered by our model reflect the user labels but cover a much broader range.
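For readers unfamiliar with Gibbs sampling for topic models, the following is a minimal sketch of a collapsed Gibbs sampler for plain LDA. It is not the user-topic model of this abstract: the split into original and retweet interest is not modeled, and the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are assumptions made for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic
    assign = []                                               # topic of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word

# Toy corpus (hypothetical): three short documents over a five-word vocabulary.
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 1, 2, 2]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=5)
```

The returned count tables can be normalized, with `alpha` and `beta` added as pseudo-counts, to estimate per-document topic distributions and per-topic word distributions.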