
Research on Text Time Window Partition Based on LDA Model
Long Yixuan, Wang Xiaogang, Zhou Ziwei, Wang Rongsheng, Yi Huifang
[Objective/Significance] Static topic models struggle to meet users' needs for dynamic analysis, while existing dynamic topic models are computationally expensive or heavily influenced by subjective choices. To address these problems, this study proposes a text time window partitioning algorithm based on the LDA model, taking time window similarity as its starting point. [Method/Process] The study constructs a time window similarity index that combines the difference between time windows with the consistency within each time window, builds a partitioning algorithm on this index, and applies it empirically to the field of innovation research. [Results/Conclusions] The partitions were evaluated using the average JS divergence between topics within each time window (at the optimal number of topics) and the average JS divergence between different topics in adjacent time windows. On these measures, the proposed algorithm produced significantly better partitions than several fixed-length time window schemes, which supports the effectiveness of the improved LDA model for text time window partitioning. The algorithm mitigates, to some extent, the high computational cost and strong subjectivity of existing dynamic topic models, makes time window partitioning more objective and accurate, and can provide technical support for related research such as topic evolution.
LDA model / time window / dynamic topic model / text similarity / innovation research
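The abstract evaluates partitions by the average JS (Jensen-Shannon) divergence between topics inside a window and between topics of adjacent windows. The following is a minimal sketch of those two averages only, not the study's similarity index, which the abstract does not define. The function names, the base-2 logarithm, and the toy topic-word distributions are all assumptions made for illustration.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (base 2, range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_js_within(topics):
    """Average JS divergence over all unordered topic pairs in one window.

    Requires at least two topics.
    """
    pairs = list(combinations(topics, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


def mean_js_between(topics_a, topics_b):
    """Average JS divergence over all topic pairs drawn from two windows."""
    total = sum(js_divergence(p, q) for p in topics_a for q in topics_b)
    return total / (len(topics_a) * len(topics_b))


# Hypothetical topic-word distributions over a 3-word vocabulary.
window_1 = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
window_2 = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]

within_1 = mean_js_within(window_1)
between_12 = mean_js_between(window_1, window_2)
print(f"within window 1: {within_1:.4f}, between windows 1 and 2: {between_12:.4f}")
```

In practice the rows would be the topic-word distributions of an LDA model fitted separately to each window. How the two averages are combined into a single index is left open here because the abstract does not specify it.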
Table 1 Search strategy for innovation research in the Web of Science database
Data source | Time span | Language | Document type | Search query |
---|---|---|---|---|
WoS Core Collection (SSCI) | All years (1900-2022) | English | Article, Review, Note, Letter | TS=("innovat*"))AND(WC=(OPERATIONS RESEARCH MANAGEMENT SCIENCE OR AREA STUDIES OR POLITICAL SCIENCE OR BEHAVIORAL SCIENCES OR BUSINESS OR GEOGRAPHY OR BUSINESS FINANCE OR COMPUTER SCIENCE ARTIFICIAL INTELLIGENCE OR COMPUTER SCIENCE CYBERNETICS OR GREEN SUSTAINABLE SCIENCE TECHNOLOGY OR COMPUTER SCIENCE INFORMATION SYSTEMS OR COMPUTER SCIENCE INTERDISCIPLINARY APPLICATIONS OR COMPUTER SCIENCE SOFTWARE ENGINEERING OR PUBLIC ADMINISTRATION OR COMPUTER SCIENCE THEORY METHODS OR REGIONAL URBAN PLANNING OR SOCIAL ISSUES OR LAW OR SOCIAL SCIENCES BIOMEDICAL OR SOCIAL SCIENCES INTERDISCIPLINARY OR MANAGEMENT OR SOCIAL SCIENCES MATHEMATICAL METHODS OR ECONOMICS OR SOCIAL WORK OR SOCIOLOGY OR URBAN STUDIES)) NOT (WC=(HISTORY OR HISTORY OF SOCIAL SCIENCES OR HISTORY PHILOSOPHY OF SCIENCE OR INFORMATION SCIENCE LIBRARY SCIENCE))) OR (((TS=(economic* SAME science) OR TS=(economic* SAME research) OR TS=(economic* SAME technology) OR TS=(economic* SAME R&D)) AND (WC=ECONOMICS)) OR ((TS=(Technology SAME history) OR TS=(Innovation SAME history)) AND (WC=History)) OR ((TS=(science SAME policy) OR TS=(research SAME policy) OR TS=(technology SAME policy) OR TS=(innovation SAME policy))AND (WC=POLITICAL SCIENCE)) OR ((TS=(management SAME R&D) OR TS=(management SAME new product development) OR TS=(management SAME technology) OR TS=(management SAME knowledge) OR TS=("organization* innovation") OR TS=("organization* learning"))AND (WC=MANAGEMENT)) OR ((TS=(fusion SAME innovation)) AND (WC=SOCIOLOGY) |
Table 2 Dynamic time window partitioning results for the innovation research data
Time window | Start | End | Window width | Number of records |
---|---|---|---|---|
1 | 2022 | 2019 | 4 years | 43606 |
2 | 2018 | 2014 | 5 years | 28226 |
3 | 2013 | 2009 | 5 years | 20019 |
4 | 2008 | 2002 | 7 years | 19676 |
5 | 2001 | 1986 | 16 years | 16308 |
6 | 1985 | 1970 | 16 years | 1828 |
7 | 1969 | 1950 | 20 years | 467 |
Table 3 Topics identified in the 2019-2022 time window of the innovation research data (partial)
Topic | Topic words |
---|---|
Topic 0 | innovation, patent, knowledge sharing, SME, value co-creation, absorptive capacity, new product development, dynamic capability, intellectual property, corporate entrepreneurship |
Topic 1 | China, innovation, productivity, Japan, deliberative democracy, developing country, regional development, Indonesia, environment, Vietnam, foreign direct investment |
Topic 2 | environmental regulation, decision making, innovation diffusion, local government, healthcare, medical device, performance measurement, strategy, health technology assessment, R&D intensity |
Topic 3 | Brazil, new product development, South Korea, UK, economic growth, sharing economy, granger causality, university, service innovation, electric vehicle |
Topic 4 | business model innovation, firm performance, social entrepreneurship, corporate sustainability, Germany, absorptive capacity, machine learning, performance, entrepreneurial orientation, higher education |
Topic 5 | climate change, renewable energy, developing country, energy transition, governance, business model innovation, environmental performance, climate change, review, smart city |
… | … |
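Table 4 Mean evaluation indicators for different time window partitioning results on the innovation research data
Evaluation indicator | Proposed partitioning algorithm | 10-year fixed window | 20-year fixed window |
---|---|---|---|
F (within-window consistency) | 0.3689 | 0.4786 | 0.5395 |
F (between-window difference) | 1.7923 | 2.4445 | 1.2317 |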