Similar literature
1.
With the proliferation of big data technology, organizations are awash in data, and the sheer volume leaves them ill-equipped to handle frequent data breaches because they lack efficient data security management. Data classification, which categorizes information types and assigns distinct protective measures to each classification level, has become a hot topic as a cornerstone of data protection, especially in China in recent years. Both the text and the tables of promulgated data classification regulations (for simplicity, laws, regulations, policies, and standards are collectively referred to as “regulations”) contain a wealth of valuable information that can guide data classification work. To assist data practitioners, in this paper we automatically “grasp” expert experience on how to classify data by analyzing such regulations. We design a framework, GENONTO, that automatically extracts data classification practices (DCPs), such as information types and their corresponding sensitivity levels, to construct an information type lexicon and to encode a generic ontology on top of 38 real-world regulations promulgated in China. GENONTO employs machine learning and natural language processing (NLP) techniques to parse unstructured text and tables. To our knowledge, GENONTO is the first work that extracts critical information such as the category and sensitivity of information types from regulations and organizes it in a structured ontology that characterizes the subsumptive relations between information types. Our research provides a well-defined, integrated view across regulations and bridges the gap between what experts say and what data practitioners do.

2.
Because of the harmful impact of fabricated information on social media, many rumor verification techniques have been introduced in recent years. Advanced techniques such as multi-task learning (MTL) and shared-private models suffer from strategic limitations that restrict their capability for veracity identification on social media. These models often rely on multiple tasks to serve the primary objective. Even recent deep neural network (DNN) models such as VRoC, Hierarchical-PSV, and StA-HiTPLAN, which are based on VAE, GCN, and Transformer architectures respectively, perform well on the veracity identification task mostly with the help of additional auxiliary information. Even so, their gains remain modest relative to the proposed model, although the proposed model uses no additional information. To obtain an improved DNN architecture, we introduce globally Discrete Attention Representations from Transformers (gDART). The Discrete-Attention mechanism in gDART captures multifarious correlations veiled among the sequence of words, which existing DNN models, including the Transformer, often overlook. Our framework uses a Branch-CoRR Attention Network to extract highly informative features in branches and employs a Feature Fusion Network Component to identify deeply embedded features and use them to better identify the veracity of an unverified claim. Moreover, gDART does not depend on any costly auxiliary resource, only on an unsupervised learning process. Extensive experiments reveal that gDART achieves a considerable performance gain in veracity identification over state-of-the-art models on two real-world rumor datasets, with gains of 36.76% and 40.85% on standard benchmark metrics.

3.
Retrieving historical fine particulate matter (PM2.5) data is key to evaluating the long-term impacts of PM2.5 on the environment, human health, and climate change. Satellite-based aerosol optical depth has been used to estimate PM2.5, but these estimates have largely been undermined by massive missing values, low sampling frequency, and weak predictive capability. Here, using a novel feature engineering approach to incorporate spatial effects from meteorological data, we developed a robust LightGBM model that predicts PM2.5 with unprecedented predictive capacity on hourly (R = 0.75), daily (R = 0.84), monthly (R = 0.88), and annual (R = 0.87) timescales. By taking advantage of spatial features, our model can also construct hourly gridded networks of PM2.5. This capability would be further enhanced if meteorological observations from regional stations were incorporated. Our results show that this model has great potential for reconstructing historical PM2.5 datasets and real-time gridded networks at high spatial-temporal resolutions. The resulting datasets can be assimilated into models to produce long-term re-analyses that incorporate interactions between aerosols and physical processes.

4.
With the rapid development of social media and big data technology, users' sequential behavior can be recorded and preserved on different media platforms. Modeling user preferences by mining these sequential behaviors is crucial. The goal of sequential recommendation is to predict what a user will interact with next based on the user's historical interaction sequence. However, existing sequential recommendation methods generally adopt a negative sampling mechanism (e.g., random and uniform sampling) for pairwise learning, which leaves the model insufficiently trained and degrades its evaluation performance. We therefore propose a Non-sampling Self-attentive Sequential Recommendation (NSSR) model that combines a non-sampling mechanism with a self-attention mechanism. While keeping training efficient, NSSR takes all pairs in the training set as training samples so that the model is fully trained. Specifically, we take the interaction sequence as the current user representation and propose a new loss function to implement the non-sampling training mechanism. NSSR achieves state-of-the-art results on three public datasets, Movielens-1M, Amazon Beauty, and Foursquare_TKY, improving recommendation performance by about 29.3%, 25.7%, and 42.1%, respectively.

5.
Sequential recommendation models a user's historical sequence to predict future items. Existing studies use deep learning methods and contrastive learning for data augmentation to alleviate data sparsity. However, these methods cannot learn accurate, high-quality item representations while augmenting data. In addition, they usually ignore data noise and user cold-start issues. To address these issues, we investigate combining a Generative Adversarial Network (GAN) with contrastive learning for sequential recommendation to balance data sparsity and noise. Specifically, we propose a new framework, Enhanced Contrastive Learning with Generative Adversarial Network for Sequential Recommendation (ECGAN-Rec), which models the training process as a GAN and treats the recommendation task as the main task of the discriminator. We design a sequence augmentation module and a contrastive GAN module to implement both data-level and model-level augmentations. The contrastive GAN also learns more accurate, high-quality item representations to alleviate data noise after data augmentation. Furthermore, we propose an enhanced Transformer recommender based on the GAN to optimize the performance of the model. Experimental results on three open datasets validate the efficiency and effectiveness of the proposed model and its ability to balance data noise and data sparsity. Specifically, the improvements of ECGAN-Rec in two evaluation metrics (HR@N and NDCG@N) over the state-of-the-art model on the Beauty, Sports, and Yelp datasets are 34.95%, 36.68%, and 13.66%, respectively. Our implementation is available at https://github.com/nishawn/ECGANRec-master.
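The two metrics named in this abstract, HR@N and NDCG@N, can be illustrated with a minimal sketch. It assumes the common leave-one-out setting in which each user has a single held-out target item; the abstract does not state the evaluation protocol, and the function names and item identifiers below are hypothetical.

```python
import math

def hit_rate_at_n(ranked_items, target, n):
    """Return 1.0 if the held-out target is among the top-n ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:n] else 0.0

def ndcg_at_n(ranked_items, target, n):
    """NDCG@n for a single relevant item: 1 / log2(rank + 2), with a 0-based rank."""
    top = ranked_items[:n]
    if target not in top:
        return 0.0
    rank = top.index(target)
    return 1.0 / math.log2(rank + 2)

# Hypothetical ranking produced by some recommender for one user.
ranking = ["item_c", "item_a", "item_b", "item_d"]
print(hit_rate_at_n(ranking, "item_b", 3))  # target is at 0-based rank 2, inside the top 3
print(ndcg_at_n(ranking, "item_b", 3))      # 1 / log2(4)
```

Averaging these per-user values over all test users gives the dataset-level HR@N and NDCG@N figures that papers in this area typically report.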

6.
Document-level relation extraction (RE) aims to extract relations between entities that may span sentences. Existing methods mainly rely on two types of techniques: pre-trained language models (PLMs) and reasoning skills. Although various reasoning methods have been proposed, how to elicit learnt factual knowledge from PLMs for better reasoning ability has not yet been explored. In this paper, we propose a novel Collective Prompt Tuning with Relation Inference (CPT-RI) method for document-level RE that improves upon existing models in two respects. First, considering the long input and the various templates, we adopt a collective prompt tuning method, which is an update-and-reuse strategy: a generic prompt is first encoded and then updated with exact entity pairs to form relation-specific prompts. Second, we introduce a relation inference module that conducts global reasoning over all relation prompts via constrained semantic segmentation. Extensive experiments on two publicly available benchmark datasets demonstrate the effectiveness of CPT-RI compared with the baseline model (ATLOP (Zhou et al., 2021)), improving the F1 score by 0.57% on the DocRED dataset, 2.20% on the CDR dataset, and 2.30 on the GDA dataset. Further ablation studies verify the effects of collective prompt tuning and relation inference.

7.
As the Internet of Things (IoT) continues to expand, increasing numbers of devices that output large amounts of geographically referenced data are being deployed, partly as a result of the IoT's dynamic, decentralized, and heterogeneous architecture. These are all examples of the Internet of Things, even though one of these items might seem different from the others. The IoT connects the physical and digital worlds. Nowadays, one of the key goals of the Internet is its own development. This paper provides an in-depth analysis of IoT-based data quality and data preparation strategies developed with multinational corporations (MNCs) in mind. The goal is to make IoT data more trustworthy and practical so that MNCs can use it to make informed business decisions. The proposed structure consists of three distinct actions: gathering data, evaluating data quality, and cleaning up raw data. Research on data preprocessing is essential because preprocessing significantly affects the accuracy of predictions made in later stages. We therefore recommend a specific, useful combination of data preprocessing task types within the framework, comprising four technical elements, and briefly justify it. The IoT is a design pattern in which commonplace items are equipped with classification, sensing, networking, and processing capabilities that enable them to communicate with one another over the Internet to fulfill a specific function. The IoT will eventually turn physical objects into intelligent virtual objects. In addition to a detailed analysis of the IoT layer, this article gives an overview of the existing IoT, its technical specifics, and its applications in this recently growing field. This publication will give future scholars who wish to conduct research in this area of the IoT a better understanding.

8.
Named entity recognition (NER) is mostly formalized as a sequence labeling problem in which segments of named entities are represented by label sequences. Although considerable effort has been made to investigate sophisticated features that encode textual characteristics of named entities (e.g., PEOPLE, LOCATION), little attention has been paid to segment representations (SRs) for multi-token named entities (e.g., the IOB2 notation). In this paper, we investigate the effects of different SRs on NER tasks and propose a feature generation method using multiple SRs. The proposed method allows a model to exploit not only the highly discriminative features of complex SRs but also the features of simple SRs, which are robust against the data sparseness problem. Since it incorporates different SRs as feature functions of Conditional Random Fields (CRFs), we can use the well-established training procedure. In addition, the tagging speed of a model integrating multiple SRs can be accelerated to match that of a model using only the most complex SR of the integrated model. Experimental results demonstrate that incorporating multiple SRs into a single model improves the performance and the stability of NER. We also provide a detailed analysis of the results.
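To make the notion of a segment representation concrete, the sketch below converts a tag sequence from IOB2 (mentioned in the abstract) to IOBES, a more fine-grained scheme that also marks single-token entities (S-) and entity-final tokens (E-). IOBES is used here only as a familiar example of a "more complex SR"; the abstract does not say which SRs the authors combine, and the function name and labels are illustrative assumptions.

```python
def iob2_to_iobes(tags):
    """Convert IOB2 tags (B-X, I-X, O) to IOBES by inspecting the following tag."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the same entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

# Illustrative labels (PER, LOC, ORG); not taken from the paper.
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
print(iob2_to_iobes(tags))
```

Both tag sequences describe the same segments; a model that draws features from several such encodings at once sees the same entity boundaries at different granularities.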

9.
This paper focuses on how to efficiently find the global Approximate Closed Frequent Itemsets (ACFIs) over streams. Doing so over multiple continuous, rapid, and time-varying data streams requires a fast, incremental, real-time algorithm with a low memory cost. Based on the max-frequency window model, a Max-Frequency Pattern Tree (MFP-Tree) structure is established to maintain summary information over the global stream. Subsequently, a novel algorithm, Generating Global Approximate Closed Frequent Itemsets on Max-Frequency Window model (GGACFI-MFW), is proposed to update the MFP-Tree efficiently. The case studies show the efficiency and effectiveness of the proposed approach.

10.
The era of big data has promoted the vigorous development of many industries and unlocked the potential of holistic data-driven analysis, yet it has also been accompanied by a steady stream of data breaches. In recent years, especially in China, data security laws and regulations have been promulgated continuously, and many of them set out clear requirements for data classification. As the foundation of data security initiatives, data classification has received a great deal of attention and has been widely welcomed. The issued regulations contain a lot of valuable information, which has already been well exploited in research on privacy policy compliance verification, whereas few scholars have drawn on such information to guide data classification for security and compliance. As a step in this direction, in this paper we define two information types: “regulated data”, mentioned in external laws and regulations, and “non-regulated data”, indicating internal business data produced within an organization. We then develop a novel generalization-enhanced decision tree classification algorithm, called Gen-DT, to classify data. In this way, data covered by the relevant data security regulatory mandates can be quickly identified and handled in full compliance. Furthermore, we evaluate the proposed compliance-driven data classification scheme using datasets collected from two famous universities in China and show that our approach achieves better performance than existing popular machine learning techniques.

11.
Text-enhanced and implicit reasoning methods have been proposed for answering questions over incomplete knowledge graphs (KGs), but prior studies either rely on external resources or lack the necessary interpretability. This article extends the line of reinforcement learning (RL) methods for better interpretability and dynamically augments the original KG action space with additional actions. To this end, we propose an RL framework with a dynamic completion mechanism, namely the Dynamic Completion Reasoning Network (DCRN). DCRN consists of an action space completion module and a policy network. The action space completion module uses three sub-modules (relation selector, relation pruner, and tail entity predictor) to enrich the options for decision making. The policy network calculates a probability distribution over the joint action space and selects promising next-step actions. We also employ a beam search-based action selection strategy to alleviate delayed and sparse rewards. Extensive experiments conducted on WebQSP, CWQ, and MetaQA demonstrate the effectiveness of DCRN. Specifically, under the 50% KG setting, the Hits@1 improvements of DCRN on MetaQA-1H and MetaQA-3H are 2.94% and 1.18%, respectively. Moreover, under the 30% and 10% KG settings, DCRN outperforms all baselines by 0.9% and 1.5% on WebQSP, indicating its robustness to sparse KGs.

12.
DNA digital storage provides an alternative for information storage with high density and long-term stability. Here, we report the de novo design and synthesis of an artificial chromosome that encodes two pictures and a video clip. The encoding paradigm utilizing the superposition of sparsified error correction codewords and pseudo-random sequences tolerates base insertions/deletions and is well suited to error-prone nanopore sequencing for data retrieval. The entire 254 kb sequence was 95.27% occupied by encoded data. The Transformation-Associated Recombination method was used in the construction of this chromosome from DNA fragments and necessary autonomous replication sequences. The stability was demonstrated by transmitting the data-carrying chromosome to the 100th generation. This study demonstrates a data storage method using encoded artificial chromosomes via in vivo assembly for write-once and stable replication for multiple retrievals, similar to a compact disc, with potential in economically massive data distribution.

13.
This article reviews sources of uncertainty in broadband provision data from the Federal Communications Commission's Form 477 database, which is the largest publicly available broadband database for the United States. The uncertainty analysis reveals that reporting thresholds lead to an understatement of broadband provision in rural areas served by smaller providers in the 1999–2004 ZIP code area dataset. In the same time series, the routine used to aggregate data to larger spatial units (i.e., counties) produces variations in the amount of autocorrelation detected by diagnostic spatial statistics. The amount of autocorrelation in the data also varies with the strategy used to interpolate suppressed data. This investigation also highlights the value of a spatial approach to visualizing and analyzing the impact of uncertainty on broadband availability.

14.
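[Purpose/Significance] Social sensing is a body of theory and methods that uses massive spatiotemporal data to study the spatiotemporal characteristics of human behavior and thereby reveal the spatiotemporal distribution, connections, and processes of socioeconomic phenomena. User profiling aims to reveal the underlying regularities of groups, domains, and even social phenomena by mining user attributes and behavior patterns. User profiling is an important means of achieving social sensing. [Method/Process] Focusing on the three dimensions covered by social sensing data (emotional cognition, behavioral habits, and social network relationships), we review the user profile content dimensions that map to them and, going further, classify and describe the application scenarios within these three dimensions that are closely related to user profiling, in order to summarize their application value. [Results/Conclusions] The study finds that the data sources for user profiling have expanded to multi-source heterogeneous spatiotemporal data; that research focuses on sentiment and place semantics, spatial interaction, and social network mining; that research on spatiotemporal semantic reasoning is an extension of the field; and that application scenarios can be developed in areas such as cross-regional spatial information services, mining of urban human-land interaction, efficient government governance, and prediction of spatiotemporal correlation trends of domain events. [Innovation/Limitations] Social sensing data can support situational awareness and knowledge reasoning in multi-task scenarios. By fusing multi-source heterogeneous spatiotemporal data and integrating multiple user profile dimensions, precise knowledge services can be delivered across multiple scenarios.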

15.
Detecting events in real time from the Twitter data stream has gained substantial attention in recent years from researchers around the world, and different event detection approaches have been proposed as a result. One of the major challenges in this context is the high computational cost of real-time event detection. We propose TwitterNews+, an event detection system that incorporates specialized inverted indices and an incremental clustering approach to provide a low-computational-cost solution for detecting both major and minor newsworthy events in real time from the Twitter data stream. In addition, we conduct an extensive parameter sensitivity analysis to fine-tune the parameters used in TwitterNews+ for the best performance. Finally, we evaluate the effectiveness of our system using a publicly available corpus as a benchmark dataset. The evaluation shows a significant improvement in recall and precision over the five state-of-the-art baselines we used.
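The general idea of pairing an inverted index with incremental clustering can be sketched as follows. This is not the TwitterNews+ implementation: the class name, the Jaccard similarity measure, the whitespace tokenizer, and the 0.3 threshold are all assumptions chosen for illustration. The point is only that the index restricts similarity checks to clusters that share at least one term with the incoming post.

```python
from collections import defaultdict

class IncrementalClusterer:
    """Toy single-pass clusterer; every parameter here is an illustrative assumption."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.clusters = []             # cluster id -> set of terms seen in that cluster
        self.index = defaultdict(set)  # inverted index: term -> ids of clusters containing it

    def add(self, text):
        terms = set(text.lower().split())
        # Use the inverted index to collect only clusters sharing a term with this post.
        candidates = set()
        for term in terms:
            candidates |= self.index.get(term, set())
        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            cluster_terms = self.clusters[cid]
            sim = len(terms & cluster_terms) / len(terms | cluster_terms)  # Jaccard
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is None or best_sim < self.threshold:
            best_id = len(self.clusters)  # no close cluster: start a new one
            self.clusters.append(set())
        self.clusters[best_id] |= terms
        for term in terms:
            self.index[term].add(best_id)
        return best_id

clusterer = IncrementalClusterer(threshold=0.3)
posts = [
    "earthquake hits city centre",
    "strong earthquake hits city",
    "new phone launch today",
    "earthquake city rescue",
]
ids = [clusterer.add(p) for p in posts]
print(ids)
```

Because each post is compared only against the candidates returned by the index, the per-post cost depends on how many clusters share its vocabulary, not on the total number of clusters.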

16.
In this paper we introduce the HEMOS (Humor-EMOji-Slang-based) system for fine-grained sentiment classification of the Chinese language using a deep learning approach. We investigate the importance of recognizing the influence of humor, pictograms, and slang on the affective processing of social media. In the first step, we collected 576 frequent Internet slang expressions as a slang lexicon; then, we converted 109 Weibo emojis into textual features, creating a Chinese emoji lexicon. In the next step, by performing two polarity annotations with the new “optimistic humorous type” and “pessimistic humorous type” added to the standard “positive” and “negative” sentiment categories, we applied both lexicons to an attention-based bi-directional long short-term memory recurrent neural network (AttBiLSTM) and tested its performance on undersized labeled data. Our experimental results show that the proposed method significantly improves on state-of-the-art methods in predicting sentiment polarity on Weibo, the largest Chinese social network.

17.
We present a microfluidic technique that generates asymmetric giant unilamellar vesicles (GUVs) in the size range of 2–14 μm. In our method, we (i) create water-in-oil emulsions as the precursors to build synthetic vesicles, (ii) deflect the emulsions across two oil streams containing different phospholipids at high throughput to establish an asymmetric architecture in the lipid bilayer membranes, and (iii) direct the water-in-oil emulsions across the oil–water interface of an oscillating oil jet in a co-flowing confined geometry to encapsulate the inner aqueous phase inside a lipid bilayer and complete the fabrication of GUVs. In the first step, we utilize a flow-focusing geometry with precisely controlled pneumatic pressures to form monodisperse water-in-oil emulsions. We observed different regimes in forming water-in-oil multiphase flows by changing the applied pressures and discovered a hysteretic behavior in jet breakup and droplet generation. In the second step of GUV fabrication, an oil stream containing phospholipids carries the emulsions into a separation region where we steer the emulsions across two parallel oil streams using active dielectrophoretic and pinched-flow fractionation separations. We explore the effect of applied DC voltage magnitude and carrier oil stream flow rate on the separation efficiency. We develop an image processing code that measures the degree of mixing between the two oil streams as the water-in-oil emulsions travel across them under dielectrophoretic steering to find the ideal operational conditions. Finally, we utilize an oscillating co-flowing jet to complete the formation of asymmetric giant unilamellar vesicles and transfer them to an aqueous phase. We investigate the effect of flow rates on properties of the co-flowing jet oscillating in the whipping mode (i.e., wavelength and amplitude) and define the phase diagram for the oil-in-water jet. Assays used to probe the lipid bilayer membrane of fabricated GUVs showed that membranes were unilamellar, minimal residual oil remained trapped between the two lipid leaflets, and 83% asymmetry was achieved across the lipid bilayers of GUVs.

18.
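Given that data streams are high-speed, unbounded and continuous, and dynamically uncertain, this work addresses the identification of anomalous data in uncertain data streams from the perspective of improving data management for such streams. First, wavelet analysis is used to separate the high-frequency and low-frequency components of continuous stream traffic data; second, an uncertain data stream clustering method is applied to find the anomalous points in the data. Simulation experiments show that the detection method adapts well to the uncertainty of data streams and can achieve quite good detection results under certain conditions.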
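The two-stage idea in this abstract (separate frequency components with a wavelet, then look for outliers) can be sketched in a simplified form. The sketch uses a single-level Haar transform and, in place of the uncertain-data-stream clustering step that the abstract describes but does not specify, a plain median-absolute-deviation threshold on the high-frequency coefficients. The function names, the threshold `k`, and the sample traffic values are all assumptions.

```python
import math
import statistics

def haar_step(signal):
    """One level of the Haar transform: returns (low-frequency, high-frequency) parts."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return approx, detail

def flag_anomalies(signal, k=5.0):
    """Return start indices of sample pairs whose detail coefficient is an outlier.

    Stand-in for the clustering stage: a coefficient is flagged when it lies more
    than k median absolute deviations from the median detail coefficient.
    """
    _, detail = haar_step(signal)
    med = statistics.median(detail)
    mad = statistics.median([abs(d - med) for d in detail])
    scale = mad if mad > 0 else 1e-9  # avoid division by zero on flat signals
    return [2 * i for i, d in enumerate(detail) if abs(d - med) / scale > k]

# Made-up traffic readings with one spike at index 6.
traffic = [10, 11, 10, 12, 11, 10, 50, 11, 10, 11, 12, 10]
print(flag_anomalies(traffic))
```

A sudden spike shows up as a large high-frequency coefficient while the low-frequency part keeps the overall traffic level, which is why the separation step makes the outlier easier to isolate.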

19.
We present a 91 MHz surface acoustic wave resonator with integrated microfluidics that includes a flow focus, an expansion region, and a binning region in order to manipulate particle trajectories. We demonstrate the ability to change the position of the acoustic nodes by varying the electronic phase of one of the transducers relative to the other in a pseudo-static manner. The measurements were performed at room temperature with 3 μm diameter latex beads dispersed in a water-based solution. We demonstrate the dependence of nodal position on pseudo-static phase and show simultaneous control of 9 bead streams with spatial control of −0.058 μm/deg ± 0.001 μm/deg. As a consequence of changing the position of bead streams perpendicular to their flow direction, we also show that the integrated acoustic-microfluidic device can be used to change the trajectory of a bead stream towards a selected bin with an angular control of 0.008 deg/deg ± 0.000(2) deg/deg.
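As a worked example of the two sensitivities reported above, the sketch below converts a phase change into an expected node displacement and trajectory angle. It assumes the response is linear in phase over the range used, which the abstract implies by quoting constant slopes but does not state outright; the function names and the 90-degree input are hypothetical.

```python
NODE_SHIFT_UM_PER_DEG = -0.058  # reported spatial control (μm per degree of phase)
ANGLE_DEG_PER_DEG = 0.008       # reported angular control (deg of trajectory per degree of phase)

def node_shift_um(phase_deg):
    """Expected lateral shift of an acoustic node, assuming a linear response."""
    return NODE_SHIFT_UM_PER_DEG * phase_deg

def trajectory_angle_deg(phase_deg):
    """Expected change in bead-stream trajectory angle, assuming a linear response."""
    return ANGLE_DEG_PER_DEG * phase_deg

# Hypothetical 90-degree phase change applied to one transducer.
print(round(node_shift_um(90), 2))         # -0.058 * 90
print(round(trajectory_angle_deg(90), 2))  # 0.008 * 90
```

Under this assumption, a quarter-cycle phase change moves each node by roughly 5 μm, which is larger than the 3 μm bead diameter used in the measurements.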

20.