Similar literature
 Found 20 similar documents; search time 312 ms
1.
In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes, Proc. of the IRE, 40, 1098–1101]. A modified Knuth–Morris–Pratt algorithm is used in order to overcome the problem of false matches, i.e., occurrences of the encoded pattern in the encoded text that do not correspond to an occurrence of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table is used to determine how far to back up when a mismatch is detected, and is defined so that we are always able to align the start of the encoded pattern with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes which handle more than a single bit per machine operation: skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7–23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200–1207]. Experiments show rapid search times of our algorithms compared to the “decompress then search” method; therefore, files can be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C Library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
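
The false-match problem described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it runs a plain KMP search over the bits and then filters the hits by codeword alignment, whereas the paper builds the alignment into its preprocessed table. The toy prefix code `CODE` and all function names are assumptions made for illustration only.

```python
def failure_table(p):
    """Classic KMP failure function, here over a string of '0'/'1' bits."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, p):
    """Return every bit offset at which p occurs in text."""
    fail = failure_table(p)
    hits, k = [], 0
    for i, c in enumerate(text):
        while k and c != p[k]:
            k = fail[k - 1]
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


# Hypothetical prefix code standing in for a real Huffman code.
CODE = {"a": "0", "b": "10", "c": "11"}


def encode(s):
    return "".join(CODE[ch] for ch in s)


def codeword_starts(s):
    """Bit offsets at which a codeword begins in encode(s)."""
    starts, pos = set(), 0
    for ch in s:
        starts.add(pos)
        pos += len(CODE[ch])
    return starts


text_bits = encode("cab")        # "11" + "0" + "10"
pattern_bits = encode("b")       # "10"
hits = kmp_find_all(text_bits, pattern_bits)
starts = codeword_starts("cab")
true_hits = [h for h in hits if h in starts]
false_hits = [h for h in hits if h not in starts]
```

Here the bit pattern for "b" is found twice in the encoded text, but one of those hits straddles the codewords of "c" and "a" and is therefore a false match. Because the code is prefix-free, a hit that begins on a codeword boundary necessarily decodes to the pattern, so checking the start offset is enough in this sketch.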

2.
DNA digital storage provides an alternative for information storage with high density and long-term stability. Here, we report the de novo design and synthesis of an artificial chromosome that encodes two pictures and a video clip. The encoding paradigm utilizing the superposition of sparsified error correction codewords and pseudo-random sequences tolerates base insertions/deletions and is well suited to error-prone nanopore sequencing for data retrieval. The entire 254 kb sequence was 95.27% occupied by encoded data. The Transformation-Associated Recombination method was used in the construction of this chromosome from DNA fragments and necessary autonomous replication sequences. The stability was demonstrated by transmitting the data-carrying chromosome to the 100th generation. This study demonstrates a data storage method using encoded artificial chromosomes via in vivo assembly for write-once and stable replication for multiple retrievals, similar to a compact disc, with potential in economically massive data distribution.

3.
This paper studies mathematical properties of h-index sequences as developed by Liang [Liang, L. (2006). h-Index sequence and h-index matrix: Constructions and applications. Scientometrics, 69(1), 153–159]. For practical reasons, Liang studies such sequences where the time goes backwards, while it is more logical to use the time going forward (real career periods). Both types of h-index sequences are studied here and their interrelations are revealed. We show cases where these sequences are convex, linear and concave. We also show that, when one of the sequences is convex, then the other one is concave, showing that the reverse-time sequence, in general, cannot be used to derive similar properties of the (difficult to obtain) forward-time sequence. We show that both sequences are the same if and only if the author produces the same number of papers per year. If the author produces an increasing number of papers per year, then Liang’s h-sequences are above the “normal” ones. All these results are also valid for g- and R-sequences. The results are confirmed by the h-, g- and R-sequences (forward and reverse time) of the author.
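
A minimal sketch of the two kinds of h-index sequence mentioned in this abstract may help. It assumes each paper is represented only by its publication year and a single present-day citation count, which is a simplification: a real forward-time sequence would need the counts as they stood in each year. The function names and the toy data are hypothetical, and the numbers are not meant to test the paper's theorems.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def h_sequence(papers_by_year, reverse=False):
    """h-index of the growing pool of papers, one value per year.

    reverse=False accumulates from the first career year onward;
    reverse=True accumulates from the most recent year backwards.
    """
    pool, seq = [], []
    for year in sorted(papers_by_year, reverse=reverse):
        pool.extend(papers_by_year[year])
        seq.append(h_index(pool))
    return seq


# Toy data: year -> citation counts of the papers published that year.
papers = {2001: [10], 2002: [5, 3], 2003: [2, 2, 1]}
forward = h_sequence(papers)
backward = h_sequence(papers, reverse=True)
```

Both sequences end at the same value, since the last term always covers the whole career; they differ only in the order in which years are added to the pool.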

4.
The main aim of this article is to extend the notion of strongly Cesàro summable and strongly lacunary summable real sequences to n-normed linear space valued (n-nls valued) difference sequences. Consequently we introduce the spaces |σ1|(X, Δ) and Nθ(X, Δ), respectively, where X is an n-normed space and Δ is a difference operator. We investigate these spaces for completeness as well as for the relationship between these spaces.

5.
Sequences of integers are common data types, occurring either as primary data or ancillary structures. The sizes of sequences can be large, making compression an interesting option. Effective compression presupposes variable-length coding, which destroys the regular alignment of values. Yet it would often be desirable to access only a small subset of the entries, either by position (ordinal number) or by content (element value), without having to decode most of the sequence from the start. Here such a random access technique for compressed integers is described, with the special feature that no auxiliary index is needed. The solution applies a method called interpolative coding, which is one of the most efficient non-statistical codes for integers. Indexing is avoided by address calculation guaranteeing sufficient space for codes even in the worst case. The additional redundancy, compared to regular interpolative coding, is only about 1 bit per source integer for uniform distribution. The time complexity of random access is logarithmic with respect to the source size for both position-based and content-based retrieval. According to experiments, random access is faster than full decoding when the number of accessed integers is not more than approximately 0.75 · n/log₂ n for sequence length n. The tests also confirm that the method is quite competitive with other approaches to random access coding suggested in the literature.

6.
We propose bidirectional imparting or BiImp, a generalized method for aligning embedding dimensions with concepts during the embedding learning phase. While preserving the semantic structure of the embedding space, BiImp makes dimensions interpretable, which has a critical role in deciphering the black-box behavior of word embeddings. BiImp separately utilizes both directions of a vector space dimension: each direction can be assigned to a different concept. This increases the number of concepts that can be represented in the embedding space. Our experimental results demonstrate the interpretability of BiImp embeddings without making compromises on the semantic task performance. We also use BiImp to reduce gender bias in word embeddings by encoding gender-opposite concepts (e.g., male–female) in a single embedding dimension. These results highlight the potential of BiImp in reducing biases and stereotypes present in word embeddings. Furthermore, task or domain-specific interpretable word embeddings can be obtained by adjusting the corresponding word groups in embedding dimensions according to task or domain. As a result, BiImp offers wide liberty in studying word embeddings without any further effort.

7.
We present a 3-staged method for automated learning of the spatial density function of the mass of all gravitating matter in a real galaxy, for which data exist on the observable phase space coordinates of a sample of resident galactic particles that trace the galactic gravitational potential. We learn this gravitational mass density function by embedding it in the domain of the probability density function (pdf) of the phase space vector variable, where we learn this pdf as well, given the data. We generate values of each sought function, at a design value of its input, to learn vectorised versions of each function; this creates the training data, using which we undertake supervised learning of each function, to thereafter undertake predictions and forecasting of the functional value at test inputs. We assume that the phase space that a kinematic data set is sampled from is isotropic, and we quantify the relative violation of this assumption in a given data set. The method is illustrated on the real elliptical galaxy NGC4649. The purpose of this learning is to produce a data-driven protocol that allows for computation of dark matter content in any example real galaxy, without relying on system-specific astronomical details, while undertaking objective quantification of support in the data for undertaken model assumptions.

8.
Web users often have a specific goal in mind comprising various stages that are reflected, as executed, by their mouse cursor movements. Therefore, is it possible to detect automatically which parts of those movements bear any intent and discard the parts that have no intent? Can we estimate the intent degree of the non-discarded parts? To achieve this goal, we tap into the Kinematic Theory and its associated Sigma-Lognormal model (ΣΛM). According to this theory, the production of a mouse cursor movement requires beforehand the instantiation of an action plan. The ΣΛM models such an action plan as a sequence of strokes’ velocity profiles, one stroke at a time, thus providing a reconstruction of the original mouse cursor movement. When a user intent is clear, the pointing movement is faster and the cursor movement is reconstructed almost perfectly, while the reverse is observed when the user intent is unclear. We analyzed more than 10,000 browsing sessions comprising about 5 million data points, and compared different segmentation techniques to detect discrete cursor chunks that were then reconstructed with the ΣΛM. Our main contribution is thus a novel methodology to automatically tell chunks with and without intention apart. We also contribute kinematic compression, a novel application to compress mouse cursor data while preserving most of the original information. Ultimately, this work enables a deeper understanding of mouse cursor movement production, providing an informed means to gain additional insight about users’ browsing behavior.

9.
A cold cathode discharge tube has an auxiliary tube attached from which cathode-rays are projected against the main cathode. A photo-electric cell, attached to a monochromatic illuminator, is used to measure relative intensity distribution of Hγ and Hδ respectively, from the main cathode through the cathode dark space into the negative glow with and without excitation of the auxiliary tube. When the main cathode is bombarded by the electron stream the intensity of spectral illumination in the negative glow is increased by about 20 per cent. This increase does not result from the mere addition of an illumination, which appears when the auxiliary tube is alone excited, to the illumination of the main discharge, but may be attributed to the production of soft X-rays in the gas which are capable of exciting the gas molecules. R. Seeliger and co-workers have investigated spectrophotometrically the several characteristic sections of a cold cathode discharge tube. Their method consisted of an examination, with a microphotometer, of spectrograms taken at points along the discharge. The intensity distribution of any spectral line was found continuous in passing from one portion of the discharge to another, e.g., from the Faraday dark space into the positive column, and that the maxima of illumination for different lines appeared displaced relative to one another. A. Wehnelt and A. Jachan demonstrated with several experimental arrangements the effect of bombarding the cold cathode of an ordinary discharge tube with a beam of cathode-rays. There resulted an increase in the total intensity of illumination in the tube together with a shrinkage of the cathode dark space. The purpose of the present investigation was to examine the effect such an electronic bombardment would produce upon the spectral intensity distribution near the cathode of a hydrogen discharge in and alone were capable of investigation.

10.
Modeling user profiles is a necessary step for most information filtering systems – such as recommender systems – to provide personalized recommendations. However, most of them work with users or items as vectors, by applying different types of mathematical operations between them and neglecting sequential or content-based information. Hence, in this paper we study how to propose an adaptive mechanism to obtain user sequences using different sources of information, allowing the generation of hybrid recommendations as a seamless, transparent technique from the system viewpoint. As a proof of concept, we develop the Longest Common Subsequence (LCS) algorithm as a similarity metric to compare the user sequences, where, in the process of adapting this algorithm to recommendation, we include different parameters to control the efficiency by reducing the information used in the algorithm (preference filter), to decide when a neighbor is considered useful enough to be included in the process (confidence filter), to identify whether two interactions are equivalent (δ-matching threshold), and to normalize the length of the LCS in a bounded interval (normalization functions). These parameters can be extended to work with any type of sequential algorithm. We evaluate our approach with several state-of-the-art recommendation algorithms using different evaluation metrics measuring the accuracy, diversity, and novelty of the recommendations, and analyze the impact of the proposed parameters. We have found that our approach offers a competitive performance, outperforming content, collaborative, and hybrid baselines, and producing positive results when either content- or rating-based information is exploited.
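
The idea of using LCS as a user-to-user similarity with a δ-matching threshold and a normalization step can be sketched as follows. The abstract does not give the exact definitions, so everything concrete here is an assumption: interactions are modeled as `(item, rating)` pairs, two interactions match when the items are equal and the ratings differ by at most `delta`, and the LCS length is normalized by the longer sequence. The preference and confidence filters are left out.

```python
def lcs_length(a, b, match):
    """Length of the longest common subsequence of a and b under match()."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if match(x, y):
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b, delta=0):
    """LCS length scaled into [0, 1] (assumed normalization: longer length)."""
    if not a or not b:
        return 0.0

    def match(x, y):
        # Assumed delta-matching rule: same item, ratings within delta.
        return x[0] == y[0] and abs(x[1] - y[1]) <= delta

    return lcs_length(a, b, match) / max(len(a), len(b))


# Hypothetical user sequences of (item, rating) interactions.
u = [("A", 5), ("B", 3), ("C", 4)]
v = [("A", 4), ("C", 4), ("D", 2)]
sim_strict = lcs_similarity(u, v, delta=0)   # only ("C", 4) matches
sim_loose = lcs_similarity(u, v, delta=1)    # "A" and "C" both match
```

Relaxing `delta` from 0 to 1 lets the two ratings of item "A" count as equivalent, which lengthens the common subsequence and raises the similarity.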

11.
In this article we introduce the notion of I-acceleration convergence of sequences. We prove the decomposition theorem for I-acceleration convergence of sequences as well as for subsequence transformations. We study different properties of I-acceleration convergence of sequences.

12.
In this paper, we study the Riesz basis property of the generalized eigenfunctions of a one-dimensional hyperbolic system in the energy state space. This characterizes the dynamic behavior of the system, particularly the stability, in terms of its eigenfrequencies. This system is derived from a thermoelastic equation of memory type. The asymptotic expansions for eigenvalues and eigenfunctions are developed. It is shown that there is a sequence of generalized eigenfunctions which forms a Riesz basis for the Hilbert state space. This implies the spectrum-determined growth condition for the C0-semigroup associated with the system, and as a consequence, the exponential stability of the system is concluded.

13.
Herein, we describe the development of a novel primer system that allows for the capture of double-stranded polymerase chain reaction (PCR) amplification products onto a microfluidic channel without any preliminary purification stages. We show that specially designed PCR primers consisting of the main primer sequence and an additional “tag sequence” linked through a poly(ethylene glycol) molecule can be used to generate ds-PCR amplification products tailed with ss-oligonucleotides of two forensically relevant genes (amelogenin and human c-fms (macrophage colony-stimulating factor) proto-oncogene for the CSF-1 receptor (CSF1PO)). Furthermore, with a view to enriching and eluting the ds-PCR products of amplification on a capillary electrophoretic-based microfluidic device, we describe the capture of the target ds-PCR products onto poly(dimethylsiloxane) microchannels modified with ss-oligonucleotide capture probes.

14.
Learning low dimensional dense representations of the vocabularies of a corpus, known as neural embeddings, has gained much attention in the information retrieval community. While there have been several successful attempts at integrating embeddings within the ad hoc document retrieval task, no systematic study has yet been reported that explores the various aspects of neural embeddings and how they impact retrieval performance. In this paper, we perform a methodical study on how neural embeddings influence the ad hoc document retrieval task. More specifically, we systematically explore the following research questions: (i) do methods solely based on neural embeddings perform competitively with state of the art retrieval methods with and without interpolation? (ii) is there any statistically significant difference between the performance of retrieval models when based on word embeddings compared to when knowledge graph entity embeddings are used? and (iii) is there a significant difference between using locally trained neural embeddings compared to when globally trained neural embeddings are used? We examine these three research questions across both hard and all queries. Our study finds that word embeddings do not show competitive performance to any of the baselines. In contrast, entity embeddings show competitive performance to the baselines and, when interpolated, outperform the best baselines for both hard and soft queries.

15.
Background: β-Galactosidases catalyze both hydrolytic and transgalactosylation reactions and therefore have many applications in food, medical, and biotechnological fields. Aspergillus niger has been a main source of β-galactosidase, but the properties of this enzyme are incompletely studied. Results: Three new β-galactosidases belonging to glycosyl hydrolase family 35 from A. niger F0215 were cloned, expressed, and biochemically characterized. In addition to the known activity of LacA encoded by lacA, three putative β-galactosidases, designated as LacB, LacC, and LacE encoded by the genes lacB, lacC, and lacE, respectively, were successfully cloned, sequenced, and expressed and secreted by Pichia pastoris. These three proteins and LacA have N-terminal signal sequences and are therefore predicted to be extracellular enzymes. They have the typical structure of fungal β-galactosidases with defined hydrolytic and transgalactosylation activities on lactose. However, their activity properties differed. In particular, LacB and LacE displayed maximum hydrolytic activity at pH 4–5 and 50°C, while LacC exhibited maximum activity at pH 3.5 and 60°C. All β-galactosidases performed transgalactosylation activity optimally in an acidic environment. Conclusions: Three new β-galactosidases belonging to glycosyl hydrolase family 35 from A. niger F0215 were cloned and biochemically characterized. In addition to the known LacA, A. niger has at least three β-galactosidase family members with remarkably different biochemical properties.

16.
It is shown that cartesian product and pointwise-sum with a fixed compact set preserve various approximation-theoretic properties. Results for pointwise-sum are proved for F-spaces and so hold for any normed linear space, while the other results hold in general metric spaces. Applications are given to approximation of Lp-functions on the d-dimensional cube, 1 ≤ p < ∞, by linear combinations of half-space characteristic functions; i.e., by Heaviside perceptron networks.

17.
The discovery of ProtoRAG in amphioxus indicated that vertebrate RAG recombinases originated from an ancient transposon. However, the sequences of ProtoRAG terminal inverted repeats (TIRs) were obviously dissimilar to the consensus sequence of mouse 12/23RSS and recombination mediated by ProtoRAG or RAG made them incompatible with each other. Thus, it is difficult to determine whether or how 12/23RSS persisted in the vertebrate RAG system that evolved from the TIRs of ancient RAG transposons. Here, we found that the activity of ProtoRAG is highly dependent on its asymmetric 5′TIR and 3′TIR, which are composed of conserved TR1 and TR5 elements and a partially conserved TRsp element of 27/31 bp to separate them. Similar to the requirements for the recombination signal sequences (RSSs) of RAG recombinase, the first CAC in TR1, the three dinucleotides in TR5 and the specific length of the partially conserved TRsp are important for the efficient recombination activity of ProtoRAG. In addition, the homologous sequences flanking the signal sequences facilitate ProtoRAG- but not RAG-mediated recombination. In addition to the diverged TIRs, two differentiated functional domains in BbRAG1L were defined to coordinate with the divergence between TIRs and RSSs. One of these is the CTT* domain, which facilitates the specific TIR recognition of the BbRAGL complex, and the other is NBD*, which is responsible for DNA binding and the protein stabilization of the BbRAGL complex. Thus, our findings reveal that the functional requirement for ProtoRAG TIRs is similar to that for RSS in RAG-mediated recombination, which not only supports the common origin of ProtoRAG TIRs and RSSs from the asymmetric TIRs of ancient RAG transposons, but also reveals the development of RAG and RAG-like machineries during chordate evolution.

18.
Let X = x1, x2, …, xn be a sequence of non-decreasing integer values. Storing a compressed representation of X that supports access and search is a problem that occurs in many domains. The most common solution to this problem uses a linear list and encodes the differences between consecutive values with encodings that favor small numbers. This solution includes additional information (i.e. samples) to support efficient searching on the encoded values. We introduce a completely different alternative that achieves compression by encoding the differences in a search tree. Our proposal has many applications, such as the representation of posting lists, geographic data, sparse bitmaps, and compressed suffix arrays, to name just a few. The structure is practical and we provide an experimental evaluation to show that it is competitive with the existing techniques.
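
The "most common solution" that this abstract contrasts itself with (a linear list of differences plus samples) can be sketched roughly as below. This is the baseline, not the search-tree structure the paper proposes. The class name `SampledGaps`, the sampling period `k`, and the use of plain Python integers instead of a variable-length code for the gaps are all simplifying assumptions.

```python
from bisect import bisect_left


class SampledGaps:
    """Non-decreasing integers stored as gaps, with every k-th value sampled."""

    def __init__(self, values, k=4):
        self.k = k
        self.n = len(values)
        # Absolute values at positions 0, k, 2k, ...
        self.samples = values[::k]
        # Gap to the previous value; unused (0) at sampled positions.
        self.gaps = [
            0 if i % k == 0 else values[i] - values[i - 1]
            for i in range(self.n)
        ]

    def access(self, i):
        """Value at position i: start at the sample, add at most k-1 gaps."""
        block = i // self.k
        v = self.samples[block]
        for j in range(block * self.k + 1, i + 1):
            v += self.gaps[j]
        return v

    def search(self, x):
        """Position of the first value >= x, or n if there is none."""
        b = bisect_left(self.samples, x) - 1   # last block whose sample < x
        if b < 0:
            return 0
        v = self.samples[b]
        end = min((b + 1) * self.k, self.n)
        for j in range(b * self.k + 1, end):
            v += self.gaps[j]
            if v >= x:
                return j
        return end


values = [2, 3, 3, 7, 10, 15, 15, 20, 31]
seq = SampledGaps(values, k=4)
decoded = [seq.access(i) for i in range(len(values))]
pos = seq.search(11)
```

A smaller `k` makes `access` and `search` touch fewer gaps at the cost of storing more samples; in a real implementation the gaps would be written with a code that favors small numbers.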

19.
Researchers have been aware that emotion is not one-hot encoded in emotion-relevant classification tasks, and multiple emotions can coexist in a given sentence. Recently, several works have focused on leveraging a distribution label or a grayscale label of emotions in the classification model, which can enhance the one-hot label with additional information, such as the intensity of other emotions and the correlation between emotions. Such an approach has been proven effective in alleviating the overfitting problem and improving the model robustness by introducing a distribution learning component in the objective function. However, the effect of distribution learning cannot be fully unfolded as it can reduce the model’s discriminative ability within similar emotion categories. For example, “Sad” and “Fear” are both negative emotions. To address such a problem, we proposed a novel emotion extension scheme in the prior work (Li, Chen, Xie, Li, and Tao, 2021). The prior work incorporated fine-grained emotion concepts to build an extended label space, where a mapping function between coarse-grained emotion categories and fine-grained emotion concepts was identified. For example, sentences labeled “Joy” can convey various emotions such as enjoy, free, and leisure. The model can further benefit from the extended space by extracting dependency within fine-grained emotions when yielding predictions in the original label space. The prior work has shown that it is more apt to apply distribution learning in the extended label space than in the original space. A novel sparse connection method, i.e., Leaky Dropout, is proposed in this paper to refine the dependency-extraction step, which further improves the classification performance. In addition to the multiclass emotion classification task, we extensively experimented on sentiment analysis and multilabel emotion prediction tasks to investigate the effectiveness and generality of the label extension schema.

20.
Background: The salivary glands of Lucilia sericata are the first organs to express specific endopeptidase enzymes. These enzymes play a central role in wound healing, and they have potential to be used therapeutically. Methods: Rapid amplification of cDNA ends and rapid amplification of genomic ends were used to identify the coding sequence of MMP-1 from L. sericata. Different segments of the MMP-1 gene, namely the middle part, 3′ end, and 5′ end, were cloned, sequenced, and analyzed using bioinformatics tools to determine the distinct features of MMP-1 protein. Results: Assembling the different segments revealed that the complete mRNA sequence of MMP-1 is 1932 bp long. CDS is 1212 bp long and is responsible for the production of MMP-1 of 404 amino acid residues with a predicted molecular weight of 45.1 kDa. The middle part, 3′ end, and 5′ end sequences were 933, 503, and 496 bp. In addition, it was revealed that the MMP-1 genomic sequence includes three exons and two introns. Furthermore, the three-dimensional structure of L. sericata MMP-1 protein was evaluated, and its alignment defined that it has high similarity to chain A of human MMP-2 with 100% confidence, 72% coverage, and 38% identity according to the SWISS-MODEL modeling analysis. Conclusions: MMP-1 of L. sericata has a close relationship with its homologs in invertebrates and other insects. The present study significantly contributes to understanding the function, classification, and evolution of the characterized MMP-1 from L. sericata and provides basic required information for the development of an effective medical bioproduct.
