Mammalian hosts have a close relationship with the microorganisms that colonize niches including the urogenital tract, skin, upper and lower respiratory tract, intestine and internal organs. Many important biological interactions and processes arise from this diverse variety of microbes, and the human microbiome is therefore emerging as an essential “organ” governing health and disease [2–4]. For example, commensal bacteria from around 500–1000 species inhabiting the skin have been reported to be involved in educating the immune system to respond to infection and injury, and in maintaining homeostatic control of skin inflammation. The presence of nearly 10^14 bacterial cells from more than 10,000 microbial species in the human internal environment provides diverse gene products that induce different biochemical and metabolic activities [6–8]. Although the massive contribution of microbes has been revealed, a detailed understanding of the mechanisms underlying host–microbe interactions and their impact on different human diseases remains largely elusive.
The composition of the endogenous microbial community changes constantly and differs from person to person owing to environmental variables such as host diet [10, 11], season, smoking, hygiene and use of antibiotics. Deviant compositions of the microbial community can cause varying degrees of damage to host tissues and in turn induce diverse diseases. The abundance distribution of microbes has also been reported to be associated with several human diseases. For example, low microbial diversity can cause obesity and inflammatory bowel disease [17, 18], while high microbial diversity in the vagina is linked to bacterial vaginosis. Pathogenic microbes can endure the selective pressures of their environment through different strategies, and such genetically distinct populations of microbes are usually regarded as contributors to diseases such as allergic asthma, colorectal carcinoma, necrotizing enterocolitis [22, 23], atopic dermatitis and psoriasis. For example, Skov et al. reported that toxins from Streptococcus and Staphylococcus aureus can function as superantigens that promote the development of guttate psoriasis by bypassing the normal control of T cell activation. Socransky et al. observed that subgingival plaque is associated with several major microbial complexes including Fusobacterium, Porphyromonas gingivalis, Prevotella and Treponema. Sze et al. also identified an increase of the Firmicutes phylum and Burkholderia in patients with very severe chronic obstructive pulmonary disease (COPD) by Pyrotag sequencing.
With the development of experimental tools such as PCR, high-throughput sequencing and MALDI-TOF mass spectrometry (MS), as well as new sampling and culture strategies, much progress has been made towards discovering the mechanisms of microbial pathogenesis and microbe–disease associations [16, 29, 30]. Although a growing number of associations between microbes and diseases have been discovered and recorded, technological hurdles still prevent the detection of microbe–disease associations on a large scale. Rather than following a ‘one-bacterium, one-disease’ model, diseases are usually caused and influenced by the dynamic interplay between host and microbe and by the complex activity of the microbial community. Because host pressures differ and microbial behavior is dynamic, experiment-based methods for identifying microbe–disease associations usually need long, densely sampled time series covering many individuals with different traits. In addition, it is still hard to verify from transcriptomics whether the host–microbe interactions involved in different diseases are accidental or obligatory.
Although the regulatory mechanisms through which microbes participate in disease are still not well understood, further efforts to identify microbe–disease associations would strengthen diagnostic and therapeutic support for the clinical management of patients. Knowledge about microbe–disease associations can provide valuable insights into complex disease mechanisms. For example, gastric and duodenal ulcers and Whipple’s disease, which were once considered noninfectious in origin, have been reclassified as infectious after the identification of the associated pathogenic organisms. In addition, knowing the disease-causing microbes can suggest new approaches to disease diagnosis and therapy. For example, fecal microbiota transplantation, which aims to rebuild a healthy microbial community by reintroducing normal flora via donor feces, has recently proved to be a safe and feasible treatment option for Clostridium difficile infection (CDI). Detecting novel microbes involved in disease development is clearly important for the application of this treatment. Predicting new microbe–disease associations is expected to select the most promising candidates for validation experiments and therefore to accelerate research and reduce cost. However, little effort has been made to develop prediction models for inferring novel microbe–disease associations. Recently, Ma et al. built HMDAD, the first database of microbe–disease associations, by manual curation of the large-scale public literature, and discovered that the microbe-based disease network overlaps strongly with disease networks constructed from genes, symptoms, chemical fragments and drugs. Specifically, HMDAD mainly focuses on non-infective diseases, which are rarely studied clinically from a microbial perspective.
In this work, we propose a neighbor- and graph-based combined recommendation model for human microbe–disease association prediction (NGRHMDA). The model rests on the assumption that functionally similar microbes tend to be involved in the development of similar diseases, which parallels the basic hypothesis of recommender systems that users with the same or similar preferences will like similar kinds of items. NGRHMDA combines two separate recommendation models: one is neighbor-based collaborative filtering, and the other is based on the topological information of the known microbe–disease bipartite graph. The model also combines symptom-based similarity and Gaussian kernel-based similarity to measure disease and microbe similarity. To evaluate the effectiveness of the proposed model, two evaluation frameworks (i.e. leave-one-out cross validation (LOOCV) and fivefold cross validation) were implemented on the HMDAD database, and the corresponding ROC curves were computed. The ensemble NGRHMDA model yielded an average AUC of 0.9023 ± 0.0031 for fivefold cross validation and an AUC of 0.9111 for LOOCV, improvements of at least 0.0169 and 0.0130, respectively, over the single models. In addition, combining the models was shown to improve the stability of the model. The results showed that additional disease similarity, such as the symptom-based similarity we explored, can improve the prediction performance of NGRHMDA, and demonstrated that the proposed model is feasible and effective for predicting potential microbe–disease associations on a large scale.
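To make the neighbor-based idea concrete, the following is a minimal sketch of how a Gaussian kernel similarity can be derived from known associations and then used for collaborative-filtering scores. It is illustrative only and is not the exact NGRHMDA formulation: the function names, the toy association matrix, the bandwidth normalization (dividing a hypothetical `gamma_prime` by the mean squared profile norm) and the weighted-average scoring rule are all assumptions made for this example.

```python
import math


def gaussian_profile_similarity(profiles, gamma_prime=1.0):
    """Gaussian kernel similarity between binary association profiles.

    profiles[i] is the 0/1 association vector of entity i.
    gamma_prime is an assumed bandwidth parameter, normalized here by
    the mean squared norm of the profiles.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("at least one known association is required")
    gamma = gamma_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


def neighbor_scores(assoc, sim):
    """Score each (microbe, disease) pair as the similarity-weighted
    average of the other microbes' known associations with that disease."""
    n_microbes, n_diseases = len(assoc), len(assoc[0])
    scores = [[0.0] * n_diseases for _ in range(n_microbes)]
    for m in range(n_microbes):
        neighbors = [k for k in range(n_microbes) if k != m]
        total = sum(sim[m][k] for k in neighbors)
        if total == 0:
            continue  # no informative neighbors; leave scores at zero
        for d in range(n_diseases):
            weighted = sum(sim[m][k] * assoc[k][d] for k in neighbors)
            scores[m][d] = weighted / total
    return scores


# Toy data (hypothetical): rows are microbes, columns are diseases.
assoc = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
sim = gaussian_profile_similarity(assoc)
scores = neighbor_scores(assoc, sim)

# Rank the diseases not yet linked to microbe 1 by predicted score.
candidates = [d for d in range(len(assoc[1])) if assoc[1][d] == 0]
ranking = sorted(candidates, key=lambda d: scores[1][d], reverse=True)
print(ranking)
```

In this toy case microbe 1 shares a disease with microbe 0, so microbe 0 receives the larger similarity weight and its other disease is ranked above the disease linked only to the dissimilar microbe 2. A graph-based recommender over the bipartite network and a symptom-based disease similarity would be separate components that this sketch does not cover.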