Biased samples, non-random distributions, and non-independence
Despite its considerable promise, iEcology faces several inherent challenges and gaps that require careful consideration. Above all, although the digital realm is continually expanding, it captures only a subset of the world, one that is non-random in both extent and depth. Because the data are not generated systematically, content generation varies greatly among users, regions, cultures, and time frames, which introduces inherent risks of bias. Individual and cultural subjectivity can further complicate data interpretation. Moreover, multiple entries of the same data by one or several users could cause biases related to non-independence. Therefore, the data underlying iEcology research should neither be treated as randomly distributed nor used in raw form without addressing these issues. Rather than being ignored, these properties merit direct investigation: examining the non-random distribution and the level of non-independence can provide further insight into data structure and into any patterns discovered.
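As one illustration of addressing non-independence, the sketch below collapses repeated entries made by the same user for the same species, on the same day, within the same grid cell. The record layout, the field names, and the 0.1-degree cell size are assumptions made for this example only; real sources will need their own definition of what counts as a repeat.

```python
import math
from collections import namedtuple

# Hypothetical record layout; real iEcology sources will differ.
Record = namedtuple("Record", ["user_id", "species", "date", "lat", "lon"])


def collapse_repeats(records, cell_size_deg=0.1):
    """Keep one record per (user, species, date, grid cell).

    Repeated uploads by the same user on the same day within the same
    grid cell are treated as a single observation. The default cell size
    is an arbitrary illustrative choice, not a recommended value.
    """
    if cell_size_deg <= 0:
        raise ValueError("cell_size_deg must be positive")
    seen = set()
    unique = []
    for rec in records:
        # Integer grid indices identify the cell the record falls in.
        cell = (math.floor(rec.lat / cell_size_deg),
                math.floor(rec.lon / cell_size_deg))
        key = (rec.user_id, rec.species, rec.date, cell)
        if key not in seen:
            seen.add(key)
            unique.append(rec)  # the first occurrence is retained
    return unique
```

Comparing the number of records before and after such a step also gives a rough indication of how much non-independence a data set contains.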
Cross-source validation, ground-truthing, and cross-referencing
Several approaches, many already established in other fields that rely on online data, can be used to tackle the challenges of biased samples, non-random distributions, and non-independence. Validation against common and reputable sources such as systematic surveys, remote sensing, and citizen science (i.e., ground-truthing) can reduce the associated uncertainty and reinforce confidence in the data and their interpretation. This is particularly important when testing new tools or approaches. The vast majority of iEcology studies have used multiple data sources to validate results, including data from field research, citizen science, online databases, scientific literature, or a combination of these. In most cases, authors report a satisfactory to excellent level of consistency among data sources. When ground-truthing is difficult, as is often the case, other metrics could be developed to assess data robustness. We also strongly advocate cross-referencing results across multiple iEcology data sources, which are often readily available within the digital realm, to test the consistency of patterns. Furthermore, culturomics can provide critical support for understanding the societal perceptions, interests, and values that affect the process of data generation.
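One simple way to quantify consistency between an iEcology-derived series and a ground-truth series is a rank correlation. The sketch below computes Spearman's rank correlation using only the standard library. The choice of Spearman's coefficient is an assumption made for illustration, not a metric prescribed here, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

For instance, `x` could hold monthly counts of online records of a species and `y` the corresponding counts from a systematic survey. A coefficient near 1 would suggest that both sources rank the months similarly, although it says nothing about agreement in absolute numbers.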
Misidentification, misclassification, and automated validation
Correct taxonomic identification may be a greater cause for concern in iEcology than in traditional ecological research. Errors can arise at several levels, from species misidentification by data producers to the difficulty experts face when identifying species from a limited number of images or videos of an individual organism. Automated classification of species also generates misidentifications. Such embedded errors could also arise in other types of ecological data, such as life history traits, behavior, and abiotic variables. However, we expect that as iEcology sources increase in size and the methods to validate them improve, so will the ability to identify the extent and type of such problems in the data. We also suggest assigning a ‘validity’ attribute to data, which can be non-binary and dependent on the contributor’s reputation and the likelihood of an observation, as is currently practiced on some citizen-science platforms.
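A non-binary validity attribute of this kind could take many forms. The sketch below is a minimal example that assumes both the contributor's reputation and the likelihood of the observation have already been expressed as numbers between 0 and 1, and combines them as a weighted average. The function name, the inputs, and the weighting scheme are all hypothetical and do not describe how any particular citizen-science platform scores records.

```python
def validity_score(contributor_reputation, observation_likelihood, weight=0.5):
    """Combine two values in [0, 1] into a validity score in [0, 1].

    A weighted average is an illustrative assumption; other ways of
    combining the two factors may be more appropriate in practice.
    """
    inputs = (
        ("contributor_reputation", contributor_reputation),
        ("observation_likelihood", observation_likelihood),
        ("weight", weight),
    )
    for name, value in inputs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (weight * contributor_reputation
            + (1.0 - weight) * observation_likelihood)
```

Downstream analyses could then filter on a threshold or use the score as a weight, rather than discarding uncertain records outright.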
Collaborations
iEcology research would greatly benefit from collaborative efforts and the sharing of data, resources, and tools. These could be aided by developing specific metadata standards for sharing such data, which could include the APIs and the specific machine-learning algorithms used to extract or manipulate the data. Such developments could draw on efforts already under way at large ecological databases (e.g. GBIF) to develop comparable standards, which would make ecological data more interoperable.
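To make the idea of such metadata concrete, the sketch below builds a small record describing one data retrieval and serializes it as JSON. It is not an existing standard: the class name, the field names, and the choice of a SHA-256 checksum for the raw payload are assumptions made purely for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievalRecord:
    """Hypothetical metadata for one data retrieval; fields are illustrative."""
    source: str            # platform or database that was queried
    api_version: str       # API version reported by the provider
    query: str             # exact query string that was sent
    retrieved_on: str      # ISO 8601 date of access
    processing_model: str  # algorithm or model used to extract or label data
    payload_sha256: str    # checksum of the raw response


def describe_retrieval(source, api_version, query, retrieved_on,
                       processing_model, payload: bytes) -> str:
    """Return a JSON metadata record, including a checksum of the payload."""
    record = RetrievalRecord(
        source=source,
        api_version=api_version,
        query=query,
        retrieved_on=retrieved_on,
        processing_model=processing_model,
        payload_sha256=hashlib.sha256(payload).hexdigest(),
    )
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Storing the checksum alongside the query lets others verify that they are working from the same raw response, even when the response itself cannot be redistributed.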
iEcology repositories could be either centralized or remain decentralized, with benefits associated with both options. Either way, we emphasize that good record keeping and high metadata standards are of particular importance to iEcology.
Reproducibility, transparency, and open-source science
Other considerations regarding iEcology data sources involve interpretation and reproducibility. Some sources lack transparency in the way the data were produced and manipulated (e.g. search engines such as Google). The inability to publish raw data (owing to provider guidelines) could also conflict with journal policies that require making these data available. Furthermore, some sources lack stability in data scope, underlying algorithms, and access options. These are inherent issues with many online sources. To alleviate these concerns, we advocate:
1) good record keeping of protocols for data access, handling, versioning, and analysis.
2) harmonization of methods and standardization of metadata.
3) publishing raw data in freely accessible and stable repositories together with associated scripts.
4) use of open-source data and software.
5) keeping up-to-date with methodologies developed in other relevant fields for assessing and addressing such issues.
Ethical issues
iEcology research may give rise to several ethical issues, pertaining to both people and nature. Data shared online, especially on social media platforms, sometimes include explicit personal information, while implicit information could also be used to identify individuals or to extract sensitive information. Therefore, the privacy of individuals and their identifiers should be maintained in both data repositories and iEcology outputs, adhering to the highest ethical standards. Moreover, data sources that include precise information on locations and other key attributes of rare or endangered species could increase their exposure to poachers and collectors. This threat could be alleviated by either restricting access to data on species deemed at risk, or limiting precision of open-access information. In general, servers holding iEcology data should be securely maintained to avoid such abuse.
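Two of these safeguards, limiting the precision of location data and protecting personal identifiers, can be sketched in a few lines. The example below snaps coordinates to the centre of a grid cell and replaces user identifiers with a keyed hash (HMAC-SHA-256). The function names, the 0.1-degree cell size, and the use of a keyed hash are assumptions made for illustration. Keyed hashing is a form of pseudonymization rather than full anonymization, and an appropriate level of coarsening depends on the species and context.

```python
import hashlib
import hmac
import math


def coarsen_coordinate(value_deg, cell_size_deg=0.1):
    """Snap a coordinate (in degrees) to the centre of its grid cell.

    The default cell size is an arbitrary illustrative value.
    """
    if cell_size_deg <= 0:
        raise ValueError("cell_size_deg must be positive")
    return (math.floor(value_deg / cell_size_deg) + 0.5) * cell_size_deg


def pseudonymize(user_id, secret_key: bytes):
    """Replace a user identifier with a keyed SHA-256 digest.

    The same identifier and key always give the same pseudonym, so
    records from one user can still be grouped. The key must be kept
    private; anyone holding it could test guesses of the original
    identifier against the pseudonyms.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

Applying both functions before data leave a secured server means that open-access copies contain neither exact localities nor raw user identifiers.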