Mission And Goals
The mission of the Asian American CV19 Archive Project is to document and keep a historical record of bias,
xenophobia, displacement, and injustice against the Asian American and Asian community during the Covid-19 pandemic, and to share that
data with all communities via open source technologies, where no single entity or individual owns the historical archive, helping to
ensure that it can never be erased.
In order to accomplish this mission, the main goals of the archive project are as follows:
- Gather lists of news articles, blog posts, and other electronic spaces specific to its mission
- Archive each piece of media with (see the sketch after this list):
- URL information for the existing source
- Date of publication
- Main image and thumbnail storage
- An HTML archive of the original source
- A PDF archive of the original source
- Distribute and share the information using open source technology, so individuals or organizations can create mirrors or "forks" of the main
data to use as they wish, or in collaboration with the Asian American CV19 Archive Project or any other entity or individual. Just as important
as keeping a historical record during this time is the project's belief that the information should not be owned by any one individual.
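As a rough illustration of the record structure described above, a single archive entry might look like the following Python sketch. Every field name here is an illustrative assumption, not the project's actual schema.

    import json

    # A hypothetical archive record; field names are assumptions for illustration.
    record = {
        "id": "2020-03-15-example-incident",      # shared reference name for all assets
        "url": "https://news.example.com/story",  # URL information for the existing source
        "published": "2020-03-15",                # date of publication
        "title": "Example headline",
        "image": "images/2020-03-15-example-incident.jpg",  # main image
        "html_archive": "html/2020-03-15-example-incident.html",
        "pdf_archive": "pdf/2020-03-15-example-incident.pdf",
    }

    print(json.dumps(record, indent=2))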
Data Collection (Information And FAQ)
How is data collected?
Data (articles, posts, and associated media) is collected via three main methods:
- Automated News Collector Bot: This runs daily and collects information from various sites and organizations (a sketch of this approach appears after this list).
- Manual Entry: Data is also entered manually by editors.
- PDF Conversions: While HTML archives are captured right away during the data collection step,
PDFs are created in a separate process.
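The project's collector bot source is not reproduced here, but a minimal sketch of automated collection might look like the following, assuming the requests and BeautifulSoup libraries; the function name and returned fields are illustrative.

    import requests
    from bs4 import BeautifulSoup

    def collect_article(url: str) -> dict:
        """Fetch a page, pulling basic metadata plus the raw HTML for archiving."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        def meta(prop: str):
            tag = soup.find("meta", attrs={"property": prop})
            return tag["content"] if tag and tag.has_attr("content") else None

        return {
            "url": url,
            "title": meta("og:title") or (soup.title.string if soup.title else None),
            "published": meta("article:published_time"),  # not every site exposes this
            "image": meta("og:image"),                    # may be missing (see FAQ below)
            "html": resp.text,                            # raw HTML archive
        }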
How is the decision made about what content is entered into the archive?
- After data is collected via the automated news collector bot OR manual entry, it is reviewed by an editor before going
into the final archive. The editor checks the content to ensure that it is relevant to the mission of the archive project.
Are there multiple entries for one topic?
- The short answer is that one topic can have multiple entries if that topic is being reported on by a number
of different sources. As an example, a news organization in one state may report on an incident while a personal blogger in that same state
may report on it from their own point of view. At the same time, the archive does its best to ensure that only one entry exists in cases of syndicated
content (e.g. an AP news article), as sketched below.
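The project's actual de-duplication logic is not documented here, but a simple check for syndicated re-posts might key on a normalized headline, as in this illustrative sketch:

    import re

    def normalize_title(title: str) -> str:
        """Lowercase and strip punctuation/whitespace so syndicated copies match."""
        return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

    seen = set()

    def is_duplicate(title: str) -> bool:
        key = normalize_title(title)
        if key in seen:
            return True
        seen.add(key)
        return False

    # Example: two outlets running the same AP story under slightly different headlines
    print(is_duplicate("AP: Incident Reported In City"))   # False (first copy is kept)
    print(is_duplicate("AP: incident reported in city!"))  # True (syndicated duplicate)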
Why don't all entries contain an image?
- Either the automated bot was not able to obtain that information via metadata, or the content had no image to collect (in the case of
manual entries). In some cases an editor takes an image of a portion of the article to be used for that specific content. When time allows,
the archive is working to fill in records without images, as it believes images are an important part of the historical document.
Why don't all entries contain an HTML or PDF archive?
- Content that uses client-side scripts to display and control data can sometimes be difficult to reproduce, because the
content is not static but dynamic in nature and dependent upon other information (via a user's browser).
One specific example is any data in the archive from NBC
News (HTML and PDF archives are not available for those sources). Because this source displays content using a number of different techniques,
when an automated bot tries to access the information, the data is not available to archive (resulting in a 404 error). The archive is
doing its best to update its news collector bot to better handle these types of situations and, when
time allows, to update those archives manually.
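As an illustration only (not the bot's actual code), a collector might detect such failures and queue the entry for manual archiving like this:

    import requests

    def fetch_for_archive(url: str, manual_queue: list) -> str | None:
        """Try to fetch a page; on failure, flag the URL for manual archiving."""
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()      # raises on 404 and other HTTP errors
        except requests.RequestException:
            manual_queue.append(url)     # an editor archives this one by hand
            return None
        return resp.text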
Why does the HTML or PDF archive look different than the original source document?
- Just as some online content can be difficult to archive, the same can be said for the content's layout: how it is shown to the
user after a browser has rendered the data can differ from the raw code and data itself (as seen by a bot). The main purpose of
archiving the HTML is to be able to search that data if needed, while the PDF archive aims to reproduce a more fully rendered browser page.
How come some articles have a PDF link but there is no PDF found?
- Because PDFs are created in a separate process (using the HTML archive), the PDF link may exist before the actual content is available.
Integrating this step directly into the first phase of the data collection process is currently being looked at.
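The project's PDF tooling is not specified, but one common way to render an archived HTML file to PDF in Python is the WeasyPrint library, sketched below; treat the library choice and file paths as assumptions.

    # pip install weasyprint  (assumed tooling; the project's actual converter is unspecified)
    from weasyprint import HTML

    def html_archive_to_pdf(html_path: str, pdf_path: str) -> None:
        """Render a saved HTML archive into a PDF archive."""
        HTML(filename=html_path).write_pdf(pdf_path)

    html_archive_to_pdf("html/2020-03-15-example-incident.html",
                        "pdf/2020-03-15-example-incident.pdf")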
When is data archived?
- While data is collected on a daily and ongoing basis, the main purpose is
to gather and archive data to maintain a historical record; entries are reviewed and then archived at intervals throughout the month.
How long does the archive go back with data?
- Currently the archive has content going back to late February, and as time allows, gaps during specific date ranges will be back-filled.
Don't the sources own their data? How can others "own" the data as well?
- The Asian American CV19 Archive Project is based, ideologically, on the same tenets as search engines, journalistic organizations, Fair Use,
and the ideas behind the open source movement.
The Asian American CV19 Archive Project makes no revenue from its community service, and makes the data archive
available for free via its site and the corresponding GitHub repository.
Archives are specifically kept because sites or content can be taken down, whether
for editorial reasons or because organizations close their doors (and subsequently their online services) for any variety of reasons, in which case
that content can disappear.
As the mission of the Asian American CV19 Archive Project is to ensure an accurate historical snapshot
of bias, xenophobia, displacement, and injustice against the Asian American and Asian community during the Covid-19 pandemic, it
is paramount that the project archives data so this record may never be erased and forgotten.
At the same time, the Asian American CV19 Archive Project is just that--an archival and data gathering project, where the original
owners and authors of the content being aggregated retain, as they should, all copyrights to their specific content, data, etc.
When searching in the "Data View", does it search the full HTML archive?
- At this time, the search only covers the data shown. It does not search the full document content. In the future, a general search of the
archived material may be added to the site.
Why do some Site values have an organization name and others have a top-level URL?
- This is due in part to how the data was collected and what is available in the metadata for that particular document.
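For illustration, a collector might prefer the og:site_name metadata when present and fall back to the top-level URL otherwise, which would produce exactly this mix of values. The code below is an assumption, not the project's actual logic.

    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    def site_value(html: str, url: str) -> str:
        """Prefer the organization name from metadata; fall back to the domain."""
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("meta", attrs={"property": "og:site_name"})
        if tag and tag.has_attr("content"):
            return tag["content"]        # e.g. "NBC News"
        return urlparse(url).netloc      # e.g. "news.example.com"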
Data Sharing And GitHub Repository Information
While the current site has both data and a user interface for consumption by the general community, GitHub is used to share all the data
from the archive, providing a way to download it, copy it, and make completely new repositories of the data.
What is GitHub? Isn't this more for software development and code?
- GitHub originated with, and is still used primarily by, developers and software engineers to share and collaborate on projects. A developer
of an application can make their code available to ALL developers on GitHub to freely use and make their own, while also being able to interact
with the main project repository by creating requests to add code back into the main application via a controlled (and easy) set of tools and
processes. Using these same tenets of software development, the Asian American CV19 Archive Project is using GitHub and its project repository in much
the same way, where any individual or organization can:
- Download the archive and all the data
- Create a "Fork" of the original project, where that individual or organization then owns
that fork--essentially the archive and all of the data
- Collaborate via Pull Requests from newly forked repositories to include new data in the archive
What data is available via the GitHub repository, and in what format?
- Listing of Archived Documents: Currently, as only non-social media listings are being archived, the main listing of archived documents
is stored in one document, formatted as JSON. JSON was chosen as the main listing data format because it is widely supported by
back-end systems and data sources (a sketch of reading this layout follows the list).
- Associated Image Data: This data is stored as JPGs, PNGs, or GIFs in its own directory, where the file name correlates to the
main JSON document listing reference.
- HTML Archives: These are all stored as HTML files in their own directory, where the name correlates to the main JSON document listing reference.
- PDF Archives: These are all stored as PDF files in their own directory where the name, like all other assets, correlates to the main
JSON document listing reference.
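As a rough sketch of consuming this layout (the listing file name, directory names, and field names here are assumptions, not the repository's documented structure):

    import json
    from pathlib import Path

    repo = Path("archive-repo")  # hypothetical local clone of the repository
    listing = json.loads((repo / "listing.json").read_text())  # main JSON listing (name assumed)

    for entry in listing:
        ref = entry["id"]  # shared reference name correlating all assets (field name assumed)
        html = repo / "html" / f"{ref}.html"
        pdf = repo / "pdf" / f"{ref}.pdf"
        images = list((repo / "images").glob(f"{ref}.*"))  # jpg/png/gif
        print(ref, html.exists(), pdf.exists(), len(images))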
Please see the GitHub repository for more information on data, formats, and retrieval of the archive.
What is the URL of the GitHub repository?
When is data updated in the GitHub repository?
- Data is updated from asianamericancv19archiveproject.org to GitHub at least twice per month, though it can be as often as once per week.