Additional Clarity on Size Field #101
Comments
It's unclear to me what the purpose of the size field is. Especially when working with APIs and web services, 'size' depends on the specific request for (a subset of) the data and the format the data is returned in. Is it the size of a zipped file that's made available, or of the unzipped data? What is the size of the Landsat archive (millions of images collected over several decades) vs. a picture of NDVI generated from this archive (on the fly, as part of the web service request) for a small portion of the US? |
Good point - I thought about this myself. Government still distributes tons of data via raw files, probably way more often than via APIs or web services. In many circumstances, raw file access is probably the best way to do this, and accessURL (which size is linked to) is inclusive of direct download to raw files. When accessURL is linked to a raw file, I would say that showing the size of the file is a good practice. If, as an extreme example, we linked to a gigantic zip file of the entire Landsat archive (but probably more realistically something like FDA SPL data, which is distributed in CSV), people should know the size of the file before they click on it, in particular if someone on a mobile connection wants to quickly check out some tabular data but doesn't realize the file is actually 300,000 records in a 40 MB file or something. As a side note, this is only useful if size changes as the file itself changes, which would necessitate either human intervention or server-side automation by agencies to update on a regular basis. |
Just your last point would make me concerned about relying on the currency/accuracy of the size attribute whenever I see it. People just don't (manually) update this type of metadata. This is speaking from over 10 years working with Geospatial One-Stop and Data.gov in the US and various National Spatial Data Infrastructures globally. On the web, whenever I click a link to download a file, my browser tells me how many bytes I'm about to download. That's directly associated with the actual file/stream/thing I'm about to download. Isn't that enough information for someone to decide to continue or not? You describe a use case where someone is on a mobile device wanting to get some data. Do you know if there's an activity related to Data.gov to collect/define/design the various use cases? What IS the expected use of Data.gov in that sense? Are there apps (mobile/web/desktop/...) that people are building using datasets/services found at Data.gov that would then be used for the things you describe? Would those apps be findable at Data.gov? |
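The point above, that the download itself already advertises its size, can be sketched in code. This is a minimal illustration, not part of the schema discussion: the function names `content_length` and `remote_size` are hypothetical, and it assumes the server answers a HEAD request with a `Content-Length` header. Many servers omit that header for chunked or dynamically generated responses, which is consistent with the earlier observation that size is poorly defined for APIs.

```python
from typing import Mapping, Optional
import urllib.request


def content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length header as an int, or None if absent or invalid."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-length":
            try:
                size = int(value.strip())
            except ValueError:
                return None
            return size if size >= 0 else None
    return None


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and report the size the server advertises, in bytes."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return content_length(dict(response.headers.items()))
```

A catalog harvester could call something like `remote_size(access_url)` on a schedule instead of relying on a hand-maintained value, falling back to "unknown" whenever it returns `None`.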
I agree with your point that this is not something we can reasonably expect people to update manually, hence my point about it (hopefully) being automated. I'm not aware of an activity for data.gov to collect use cases, though I believe http://next.data.gov/ is hoping to achieve that to some extent. Hopefully one of the authors of the schema can chime in here on why they felt size should be included. I'm with you on a lot of your points, and I admit that mobile downloads of data are a pretty edge use case, and most other use cases I can think of (bandwidth-constrained, bandwidth-capped, or storage-constrained environments) would be negated by the fact that we don't really have a way of ensuring this value is correct in the first place. |
The field SIZE was used in the standard Data.gov Metadata template in the With regards to changing the name of the field: As I age, I am becoming On Wed, Jul 31, 2013 at 1:41 AM, Sean Herron [email protected]:
Marion A. Royal PMP |
@MarionRoyal: Thanks for the background. In regards to converting size to recognize bytesize, I'm imagining that the schema ought to mandate that values be given in bytes rather than just allowing byte values; otherwise we still have the issues I brought up in the original post, right? |
@seanherron: If you are asking me if I think we also need a sizeUnit, I so, I would agree with changing the field name to byteSize (since it On Wed, Jul 31, 2013 at 11:19 AM, Sean Herron [email protected]:
Marion A. Royal PMP |
I like @MarionRoyal's idea to adopt DCAT:byteSize -- but flag that not everyone knows what a byte is, so we should link to some sort of basic calculator folks can use to convert from more-familiar KB/MB/GB. This topic has come up a lot. Ultimately, the size field is not a deeply reliable measure if we are asking people to populate it by hand, because file size changes if so much as a punctuation mark is edited in the source file, and is largely meaningless when applied to APIs, as outlined above. I think those of us who are more technical appreciate this, but we could stand to be clearer to the less-technical folks that they should not be using this field for any sort of precise measurement or for compliance purposes. Since it's not precise, I am less inclined to make it fully machine-readable. Separate size and sizeUnit fields seem like overkill, because if you're machine-reading you can probably also automatically calculate files' true sizes. |
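The "basic calculator" mentioned above for converting familiar KB/MB/GB values to bytes could look roughly like the following. This is a sketch under stated assumptions: the name `to_bytes` is hypothetical, and it assumes binary multiples (1 KB = 1024 bytes). A catalog that prefers decimal multiples (1 KB = 1000 bytes) would swap the table.

```python
import re

# Assumption: binary multiples. Use powers of 1000 instead for decimal units.
_UNITS = {
    "B": 1,
    "KB": 1024,
    "MB": 1024 ** 2,
    "GB": 1024 ** 3,
    "TB": 1024 ** 4,
    "PB": 1024 ** 5,
}
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGTP]?B)\s*$", re.IGNORECASE)


def to_bytes(text: str) -> int:
    """Convert a string such as '40 MB' or '1.5kb' to a whole number of bytes."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return round(float(number) * _UNITS[unit.upper()])
```

For example, `to_bytes("40 MB")` gives 41943040 under the binary assumption.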
I think the only thing that scales at the relatively crude level of discovery metadata currently being discussed is to do as @MarionRoyal suggests and leave it as a rough textual notification to downstream users. Best practice would be to include some type of units or explanation in the attribute so a human reading it might have a clue on what they are getting into. Otherwise, we'd need to look across various standards on how the magnitude of a given asset might be described and account for all the specifics. |
+1 for keeping this more textual and the use of letters K, M, G, T, P. I'm envisioning the spectrum of catalog creators and think that the low bar is appropriate here. I also don't think there'll be many use cases for machine-consumption of this field. If so, wouldn't it then be best to stick with filesize so as to avoid the need for everyone to go to a filesize catalog each time? |
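If the field stays textual with the letters K, M, G, T, P, a publisher could still generate that text from a measured byte count rather than typing it by hand. The helper below is only an illustration: `human_size` is a hypothetical name, and the one-decimal rounding and binary multiples are assumptions, not anything the schema prescribes.

```python
def human_size(num_bytes: int) -> str:
    """Render a byte count as rough text such as '40.0 MB' (binary multiples)."""
    if num_bytes < 0:
        raise ValueError("size cannot be negative")
    if num_bytes < 1024:
        return f"{num_bytes} B"
    value = num_bytes / 1024
    for letter in ("K", "M", "G", "T"):
        if value < 1024:
            return f"{value:.1f} {letter}B"
        value /= 1024
    # Anything larger is reported in petabytes.
    return f"{value:.1f} PB"
```

This keeps the low bar for catalog creators while letting anyone with automation fill the field consistently.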
It seems like we're all in agreement that the field isn't particularly useful or relevant, so I'm going to go against my original idea and say we just leave it as is to avoid complication. Maybe in the future, if we look to pare down the schema, this would be a good field to deprecate. |
Is this a duplicate of #55? |
Yes, looks like it. I can close this and reference 55 if you'd like. Didn't come across it when I was posting. |
@seanherron I've only skimmed the discussion in this thread, but makes sense! |
The size field, which is intended to represent the file size of the resource, has many potential areas for improvement.
First off, we should either decide on a standard unit of measurement (like dcat:byteSize) or break out the unit of measurement from the numeric value (e.g. two fields: size (numeric value) and sizeUnit (unit of measurement)). This would allow for machine reading of the value, enabling users to sort and filter by the size of the resource, as well as reducing confusion when multiple standards of measurement are used.
Secondly, what is the rationale behind the cardinality enabling multiple values? Is this in relation to the possibility of having multiple accessURLs? If so, how do we draw a link between a specific accessURL and its size? We could specify that they be represented in order? I'm not a huge fan of that approach but I can't think of a better way to do it.
Finally, I think this should be renamed to either bytesize (if it is represented in bytes) or filesize (if represented in other units of measurement). The rationale is that size can be interpreted to mean a variety of things (e.g. size of geographic area covered, number of rows of data, etc.). Bytesize or filesize makes the meaning clear.
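To illustrate the first point above, here is roughly what sorting and filtering would look like once size is a plain number of bytes. Everything in this sketch is hypothetical: the `title` and `byteSize` keys, the sample records, and the `filter_and_sort` helper are invented for illustration and are not taken from the schema.

```python
from typing import Dict, List, Optional, Union

Record = Dict[str, Union[str, int]]

# Hypothetical catalog entries; the last one has no usable size (e.g. an API).
SAMPLE: List[Record] = [
    {"title": "large archive", "byteSize": 5_000_000_000},
    {"title": "small table", "byteSize": 41_943_040},
    {"title": "unsized API"},
]


def filter_and_sort(records: List[Record],
                    max_bytes: Optional[int] = None) -> List[Record]:
    """Keep records with an integer byteSize, optionally cap it, sort ascending."""
    sized = [r for r in records if isinstance(r.get("byteSize"), int)]
    if max_bytes is not None:
        sized = [r for r in sized if r["byteSize"] <= max_bytes]
    return sorted(sized, key=lambda r: r["byteSize"])
```

With free-text values like "40mb" or "about 5 gigs", the same query would first need parsing and guesswork, which is the confusion a single numeric unit avoids.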